don't look now

A recent trend gives cause for concern: Companies are using their big data repositories — Hadoop, Storm, "DBs," "Bases," etc. — as archival storage, turning them into little more than glorified network shared drives. The industry even has a term for it: data lakes.

We've already maintained here that data lakes are the new "Roach Motel": your data checks in, but it never checks out. The problem? Metadata and content errors become permanent in the data lake, and the files themselves are forgotten.

Let’s Talk Data Quality

In spite of the volume of data we’re looking at in discussions of big data, generally speaking, humans don’t have to do much to enhance the quality of the content. The sources feed it in a relatively predictable format. For example, if you wanted to log phone calls, you’d know you’d have a caller and recipient number, a timestamp and a duration: all kinds of predictable, standard fields and formats. Even with server logs or output from other instrumentation, the content is pretty regular and predictable. We call this "easy data" for searching.

But big data apps bring with them issues of data quality. One of the oldest sayings in programming should be applied to the big data context: “Garbage In, Garbage Out.”

Enterprise content is different. While it originates in only a handful of formats, it has virtually no standard for metadata. And even formats that include a standard set of fields, like Microsoft Office documents, have issues with bad values. As my article “Sixty Guys Named Sarah” says, “poor metadata makes even the best of search platforms perform poorly.”

Small errors can be easy to ignore or average away in numerical processing. But once you start looking at a large number of data errors, it becomes a question of risk.

Let’s Talk Discoverability

Beyond the immutability of metadata errors in a data lake, the files themselves become forgotten. Your search team may index the repository, but these “file shares” are rarely part of standard enterprise search instances. And forgotten content can be risky: documents may turn up in discovery that shouldn’t have, or be held beyond their retention period. Both of these are liabilities you want to avoid.

Because this forgotten content is sometimes searchable only in archival repositories, it is hard to find. And when people turn up that hard-to-find document, what happens? I personally save a copy on my laptop or to my TeamSite so I can find it quickly the next time I need it. The document multiplies. And maybe one of the copies gets changed. Which means the right version is harder to search for and harder to find.

What’s my solution? A few points:

Before archiving content, make sure it’s got accurate metadata. If you don’t have a process for adding metadata to the original, consider using one of the open source or commercial tools that will identify names of people, places and things in the body of the document, and add that accurate metadata to your content.
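As a rough illustration of a pre-archive check, here is a minimal Python sketch. The required field names and the list of placeholder values are assumptions made up for this example, not a standard; substitute whatever your own content schema calls for.

```python
from typing import Dict, List

# Hypothetical required fields and junk values; adjust to your own schema.
REQUIRED_FIELDS = ("title", "author", "created")
PLACEHOLDER_VALUES = {"", "unknown", "untitled", "user", "administrator"}


def metadata_problems(metadata: Dict) -> List[str]:
    """Return a list of problems; an empty list means the document looks OK to archive."""
    problems = []
    for name in REQUIRED_FIELDS:
        # Treat a missing key or a None value the same as an empty string.
        value = str(metadata.get(name) or "").strip()
        if value.lower() in PLACEHOLDER_VALUES:
            problems.append(f"{name}: missing or placeholder value {value!r}")
    return problems
```

A document would go to the archive only when `metadata_problems` returns an empty list; anything else goes back to a person or an enrichment tool first.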

Make it easy to search all of your content repositories. Modern search platforms can reduce the relevance of certain repositories, but it may be better to associate archived content using “archived” as a tag or facet. Consider providing archived results if a user query produces no hits on active repositories.
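The fallback idea can be sketched in a few lines of Python. This is a toy stand-in for a real search platform: the in-memory lists, the `matches` helper and the `facet` field are all hypothetical names chosen for illustration.

```python
from typing import Dict, List


def matches(query: str, doc: Dict) -> bool:
    """Toy matcher: every query term must appear in the document text."""
    text = doc["text"].lower()
    return all(term in text for term in query.lower().split())


def search_with_archive_fallback(
    query: str, active: List[Dict], archived: List[Dict]
) -> List[Dict]:
    """Search active content first; consult the archive only on zero hits."""
    hits = [dict(doc, facet="active") for doc in active if matches(query, doc)]
    if hits:
        return hits
    # No active hits: return archived matches, clearly tagged as such.
    return [dict(doc, facet="archived") for doc in archived if matches(query, doc)]
```

Tagging each result with its facet lets the results page show users that a hit came from the archive rather than from current content.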

Finally, put processes in place to review content access rights, document retention, questionable content and security tags in your active content to prevent running into any legal, ethical or moral problems.
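Part of a retention review can be automated. The following Python sketch flags documents held past their retention period; the `created` and `retention_days` fields are assumptions for this example, and a real process would also need to cover access rights and security tags.

```python
from datetime import date, timedelta
from typing import Dict, List


def past_retention(documents: List[Dict], today: date) -> List[str]:
    """Return the ids of documents whose retention period has already expired."""
    overdue = []
    for doc in documents:
        # Hypothetical fields: 'created' is a date, 'retention_days' an int.
        expires = doc["created"] + timedelta(days=doc["retention_days"])
        if today > expires:
            overdue.append(doc["id"])
    return overdue
```

Passing `today` in as a parameter, rather than reading the clock inside the function, keeps the check repeatable when you audit it later.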

Do that and you just might succeed at checking out from the roach motel.

Title image by Anne Worner, Creative Commons Attribution-Share Alike 2.0 Generic License