You Can Check In, but You Can't Check Out

If you have anything to do with computer technology larger than an iPad, you know that "big data" is the "dot com" for a new generation of programmers, analysts and venture capitalists. 

Hadoop and its family of tools serve as an ideal data store for pure number crunching, without the overhead of SQL. For those data scientists who study the raw data to identify trends, spot anomalies and otherwise enable data-based decisions, Hadoop is a solid tool.

Written in Stone

Hadoop includes two main components: MapReduce, a programming model for processing large data sets in parallel across a cluster; and the Hadoop Distributed File System, or HDFS.

The MapReduce process works on data in much the same way a search platform does when it creates an index, though in my opinion not as effectively.
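To make the comparison with indexing concrete, here is a minimal sketch in plain Python of the map, shuffle and reduce steps building a tiny inverted index. It does not use Hadoop at all; the function names and the sample documents are my own assumptions, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair for every word in the document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group the emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, doc_ids):
    """Collapse the grouped values into a sorted posting list without duplicates."""
    return term, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Run map, shuffle and reduce in sequence over a dict of doc_id -> text."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


# Hypothetical sample documents, for illustration only.
docs = {"d1": "big data lake", "d2": "data store", "d3": "big data store"}
index = build_inverted_index(docs)
```

The result maps each term to the documents that contain it, which is roughly what a search platform builds at index time.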

HDFS is, in fact, a file system. It uses commands very much like those you find in Linux distributions, but with the bonus of more typing. If you know Unix or Linux shell commands, you’ll feel right at home. The problem is you’ll keep forgetting to precede every command with "hadoop fs" and a hyphen. Consider the commands to list files in each system:

  • Linux: ls -l
  • HDFS: hadoop fs -ls
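If the extra typing bothers you, a small wrapper can add the prefix for you. The sketch below only builds the argument list and does not run anything. The helper name, the short list of supported commands and the example path are my own assumptions, and option flags are left out on purpose because they do not always carry over from Linux.

```python
SUPPORTED = {"ls", "cat", "mkdir", "rm", "du"}


def hdfs_command(name, *paths):
    """Build the `hadoop fs` argument list for a familiar Linux-style command.

    Only a handful of command names are mapped here; this is an
    illustrative subset, not a complete table.
    """
    if name not in SUPPORTED:
        raise ValueError(f"no mapping sketched for {name!r}")
    return ["hadoop", "fs", f"-{name}", *paths]


# "/user/data" is a hypothetical path.
listing = hdfs_command("ls", "/user/data")
```

On a machine with Hadoop installed, the resulting list could be handed to subprocess.run to execute the command.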

You can probably get used to preceding every command with two extra keywords. What you might have a problem with is that HDFS is “immutable.” This means you cannot edit a file in place after it has been written: no updates. Hope is on the horizon in the form of other big data tools like Hive, but for now… it’s written only once.
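Here is a toy in-memory model of what write-once storage feels like from the caller's side. It is not the HDFS API. The class, its method names and the file path are assumptions made up for this sketch.

```python
class WriteOnceStore:
    """Toy in-memory model of write-once storage (not the HDFS API)."""

    def __init__(self):
        self._files = {}

    def create(self, path, data):
        """Write a new file; refuse if the path already holds one."""
        if path in self._files:
            raise FileExistsError(f"{path} already exists and cannot be overwritten")
        self._files[path] = data

    def read(self, path):
        return self._files[path]

    def delete(self, path):
        del self._files[path]

    def replace(self, path, new_data):
        """The only way to 'edit': delete the old file, then write a complete new one."""
        self.delete(path)
        self.create(path, new_data)


store = WriteOnceStore()
store.create("/docs/report.txt", b"draft")

try:
    store.create("/docs/report.txt", b"final")  # an in-place update attempt
    overwrite_rejected = False
except FileExistsError:
    overwrite_rejected = True

store.replace("/docs/report.txt", b"final")  # full delete and rewrite instead
```

Even a one-character change means writing the whole file again, which is the point of the complaint above.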

Bye Bye Content

As Hadoop gathers buzz, the folks who market it are straining to find even more purposes for their products. They’ll tell you that Hadoop is not just for data scientists any more. This has led to widespread use of the term “Data Lake.” What's a data lake? TechTarget tells us, “A data lake is a large object-based storage repository that holds data in its native format until it is needed.” 

Sounds like a file share, yes?

What does the write-once nature of HDFS mean for data lakes? Well, it strikes me that in the push to find further uses for their products, companies that sell and support Hadoop distributions, as well as companies that make and sell disk drives, are doing so without regard for any real-world logic. You have to wonder if they just want to push more nodes, regardless of what can actually be used.

It also should tell you that any content you write on HDFS can never change. That means that the text, the title, the author and the rest of its metadata will remain as is forever — or until you move it to a read/write file system and re-index it. Your documents can check in, but they'll never check out.

And when a document is archived to a giant shared drive, there’s a good chance it will never be found again. I’ve seen a few projects that went back after the fact to assign metadata to shared drive documents with tools like those that Smartlogic, Concept Searching, Expert Systems and others offer.

What if you have a repository that you know will never change, and you want to search-enable and index it anyway? If you ever want to find the content again, there are a few things you should do before you banish the content to HDFS:

Teach your users what metadata is, and why it helps their content be discovered. Encourage them to apply metadata as they create their documents. Average users may not know what metadata can do, but if you can convince them that metadata enables them to find their own content again later, you’ve got a chance. And if you have a team that creates content for distribution, those professional content creators should be required to provide good metadata.

Worst case, use humor: point them to my blog post “Sixty guys named Sarah,” and they may come to understand that bad, missing or just wrong metadata can make search bad, useless and wrong.

And whatever you do, don’t take the advice I heard one Hadoop-distro company suggest to a customer. They claimed that HDFS is an ideal place for search indices. But can you imagine how often your average search platform updates its indices as it ingests thousands of new documents?

If you have a huge set of data that never changes, and will only need to be indexed once, you might be able to put up with the abysmal indexing time. But if you ever decide to add even a single document — be prepared for a very long index run as your disk thrashes around in an attempt to update the index. Or you could save some money on disk drives and just delete all of the old content: the effect is the same.
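A rough way to see the cost is a toy model, assuming the index lives in one file that cannot be modified in place, so every addition means writing the whole file again. Real search platforms organize their index files differently, so treat this as a simplified sketch. The JSON serialization, the function names and the sample corpus are all my own assumptions.

```python
import json


def index_bytes(index):
    """Size of the index if it were serialized as one file (JSON, for simplicity)."""
    return len(json.dumps(index, sort_keys=True).encode("utf-8"))


def add_document(index, doc_id, text):
    """Return a new index that includes doc_id; the old index is left untouched."""
    updated = {term: list(ids) for term, ids in index.items()}
    for term in sorted(set(text.lower().split())):
        updated.setdefault(term, []).append(doc_id)
    return updated


# Hypothetical corpus of 200 small documents.
index = {}
for i in range(200):
    index = add_document(index, f"doc{i}", f"term{i % 50} shared words here")

# Add a single document and compare what changed with what must be written.
new_index = add_document(index, "doc200", "one new document")
bytes_added = index_bytes(new_index) - index_bytes(index)
bytes_rewritten = index_bytes(new_index)  # the whole file is written again
```

In this model the new document contributes only a few dozen bytes, yet the full index has to be rewritten, and the gap grows with the size of the corpus.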

Title image by Grand Canyon NPS, licensed under the Creative Commons Attribution 2.0 Generic License.