If Hadoop Disappears, Will the Label on Your Distro Matter?

Olson has a way of saying things that cause a reaction. Last year it was his introduction of the Enterprise Data Hub that caught everyone off guard, and this year it’s the promise that the big data muncher named after an elephant in a storybook will practically vanish from the line of sight of all but a few geeks.

You’d think that if it were invisible and it was all open source then the brand of Hadoop used might not matter so much.

OK, that’s oversimplifying it a bit too much, so let’s use Gartner analyst Svetlana Sicular’s words from a recent blog post:

“There are more similarities than differences among commercial Hadoop distributions: All Hadoop distributions include the core open-source Apache Hadoop projects, many other open-source projects and a smaller set of distribution-specific components. Most distribution-specific components deliver functionality comparable with functionality of other distributions. This makes vendor lock-in concerns ungrounded for the majority of use cases.”

So what it comes down to then might not be so much which distro best meets your IT requirements, but which vendor has the resources in place to support you and will still be in that position as Hadoop continues to evolve.

And trust me, it continues to evolve. (Hortonworks’ distro, Hadoop 2.2, released earlier this month, had over 100 new features and enhancements, for example.)

Olson contends that the days of having to worry about the geeky stuff (the Pig and Flume and Sqoop) are over (and the vendors who sell you software and support will have your back), and that the future is about building applications on top of Hadoop that deliver business value.

The World is Not Yet Hadoopified

That is a future that has yet to be seen, whether Olson and Silicon Valley realize it or not. It’s important to note that the world that the distro providers -- and even some analysts -- live in is one in which executives are already working with Hadoop or seriously looking into it. They're not thinking of the ones who are buried under an onslaught of information that they aren’t sure they can handle (so they’re avoiding it by tackling other projects first). This group is going to need not only a quality “product” but a whole lot of hand holding.

And it’s here that the three primary Hadoop distro providers offer very different products. MapR is a software company that uses many open source products in its distro. It's going to deliver a Hadoop product that’s as close to plug’n’play as they come.

If MapR delivers as promised, your services bill won’t be anywhere near as high because your employees should be able to work with it. What you might sacrifice is being as close to the leading/bleeding edge as you might be with Hortonworks or Cloudera. (MapR typically announces its adoption of the newest technologies a month or months after the others. And MapR’s customers may not care because they’re not looking to move to the latest release and onto the edge in short order anyways.)

Hortonworks’ Hadoop distro, HDP, is 100 percent open source. The company’s employees contribute heavily to the Apache open source projects that make up HDP, and they believe that a community of passionate engineers can produce better products than employees working in isolation within the walls of any single vendor. They make their money selling support for HDP, and they spend a great deal of time and resources on open source development that is shared with all. Their strategy is “if we built Apache Open Source Hadoop and the technologies which surround it, we are in the best position to help you reap value from it and support it.”

Cloudera’s model is a mix of the aforementioned. Much of their distro is free and open source, but the value adds, which are open source and free at Hortonworks, are proprietary and must be paid for.

We should note that this marketplace is not about what’s free and what’s not but about whose “product” will best meet your needs and offer a competitive advantage.

Check What's Under the Hood

And while vendor lock-in via proprietary products might have been the bogeyman of the past, the value now is more about having the best, most reliable big data solution in place so that the business can glean insights and leverage them faster than the competition.

Sicular alludes to the same point in her blog post:

“For many organizations, big data initiatives are the cutting edge of their innovation. Talented and experienced distribution vendors are often not just service providers but innovation partners and the source of new ideas in the enterprise.”

In other words, it might behoove you to look not only at the Hadoop distros when you shop for vendors, but also the data scientists and business analysts that they employ who can help you reap insights that deliver business results. (Note: If LinkedIn is right, not one of these vendors employs more than three data scientists.)

And finally, just as a fun way to compare Hadoop distros, we have a word cloud to share based on Cloudera’s and Hortonworks’ latest release blog posts. (MapR has not released its update yet.)



Title image by Sarah Wiseman (Flickr) via a CC BY-NC 2.0 license