It takes a community to raise an elephant -- that was the warm, fuzzy sentiment at the Hadoop Summit in San Jose this week. And the elephant, otherwise known as Apache Hadoop, is growing up quickly.
Apache Hadoop is an open-source software framework, released under the Apache License 2.0, that supports data-intensive distributed applications and runs them on large clusters of commodity hardware. (Hadoop, by the way, isn’t an acronym of any kind; it got its name from a toy elephant owned by the son of Doug Cutting, one of Hadoop’s creators.)
For the laymen among us, Hadoop is one of the technologies that web properties like Amazon, Facebook, Netflix and LinkedIn use to predict what products we are likely to buy, what ads we are most likely to click on, what movies we might want to see, and whom we might know.
It’s worth emphasizing that the Hadoop framework is Open Source, meaning that it’s free to use and built by a community of passionate engineers whose code is heavily scrutinized before it is accepted into the Apache Hadoop project. (More simply stated, a bunch of passionate experts who aren’t compensated in any way have to agree that the code is damn good, that it can stand the test of time, and that it works.) The goal of the community is to drive continuous innovation into the core of Hadoop and to encourage its adoption.
The Hadoop Summit is a gathering of the Hadoop Community. Here, individual Hadoop enthusiasts, companies that build products and deliver services around Hadoop, and commercial enterprises that use (or are exploring the use of) Hadoop come together to share ideas, innovations, success stories and lessons learned.
It might sound a bit odd to the capitalists among us that a bunch of geeks (some of whom clearly want to strike it rich) would gather to give their best thinking away, but it’s happening. And though the vendors in this space don’t necessarily sing the praises of each other’s business models, they do share code, even if some contribute far more than others.
It’s interesting to note that two bright lights (among many others) on the “contributing” front of Hadoop come not from vendors (their bright lights to follow), but from commercial entities whose business models do not involve raising revenue around Hadoop.
One such bright light comes from InMobi, a mobile advertising company. When Srikanth Sundarrajan, a Principal Architect at the firm, ran into a data lifecycle management problem that Hadoop didn’t handle, he built the solution himself. And while he and his employer could have kept the code that solved their problem for themselves, they contributed it to the Apache Hadoop Project so that others who faced similar problems could benefit from their work. Sundarrajan then worked with a group of Hadoop committers to build Project Falcon, which is now part of the Apache Hadoop incubator project. You can read more about it here.
Netflix, which employs an army of Open Source enthusiasts, did something similar. When they encountered difficulties running Hadoop workloads in the cloud, they built Genie, an architecture that provides job and resource management for the Hadoop ecosystem in the cloud. They Open Sourced the framework last Friday.
I point out these examples because they represent Open Source at its best. It’s a “We build it and everyone can use it” mentality.
In a perfect Hadoop World it would all work this way, all of the time; and at the Hadoop Summit it does. Here are a few announcements and topics we found interesting.
Hadoop Gets A New Operating System - YARN
As bleeding-edge as Hadoop seems to be, the reality is that it was built seven years ago at Yahoo. Its goal at the time? To search and index the web. Needless to say, it has been a huge success. Arun Murthy, who was a lead architect on Yahoo’s Hadoop MapReduce development team at the time, was responsible for providing Hadoop as a service for all of Yahoo -- running on nearly 50,000 machines.