It takes a community to raise an elephant -- that was the warm, fuzzy sentiment at the Hadoop Summit in San Jose this week. And the elephant, otherwise known as Apache Hadoop, is growing up quickly.
Apache Hadoop is an open-source software framework, licensed under the Apache License 2.0, that supports data-intensive distributed applications running on large clusters of commodity hardware. (Hadoop, by the way, isn’t an acronym of any kind; it got its name from a toy elephant owned by the son of Doug Cutting, one of Hadoop’s creators.)
For the laymen among us, Hadoop is one of the technologies that web properties like Amazon, Facebook, Netflix and LinkedIn use to predict what products we are likely to buy, what ads we are most likely to click on, what movies we might want to see, and whom we might know.
It’s worth emphasizing that the Hadoop framework is Open Source, meaning that it’s free to use and built by a community of passionate engineers whose code is heavily scrutinized before it is accepted into the Apache Hadoop project. (More simply stated, a bunch of passionate experts who aren’t compensated in any way have to agree that the code is damn good, that it can stand the test of time, and that it works.) The goal of the community is to drive continuous innovation into the core of Hadoop and to encourage its adoption.
The Hadoop Summit is a gathering of the Hadoop community. Here, individual Hadoop enthusiasts, companies that build products and deliver services around Hadoop, and commercial enterprises that use (or are exploring the use of) Hadoop gather to share ideas, innovations, success stories and lessons learned.
It might sound a bit odd to the capitalists among us that a bunch of geeks (some of whom clearly want to strike it rich) would gather to give their best thinking away, but it’s happening. And though the vendors in this space don’t necessarily sing the praises of each other’s business models, they do share code, even if some contribute far more than others.
It’s interesting to note that two bright lights (among many others) on the “contributing” front of Hadoop come not from vendors (their bright lights to follow) but from commercial entities whose business models do not involve raising revenue around Hadoop.
One such bright light comes from InMobi, a mobile advertising company. When Srikanth Sundarrajan, a Principal Architect at the firm, ran into a data lifecycle management problem that Hadoop didn’t handle, he built the solution himself. And while he and his employer could have kept the code that solved their problem for themselves, they contributed it to the Apache community so that others who faced similar problems could benefit from their work. Sundarrajan then worked with a group of Hadoop committers to build Project Falcon, which is now in the Apache Incubator.
Netflix, which employs an army of Open Source enthusiasts, did something similar. When they encountered difficulties running Hadoop workloads in the cloud, they built Genie, an architecture that provides job and resource management for the Hadoop ecosystem in the cloud. They Open Sourced the framework last Friday.
I point out these examples because they represent Open Source at its best. It’s a “We build it and everyone can use it” mentality.
In a perfect Hadoop World it would all work this way, all of the time; and at the Hadoop Summit it does. Here are a few announcements and topics we found interesting.
Hadoop Gets A New Operating System - YARN
As bleeding-edge as Hadoop seems to be, the reality is that it was built seven years ago at Yahoo. Its goal at the time? To search and index the web. Needless to say, it’s a huge success. Arun Murthy, who was a lead architect on Yahoo’s Hadoop MapReduce development team at the time, was responsible for providing Hadoop as a service for all of Yahoo -- running on nearly 50,000 machines.
This first generation of Hadoop was a purpose-built system for web-scale data processing, according to Murthy. And it’s great for those purposes.
But as more and different kinds of companies show interest in Big Data, they want to do other things with Hadoop, and the MapReduce paradigm that Hadoop 1.0 uses wasn’t built for that. Murthy foresaw this three years ago, so he and a team of Hadoop committers began working on Project YARN.
YARN enables a wide range of data processing applications to run natively in Hadoop with predictable performance and quality of service. The YARN-based architecture of Hadoop 2.0 will enable organizations to take Hadoop beyond its batch-processing roots and deploy specific applications designed to derive greater business value from critical, and often untapped, information. YARN will make Hadoop a platform.
It’s worth noting that YARN is an Apache Software Foundation project; it’s not owned by Murthy’s employer. In fact, Murthy worked at Yahoo when the project began; he now works at Hortonworks, where he is a co-founder. That being said, Hortonworks will make YARN enterprise-ready, and companies that want to use it will be able to get it from them for free. It’s part of the Hortonworks Data Platform 2.0 (HDP 2.0).
For now, HDP 2.0 is in Community Preview and pre-beta, which means that it’s not quite ready for prime time. That being said, Hortonworks is offering a certification program for Apache Hadoop YARN, which is designed to support the Apache Hadoop ecosystem behind this next-generation architecture by helping application developers build and certify their applications to use the YARN architecture of Hadoop 2.0. Hortonworks says that participants in this program are instrumental in the testing and delivery of this new framework, and that they are given access to the latest developments and direct interaction with the Apache community of developers building YARN.
YARN was the big talk of the Hadoop Summit.
For those who are wondering how Hortonworks will make its money if its employees are working on Hadoop and then giving their Hadoop platform away for free, there is an answer: by providing training and services to companies that use, or plan to use, Hadoop. As I understand it, they see this as a sound strategy for two reasons. First, they believe in the spirit of Open Source: that a community working together can produce a better product (distribution, framework, platform) than a team of employees working at a single firm, and that charging for a community-built product (even if you, as a business entity, have made improvements to it) is misaligned with that spirit. Second, because they employ more Hadoop committers (those who commit code to the Hadoop project) than other vendors, they believe they know Hadoop best and are in the best position to provide services around it.
Hadoop Queries Made Easy For MicroStrategy Customers
Talk to some people and they’ll tell you that getting insights out of Hadoop is a bear: that you’ll need a team of developers who know Java and MapReduce and who can present results that look like SQL.
It doesn’t have to be that hard, says Scott Capiello, Vice President of Product Management at Business Intelligence Software provider MicroStrategy, Inc.
At the Hadoop Summit, a team of engineers from Yahoo’s e-Commerce group in Taiwan showed how they gave their users the ability to run queries on data in Hadoop through a MicroStrategy HBase connector.
MicroStrategy connectors exist for various SQL-on-Hadoop options, including Stinger, Impala, HAWQ and IBM’s Big SQL. Though the MicroStrategy solution may not be for everyone in every case, and it works best with lower volumes of data, it’s a big win for companies that don’t have the manpower needed to go another route. It is also a big win for companies that know which data to bring in-memory, because their users will be able to query MicroStrategy from anywhere, at any time, and get their answers on any device.
There’s another thing worth noting about MicroStrategy’s embrace of Hadoop: it illustrates how Hadoop is going mainstream. Teradata’s announcement this week echoes that sentiment.
Hadoop is crossing the chasm, as the folks from Hortonworks repeated over and over again during the Hadoop Summit. The baby elephant is growing up.
Title image courtesy of Johan Swanepoel (Shutterstock)