Outdoing Hadoop: Facebook Scales Apache Giraph to a Trillion Edges
Raise your hand if you’ve heard of Apache Giraph. OK, we admit it, up until earlier this week we hadn’t either. But we bet Big Data wranglers worldwide are taking a good look at it today. Why? Because Facebook chose it over Apache Hive (a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis) and GraphLab (a graph-based, high-performance, distributed computation framework that was first developed for machine learning tasks and is now used for a variety of other data-mining tasks) to bring a new style of search to its 1.15 billion users.
To do this, they had to make a good number of improvements to Apache Giraph. The result? It can now scale to a trillion edges which, according to Facebook engineer Avery Ching (see his blog post), was impossible last year.
What does this big advancement mean for Apache Hadoop? We’ve seen a few headlines that say things like “Move over Hadoop,” but we don’t think Hadoop’s going to play second fiddle anytime soon.
Why? First, because not many companies want to, or need to, wrangle as much data as Facebook. Second, because not many companies (in fact, maybe no company other than Google) employ as many gifted engineers (I count at least 150) as Facebook and, at least for now, it's going to take a lot of brain-power to handle Giraph. Third, Giraph is nowhere near Enterprise-ready, and Hadoop is. Finally, while Enterprises may have problems working with Hadoop, scaling isn’t one of them.
What’s So Special About Lucene/Solr's Newest Committer?
Talk about leaning in: Apache Lucene/Solr’s newest core committer is female! Out of the 43 individuals in this prestigious group, Cassandra Targett is the only woman.
From what we can tell, she didn’t have to break through a glass ceiling to claim her place; instead, she used a great deal of her talent and time to create the Solr reference guide and to update it after each release.
While Targett did much of this work as an employee of LucidWorks, now that the company has donated the Lucene/Solr documentation to the Community, she’s “volunteering” there. She not only helped get the documentation hosted by Apache but she also pitches in to create documentation when the developers don’t have time.
And, it’s worth noting that unlike most Apache committers, Targett doesn’t code.
So, maybe the news around Targett is not just that she’s a woman, but that the open source Lucene/Solr community has awarded a prestigious status to someone whose role is documentation. Imagine that!
And, in case there’s anyone who isn’t familiar with Apache Lucene/Solr, here it is in Targett’s words:
“Lucene is a full-text search library written in Java. It's super-fast and easily scalable, making it perfect for high-performance indexing needs. Solr is built on Lucene, but has additional features that allow it to be more of an out-of-the-box search server. It inherits Lucene's power and extends it to be incredibly flexible for the needs of nearly any search application.”
Why is Lucene/Solr significant to us? First, because the proprietary vendors who provide similar technologies (Autonomy, Verity, FAST) would add to the price tag of using something like it with Documentum; Apache Lucene/Solr, in and of itself, would not, because it is open source. Second, proprietary vendors can do with their technologies what they want, and that’s not good for you when you depend on them. Consider that EMC Documentum used FAST for search until Microsoft acquired FAST; after that, it could not. That’s not going to happen with an open source technology.
Continuuity Helps Partner Reap Real-Time Results
Remember Continuuity, the start-up that promises to make Hadoop accessible to the rest of us (i.e. software engineers who don’t work at Facebook, Netflix, Eventbrite, eBay …)?
This week they announced that they will integrate Continuuity Reactor (their PaaS offering for Big Data) with Crowd Control®, Lotame’s data management platform (DMP). For Lotame this provides two big benefits. First, they will be able to provide their customers with purpose-built applications that drive real-time insights into their data, so that those customers can make more informed decisions and realize better ROI on their data assets. Second, because Continuuity Reactor takes care of all of the loathsome work that has to be done before a Hadoop application can even be written, Lotame’s developers will be able to spend their time on work that has an impact.
CSC Ups Its Interest in the Big Data Game via Hortonworks and Infochimps
Most of us don’t think of “Big Data” when we think of CSC, and for CSC that needs to change. After all, the word on the street (whether you’re on Sand Hill Road in Silicon Valley or on Main Street in Plymouth, Vermont) is that companies that don’t get it right with Big Data are going to perish.
Now if you’re an apple grower or a landscaper who plants shrubs at country clubs, maybe doing Big Data can wait. But if you’re a technology services provider like CSC, Big Data had better be a BIG part of your game or you’re doomed.
Now while we’re not suggesting that CSC lacks a Big Data practice (we’ve checked and they have one), they’ve done two things in the past ten days that might help them make a bigger mark on the map. First, they announced a reseller partnership with Hortonworks; second, they acquired Infochimps last week.
The Hortonworks announcement doesn’t come as much of a surprise: Hortonworks’ strategy seems to center around pollinating the planet with its flavor of Hadoop -- HDP (Hortonworks Data Platform) -- and CSC can help them do that by recommending it to its customers. Still, there is some subtle but bold language in Hortonworks’ announcement that caught our eye. Consider this quote from VP of Business Development Mitch Ferguson:
“We look forward to expanding the enterprise adoption of 100-percent open source Apache Hadoop (note Hortonworks considers itself to be the only 100% Open Source player in this space, a claim that its competitors would certainly challenge) across the globe and ensure its place (can you say “dominate”) in today’s modern data architecture.”
The other CSC announcement, the purchase of Infochimps, leaves us a little sad, for no other reason than that it likely represents the end of a dream. All too often, innovation tends to cease once a company has been acquired. And from what I’ve seen of the Infochimps team, these are creative, fun-loving geeks who aren’t into suits, policies and procedures, and CSC is full of those. Their enthusiasm could wane for two reasons -- cultural fit and money in the pocket (the latter of which wouldn’t be a horrible way to end a dream if you’re tired of your vision and just want to cash out).
It’s worth noting that for Jim Kaskade, who took over as CEO not too long ago, this probably represents a big boon. Maybe he was brought in specifically to pump Infochimps up and prime it for sale or maybe he saw that the possibilities of getting to an IPO someday were out of line with reality.
In either case, here’s hoping that Infochimps’ founders will use their genius, energy and interest in big data and innovation well, whatever the future brings.
Title image courtesy of Feng Yu (Shutterstock)