Databricks Opens Its Spark(ly) Gates to All

At the Hadoop World conference in New York City last fall, much of the excitement was around another big data crunching technology: Apache Spark.

Full house with people standing during Spark talk. Next year it can be called Spark-World. #hadoopworld #strataconf pic.twitter.com/cBjJ934OYC
— Andre Luckow (@drelu) October 17, 2014

Spark is a fast and general engine for processing large data sets that can be as much as 100x quicker than the MapReduce component of Apache Hadoop. It is also more flexible and easier to use, according to Patrick Wendell, a cofounder of Databricks. Wendell was a member of the team that developed Spark at the University of California, Berkeley.

The researchers there took what they had built at Berkeley and donated it to the Apache Software Foundation. They then left academia to found Databricks, a commercial entity that provides a cloud platform for users of the now open source Apache Spark.

Logical Transition

Was the leap from the relative safety and security of an ivory tower to Silicon Valley, where more startups fail than succeed, a big risk? Hardly, according to Wendell.

After all, there wasn’t much question whether crunching big data was a big deal. The open question was, and still is, how quickly and simply it can be done by more people, and what real-world impact that might have.

Hadoop and other potentially game-changing technologies that have been widely available until now weren’t going to be the best answer, according to Wendell.

“They’re too hard to use,” he explained.

“And if we find them hard to use,” he added, referring to himself as well as some of the other elite members of UC Berkeley’s AMPLab, where Spark was first created, then the chances of them being widely adopted were limited.

This is not to say that Spark necessarily diminishes Hadoop's prospects.

The Hadoop camp takes a “better together” approach when it comes to Spark. So much so, in fact, that engineers from all three Apache Hadoop distro providers (Cloudera, Hortonworks and MapR) will be presenting at the Spark Summit, which opens in San Francisco today.

Databricks Opens the Floodgates

But it’s doubtful that the Hadoopers will steal the show, because Databricks will drop two big pieces of news just as the conference begins. The first is that the Databricks Cloud platform is now generally available; it has been in private beta until now.

This means that data workers who have been champing at the bit to get busy with Spark will now be able to do so on the Databricks Cloud platform almost as easily as pushing a button.

There’s nothing to download and no hardware to provision; users can simply log in and launch a cluster. Perhaps better yet, data scientists and developers won’t even need to learn a new language. The Databricks Cloud platform plays nicely with Python, Java, Scala, and, as of today, data scientist favorite R. Databricks is sold by monthly subscription.

IBM Embraces Spark Too

Early this morning IBM broke the news that it’s teaming with Databricks and the Apache Spark Community to contribute key machine learning capabilities to the Apache Spark Project. The contribution will come in two parts.

First, IBM will commit 3500 researchers to the open source big data project. It’s a big number by any measure, but considering that Databricks lists fewer than 50 employees on its website and that the Apache Spark community is made up of a healthy 500 members from 200 different companies, it’s huge. In fact, if all 3500 IBMers came to the Spark Summit, then they’d probably need a different venue.

But that’s not all that IBM announced. It also plans to teach Spark to more than 1 million data scientists and to host Spark on its Bluemix Platform as a Service. While this appears to be incredibly great news for the Apache Spark Community, the numbers are a bit overwhelming.

Big Potential

Even without the IBM news, Databricks and the Spark Community have changed the big data landscape. The Apache Hadoop community realized early on that this would happen and has been working to bring the best of both technologies to its customers.

But non-Hadoopers have recognized the potential impact of Spark as well.

Expect developers to bring it into the enterprise when big data processing opportunities arise. Sure, these will be test cases at first, but there’s not much question that they’ll fare well.

Not only did the 150 or so beta testers like what they saw, but Wendell also said that Databricks took its time to make sure it was ready before opening its doors.

We’ll be watching to see what happens when you pair an eager community with access to game-changing, flexible and easy-to-use technology via a credit card…
