How Should Hadoop Users Think About Spark?

There’s been a lot of rhetoric around Apache Spark over the last few years. Back in the fall of 2014, some even suggested — only partly in jest — that the Hadoop World conference change its name to Spark World.

Spark, from a high-level perspective, is a fast and general engine for large-scale data processing. By the project’s own claims, it can run some workloads as much as 100 times faster than the MapReduce component of Apache Hadoop when the data fits in memory.

It is an Apache project with a large and vibrant community. More details are available from the Apache Spark project.

The Golden Child?

Spark has practically become the data world’s golden child.

Last June IBM announced it was going to put more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide.

It also donated its IBM SystemML machine learning technology to the Spark open source ecosystem and promised to educate more than one million data scientists and data engineers on Spark.

In September Cloudera announced its plans to officially replace MapReduce with Apache Spark as the default processing engine for its Hadoop distro.

Cloudera’s competitors MapR and Hortonworks both support Spark in their broader offerings.

Spark at Right Size

Suffice it to say, Spark is a big deal. And though it can’t replace Hadoop per se, Hadoop vendors that work well with it have a lot to offer their customers.

(Spark, we should note, doesn’t have to be tied to Hadoop. It can interface with a variety of other storage systems, including Cassandra, OpenStack Swift, Amazon S3, Kudu or a custom solution.)

One of the big questions I had for one of the creators of Spark in 2013, before Spark’s corporate sponsor Databricks had even announced its business model, was why anyone would still use Hadoop in the future. It wasn’t an official interview, so I’m not going to name him, but his answer was “because they already have it.”

“So it’s over for Hadoop? No new customers from this point forward?” I asked him.

That question has been answered. Customers are still buying Hadoop; otherwise Hortonworks wouldn’t be experiencing record growth.

The right question to ask now, according to Shaun Connolly, VP Corporate Strategy at Hortonworks, is “How do I unlock new value from Hadoop using Spark?”

We’ve seen MapR’s and Cloudera’s answers.

Yesterday Hortonworks announced plans to accelerate Spark at scale, using it as an analytics engine to get data from Hadoop, HDFS, Hive and others and to develop a new generation of modern data applications.

What should the conversation about Hadoop and Spark focus on in 2016?

It shouldn’t be a bunch of rhetoric, that’s for sure.

How about looking at Spark for what it does well and Hadoop for what it does well, so that new products, new applications and new revenues can be created?
