Why Does Apache Spark Matter?

Apache Spark is arguably the hottest big data technology of the year — or maybe ever.

More than 1,000 enthusiasts have committed code to the open source project, and almost every big data provider has embraced it, even though it’s just 20 months old.

IBM has said that it will educate one million data scientists and data engineers in Spark. Databricks, whose founders created Spark, has already trained 20,000.

This week 1,200 of Spark’s early adopters are gathered in New York City at the Spark Summit East.

It’s a chance for them to learn from each other and from Matei Zaharia, who created Spark as a student at UC Berkeley.

What Is Spark?

We asked Peter Schlampp, vice president of product at big data discovery platform Platfora, to explain in plain English:

“Apache Spark is a rapidly evolving open source engine for large-scale data processing and analytics. Businesses that rely on Hadoop need a variety of analytical infrastructures and processes to find the answers to their critical questions.

“They need data preparation, descriptive analysis, search and more advanced capabilities like machine learning and graph processing. Also, businesses need a toolset that meets them where they are, allowing them to leverage the skillsets and other resources they already have.

“Until now, a single processing framework that fits all those criteria has not been available. This is the fundamental advantage of Spark.”

Let’s Get Excited

Why is Spark such a big deal? NBC uses it to discover content offline, Toyota uses it to improve customer experience, and Airbnb uses it to predict demand.

It holds the potential to catch cybercriminals in real time and even to eliminate the buffering on your screen when you’re streaming something.

But enough already. I asked some of the smartest people whose companies are at the Spark Summit, “Why does Spark matter?” Here’s what they said:

Grant Ingersoll, CTO, Lucidworks

As data sizes grow, it is imperative that developers have the means to build applications that can scale to match that growth.

Spark makes it easier for developers to build these applications by simplifying the programming model and taking care of a lot of the things that can go wrong in distributed applications.

Kavitha Mariappan, VP Marketing, Databricks, Inc.

Spark enables enterprises to process large amounts of data faster than ever, and dramatically simplify their infrastructures by obviating the need to integrate a disparate set of complex tools.

Spark goes beyond batch computation and provides a unified platform that supports streaming, interactive analytics, and sophisticated data processing such as machine learning and graph algorithms.

It is fast, as it has been built from the ground up to process data in memory. Spark’s optimizations, however, extend beyond memory.

Scott Gnau, CTO, Hortonworks

Since Spark combines SQL, streaming and complex analytics, it offers broad compatibilities across multiple technology domains — a key advantage for running analytics against diverse data sources.

Because it is capable of in-memory compute, Spark opens up a lot of possibilities for data science (iterating through models at the speed of thought), machine learning (sophisticated processing with memory speeds) and streaming types of workloads.

Peter Milne, Senior Sales Architect, Aerospike

Big data analytics is about separating the signal from the noise: finding patterns in data. Spark is a massively parallel analytics engine with very nice built-in machine learning, which is great for learning from the data and making predictions from it. It also offers graph processing, which is being used to power targeted advertising and social apps.

Jim Scott, Director, Enterprise Strategy and Architecture, MapR

Hadoop MapReduce was effective at providing a way to solve problems at scale, but it wasn’t easy to use and wasn’t the fastest solution for all use cases. Spark is much easier to use and has been a catalyst for competition in the distributed compute engine space.

Spark has a plethora of configuration options or levers, if you will. The best way to put this is with an analogy. Spark is like a fighter jet that you have to build yourself. The great thing about it is that after you build it then you have a fighter jet, which is pretty cool. But you still have to spend some time learning how to fine-tune (fly) that fighter jet.

Nick Halsey, CMO, Zoomdata

Spark is a critical component for building real-time and streaming analytic applications. It provides a layer between the data source and the visualization for high-performance analytics at scale, and takes advantage of an elastic and distributed architecture for optimal performance and manageability.

Peter Schlampp, VP Products, Platfora

Spark provides a framework for advanced analytics right out of the box. This framework includes a tool for accelerated queries, a machine learning library, a graph processing engine and a streaming analytics engine.

Spark makes everything easier. Instead of requiring users to understand the various complexities, such as Java and MapReduce programming patterns, Spark is made to be accessible to anyone with an understanding of databases and some scripting skills (in Python or Scala).

Spark accelerates results. It provides parallel in-memory processing that returns results many times faster than any other approach requiring disk access. Instant results eliminate delays that can significantly slow incremental analytics and the business processes that rely on them.

Spark doesn’t care where or how you store your data. All of the major Hadoop distributions and Cloud service providers now support Spark, with good reason. Spark is a vendor-neutral solution, meaning that implementation doesn’t tie the user to any one provider.

Because Spark is open source, businesses are free to create a Spark-based analytics infrastructure without having to worry about whether they might change Hadoop vendors at some point down the road. If they change, they can bring their analytics with them.

Gary Orenstein, CMO, MemSQL

Spark solves a number of data processing challenges in one system, and it does so quickly using a memory-optimized architecture. It does not do everything, however.
