Self-service BI and analytics-for-all aren’t buzzwords anymore. 

We have now entered the modern analytics era, Gartner noted in its latest Magic Quadrant for Business Intelligence (BI) and Analytics. Here business users are willing to wait three seconds to get results from their BI queries, even if there are huge data lakes (aka Hadoop) to trudge through to come up with the right answers.

“If users don’t get responses that fast, they think something is broken and give up. Or worse, they launch the query again (which slows everything down even more),” Bruno Aziza, CMO at San Mateo, Calif.-based AtScale, told CMSWire.

So why not send the right query to the right engine from the start?

How it Works

Aziza said there are five main ways to run SQL on Hadoop: Apache Drill (used by MapR), Apache Hive (used by Hortonworks), Apache Impala, Presto and Apache Spark. Among those Hive, Impala and Spark predominate. AtScale took it upon itself to conduct benchmarks on the top three in order to assess their fitness when it comes to satisfying BI workloads:

  • Performs on Big Data: the SQL-on-Hadoop engine must be able to consistently analyze billions or trillions of rows of data without generating errors and with response times on the order of 10s or 100s of seconds
  • Fast on Small Data: the engine needs to deliver interactive performance on known query patterns and as such it is important that the SQL-on-Hadoop engine return results in no greater than a few seconds on small data sets (on the order of thousands or millions of rows)
  • Stable for Many Users: Enterprise BI user bases consist of hundreds or thousands of data workers, and as a result the underlying SQL-on-Hadoop engine must perform reliably under highly concurrent analysis workloads

Using the classic Star Schema Benchmark (SSB) data set it used 6B rows, 13 queries, 3 patterns.

  • Quick Metric queries:  Compute a particular metric value for a period of time. such as, “How many hoverboards did I sell in Q4?” (Q1.1-Q1.3)
  • Product Insight queries: Compute a metric (or several metrics) aggregated against a set of product and date based dimensions, such as how many red hover boards did I sell in Q4? (Q2.1-Q2.3)
  • Customer Insight: Compute a metric (or several metrics) aggregated against a set of product, customer, and date-based dimensions, such as how many red hover boards did I sell to women over the age 40 in Q4? (Q3.1-Q4.3)

Here’s what it found:


Best Engine for the Job

According to AtScale’s benchmark, no single engine is best for all use cases. “Hive is the worst engine you can connect to,” Aziza said, noting that it can take 20 minutes to come back with an answer on some queries. He then backtracked a bit and clarified that Hive was a gem for data pipelines because it seldom crashed.

Learning Opportunities

Impala excels where there are lots of users (aka concurrent users) and Spark excels on smaller data sets where there are tables with thousands or several million rows of data.

He also noted that because all three are Apache, and therefore community-driven, projects, the capabilities of each are likely to advance. This already seems to be the case between Spark 1.5 and 1.6.

What’s the Angle?

Vendors rarely conduct benchmark studies for the heck-of-it. So what’s AtScale’s motive given that they weren’t benchmarking their own product?

“We have the information, our customers ask which engine is the best, so why not share it?” said Aziza.

While that’s honorable, there’s probably another agenda as well. One of the things that AtScale, whose product serves as a middleman between BI tools and Hadoop, does is automatically picks the right engine for the workload. In other words, it optimizes for performance.

It also provides much needed governance and security that Aziza said. Hadoop does not.