Wanna Race? Cloudera Says Impala Is Faster than Hive and Proprietary RDBMS

Cloudera made a big splash at O'Reilly Strata + Hadoop World 2013 in New York City last October when it announced its Enterprise Data Hub strategy. The company wants the hub to be the place where companies park all of their data, regardless of format, and from which they can run whichever BI tools they happen to use.

At the time, some of Cloudera’s Hadoop competitors were a bit taken aback, not by the idea of an Enterprise Data Hub (MapR suggested it more or less already offered one; Hortonworks sort of ridiculed the term, adding that it preferred the word “platform,” as in Hortonworks Data Platform), but by Cloudera’s insistence that it no longer saw them as competition.

Cloudera CEO Tom Reilly told GigaOm’s Derrick Harris that the company has set its sights on a larger market, the one in which IBM, Pivotal (and we’d guess HP, Teradata and so on) are the established players. So what's new today?

Hey Proprietary MPP Providers, Check This Out

This morning Cloudera will claim a major milestone when it releases the results of performance benchmark testing for its open source interactive SQL query engine, Impala. The company will reveal that Impala queries across data in an open Hadoop columnar storage format (Parquet) ran on average two times faster than identical queries on a commercial analytic database management system (DBMS) over its proprietary storage format.

In addition, Cloudera’s benchmarking results show that Impala has maintained or widened its performance advantage against the latest release of Apache Hive (0.12).

What’s the Big Deal?

While Impala has been known for high performance in the open source world of Hadoop, no formal testing against proprietary MPP query engines had been done until now. With today’s announcement of the benchmark results, Cloudera puts a stake in the ground, positioning itself as the “go-to” for all enterprise data and analytics.

“In the past, SQL on Hadoop has been viewed as separate,” said Marcel Kornacker, Impala’s principal architect. “This no longer needs to be the case.”

With Impala, customers can “exceed their SQL performance experiences from proprietary databases but preserve the flexibility they enjoy with the Hadoop stack,” added Justin Erickson, director of product management at Cloudera.

Big Data Insights, No Disruptions in the Enterprise or Ecosystem

Disruptive technology is one thing. Disrupting the way people do mission-critical work is quite another. Cloudera seems to understand the difference. That’s why its EDH strategy aims to give data scientists and BI workers the ability to access more data, at petabyte scale, using the tools (such as Tableau and MicroStrategy) that they’re already familiar with.

Not only that, but the data that lands in the Enterprise Data Hub is immediately available. No Extract, Transform and Load (ETL) step is required, unlike with many other vendors that offer SQL-on-Hadoop capabilities.

Finally, Impala, which was released to the public in May, is enterprise-grade and opens the door for queries on datasets that up until now haven’t been leveraged.

The Proof is in the Data, Says Cloudera

Impala’s query performance was evaluated against a popular analytic database referred to as “DBMS-Y” (because a licensing agreement between the database maker and its users prevents it from being named). Cloudera ran a series of 20 queries based on the industry-standard benchmark TPC-DS. The results showed that:

  • Impala ran consistently faster than DBMS-Y: across 20 queries, Impala ran on average two times faster than DBMS-Y, outperforming it in 17 of the 20 queries. For some queries, Impala was more than four times faster.
  • Queries over open data beat those over proprietary data: even though Impala queries were run on open Hadoop data in the Parquet format and DBMS-Y queries were run on data in its own proprietary format, Impala was still faster.
  • Impala scales linearly and predictably: In tests, Impala maintained identical response times with increased user concurrency and on larger datasets by simply adding new machines at the same rate as the concurrency and data growth.

Furthermore, Impala is still more than an order of magnitude faster than Hive: on identical hardware, Impala queries ran on average 24 times faster than those run on Apache Hive 0.12 using ORCFile.

What About Presto?

On Dec. 10, start-up Qubole announced that it is offering users access to Presto (now in alpha at Qubole) as a Software-as-a-Service (SaaS) solution for interactive SQL queries on data stored in Hadoop. Though we haven’t seen Presto in action, if it performs as promised, it could give Impala a run for its money.

We asked Cloudera for its thoughts on the matter. Here’s what company execs said:

"We think Presto is a strong endorsement for Impala’s technical direction as it makes a number of similar architectural choices to Impala. Facebook created Hive and is the single largest Hive user. Facebook’s Presto effort demonstrates that it has come to the same conclusion we did a number of years back that 'making Hive better' was a technical dead end when it comes to giving users the performance and functionality they need."

"Because Presto was started well after Impala and built for Facebook’s usage, it still lags Impala in functionality and performance needed for the enterprise data hub. For example, Presto currently lacks certified support from leading BI tools, fine-grained security, workload management, ANSI-92 SQL, a cost-based optimizer, an ODBC driver, and support for an efficient columnar format. We haven't yet had time to do a rigorous performance benchmark but based on anecdotal evidence from initial testing, both in-house and at partners, there's a good indication Impala is significantly faster than Presto and that Presto is still a far cry from entering the same performance league as the traditional analytic databases and Impala."

"We don’t have insight into Qubole’s decision, but it’s worth noting that Qubole was founded by former members of Facebook’s data services team. Impala is available from multiple large-scale enterprise vendors including of course Cloudera and our resellers with Oracle’s Big Data Appliance, cloud vendors like IBM’s SoftLayer, Verizon Business Systems, CenturyLink Savvis, T-Systems. Amazon announced Impala’s availability as part of their own EMR cloud offering. A direct Cloudera competitor, MapR, also offers Impala to their customers as well. We feel this broad adoption is yet another demonstration of Impala’s unique leadership role as a cornerstone of the enterprise data hub."

Title image by sippakorn (Shutterstock).