Cloudera made a big splash at O'Reilly Strata + Hadoop World 2013 in New York City last October when it announced its Enterprise Data Hub strategy. The company wants the hub to be the place where enterprises park all of their data, regardless of format, and from which they can run whichever BI tools they happen to use.
At the time, some of Cloudera’s Hadoop competitors were a bit taken aback, not by the idea of an Enterprise Data Hub (MapR suggested it more or less already offered one; Hortonworks sort of ridiculed the term, adding that it preferred the word “platform,” as in Hortonworks Data Platform), but by Cloudera’s insistence that it no longer saw them as competition.
Cloudera CEO Tom Reilly told GigaOm’s Derrick Harris that the company has set its sights on a larger market, the one in which IBM, Pivotal (and we’d guess HP, Teradata and so on) are the established players. So what's new today?
Hey Proprietary MPP Providers, Check This Out
This morning Cloudera will claim a major milestone when it releases the results of performance benchmark testing for its open source interactive SQL query engine, Impala. The company will reveal that Impala queries across data in an open Hadoop columnar storage format (Parquet) ran on average two times faster than identical queries on a commercial analytic database management system (DBMS) over its proprietary storage format.
In addition, Cloudera’s benchmarking results show that Impala has maintained or widened its performance advantage against the latest release of Apache Hive (0.12).
What’s the Big Deal?
While Impala has been known to have high performance in the open source world of Hadoop, no formal testing against proprietary MPP query engines has been done until now. With today’s benchmark testing results announcement, Cloudera puts a stake in the ground by establishing itself as the “go to” for all Enterprise data and analytics.
“In the past, SQL on Hadoop has been viewed as separate,” said Marcel Kornacker, Impala’s principal architect. “This no longer needs to be the case.”
With Impala, customers can “exceed their SQL performance experiences from proprietary databases but preserve the flexibility they enjoy with the Hadoop stack,” added Justin Erickson, director of product management at Cloudera.
Big Data Insights, No Disruptions in the Enterprise or Ecosystem
Disruptive technology is one thing. Disrupting the way people do mission-critical work is quite another. Cloudera seems to understand the difference. That’s why its EDH strategy aims to give data scientists and BI workers access to more data, at petabyte scale, using the tools (such as Tableau and MicroStrategy) that they’re already familiar with.
Not only that, but the data that lands in the Enterprise Data Hub is immediately available. No Extract, Transform and Load (ETL) step is required, unlike with many other vendors who offer SQL on Hadoop capabilities.
Finally, Impala, which was released to the public in May, is enterprise-grade and opens the door for queries on datasets that up until now haven’t been leveraged.
The Proof is in the Data, Says Cloudera
Impala’s query performance was evaluated against a popular analytic database referred to as “DBMS-Y” (because a licensing agreement between the database maker and its users prevents it from being named). Cloudera ran a series of 20 queries based on the industry-standard benchmark TPC-DS. The results showed that:
- Impala ran consistently faster than DBMS-Y: Across 20 queries, Impala ran on average two times faster than DBMS-Y, outperforming DBMS-Y in 17 of the 20 queries. For some queries, Impala was more than four times faster.
- Queries over open data beat those over proprietary data: Even though Impala queries were run on open Hadoop data in the Parquet format, and DBMS-Y queries were run on data in its own proprietary format, Impala was still faster.
- Impala scales linearly and predictably: In tests, Impala maintained identical response times with increased user concurrency and on larger datasets by simply adding new machines at the same rate as the concurrency and data growth.
Furthermore, Impala is still more than an order of magnitude faster than Hive: on identical hardware, Impala queries ran on average 24 times faster than those run on Apache Hive 0.12 using ORCfile.
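For readers wondering how a figure such as "on average two times faster" is typically derived, here is a minimal sketch. It assumes the claim comes from per-query speedup ratios (baseline time divided by Impala time), which the announcement does not spell out. The timings are invented purely for illustration and are not Cloudera's results, and the function names are hypothetical. The sketch reports both the arithmetic and the geometric mean, since the two can differ noticeably on the same data.

```python
from math import prod


def speedups(baseline_secs, candidate_secs):
    """Per-query speedup: baseline time divided by candidate time."""
    return [b / c for b, c in zip(baseline_secs, candidate_secs)]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return prod(values) ** (1.0 / len(values))


# Hypothetical timings in seconds, invented for illustration only.
baseline = [8.0, 2.0, 4.0, 1.0]    # the system being compared against
candidate = [2.0, 1.0, 4.0, 2.0]   # the system claimed to be faster

ratios = speedups(baseline, candidate)
wins = sum(1 for r in ratios if r > 1.0)  # queries where the candidate was faster

print("per-query speedups:", ratios)
print("queries won:", wins, "of", len(ratios))
print("arithmetic mean speedup:", arithmetic_mean(ratios))
print("geometric mean speedup:", geometric_mean(ratios))
```

The arithmetic mean is pulled upward by a few very large ratios, so it is worth checking which average a vendor benchmark reports before comparing numbers across announcements.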
What About Presto?
On Dec. 10, start-up Qubole announced that it is offering users access to Presto (now in alpha at Qubole), a Software-as-a-Service (SaaS) solution for interactive SQL queries on data stored in Hadoop. Though we haven’t seen Presto in action, if it performs as promised, it could give Impala a run for its money.