Cloudera made a big splash at O'Reilly Strata + Hadoop World 2013 in New York City last October when it announced its Enterprise Data Hub strategy. The company wants the hub to be the place where enterprises park all of their data, regardless of format, and from which they can run whichever BI tools they happen to use.
At the time, some of Cloudera’s Hadoop competitors were a bit taken aback, not by the idea of an Enterprise Data Hub (MapR suggested it more or less already offered one; Hortonworks sort of ridiculed the term, adding that it preferred the word “platform,” as in Hortonworks Data Platform), but by Cloudera’s insistence that it no longer saw them as competition.
Cloudera CEO Tom Reilly told GigaOm’s Derrick Harris that the company has set its sights on a larger market, the one in which IBM, Pivotal (and we’d guess HP, Teradata and so on) are the established players. So what's new today?
Hey Proprietary MPP Providers, Check This Out
This morning Cloudera will claim a major milestone when it releases the results of performance benchmark testing for its open source interactive SQL query engine, Impala. The company will reveal that Impala queries across data in an open Hadoop columnar storage format (Parquet) ran on average two times faster than identical queries on a commercial analytic database management system (DBMS) over its proprietary storage format.
In addition, Cloudera’s benchmarking results show that Impala has maintained or widened its performance advantage against the latest release of Apache Hive (0.12).
What’s the Big Deal?
While Impala has been known for high performance in the open source world of Hadoop, no formal testing against proprietary MPP query engines had been done until now. With today’s benchmark announcement, Cloudera puts a stake in the ground, positioning itself as the “go-to” choice for all enterprise data and analytics.
“In the past, SQL on Hadoop has been viewed as separate,” said Marcel Kornacker, Impala’s principal architect. “This no longer needs to be the case.”
With Impala, customers can “exceed their SQL performance experiences from proprietary databases but preserve the flexibility they enjoy with the Hadoop stack,” added Justin Erickson, director of product management at Cloudera.
Big Data Insights, No Disruptions in the Enterprise or Ecosystem
Disruptive technology is one thing. Disrupting the way people do mission-critical work is quite another. Cloudera seems to understand the difference. That’s why its EDH strategy aims to give data scientists and BI workers access to more data, at petabyte scale, using the tools (such as Tableau and MicroStrategy) that they’re already familiar with.
Not only that, but the data that lands in the Enterprise Data Hub is immediately available. No Extract, Transform and Load (ETL) step is required, unlike with many other vendors that offer SQL-on-Hadoop capabilities.
Finally, Impala, which was released to the public in May, is enterprise-grade and opens the door to queries on datasets that have gone untapped until now.
The Proof is in the Data, Says Cloudera
Impala’s query performance was evaluated against a popular analytic database referred to as “DBMS-Y” (because a licensing agreement between the database maker and its users prevents it from being named). Cloudera ran a series of 20 queries based on the industry-standard TPC-DS benchmark. The results showed that: