Cloudera: More Hub Than Hadoop?

Virginia Backaitis

Hadoop watchers, take a step back in time to October 2013. The place is the Grand Ballroom of the New York Hilton, the event is Hadoop World. Cloudera co-founder Mike Olson stands at the front of the stage. He is about to say something remarkable, even game-changing ... or so we think.

After a short monologue recounting the history of Apache Hadoop, he paints a vision in which his company is no longer merely a Hadoop distro provider, but something called an Enterprise Data Hub.

“Our customers are beginning to use Hadoop in new ways, as the center of their data centers,” he tells the crowd.

Matt Brandwein, who was Cloudera’s director of product marketing at the time, explained the concept to CMSWire a day earlier.

“We (Cloudera) are becoming a data management company,” he said. “We want you to be able to store all of your data in one place, to be able to use your existing tools, and to help bring more benefits to your customers faster.”

The announcement was supposed to be a game-changer.

What Does an Enterprise Data Warehouse Bring to the World?

Or maybe it was merely a marketing tactic, a way for the most tenured vendor in the space to separate itself from the crowd as it took the covers off of its latest release, Cloudera Distribution for Hadoop (CDH 5.0). 

Remember, back then the Hadoop marketplace was jam-packed with vendors like Amazon, Cloudera, Hortonworks, IBM, Intel, MapR, Microsoft, Pivotal, Teradata and WanDisco, among others.

I remember overhearing MapR CMO Jack Norris tell a marketing consultant/journalist, “I don’t get it,” just after the announcement. Not that he didn’t understand what a hub was, but, “What’s the big deal?”

The deal, at least according to Cloudera co-founder and CTO Amr Awadallah, was huge. “Essentially, we’re saying there is a new industry now being formed,” he told GigaOm’s Derrick Harris.

MapR wasn’t the only vendor that was puzzled. Data warehouse vendors Teradata and IBM were taken aback as well, and for good reason. It appeared that Cloudera wanted to displace them.

By February of 2014, Brandwein was telling the press that Hortonworks was not Cloudera’s competition.

"Increasingly, our customers are not viewing the relevant comparison as Cloudera vs. Hortonworks. They're viewing it as Cloudera vs. Hortonworks plus Teradata Aster, or, if you're talking to an IBM shop, Cloudera vs. IBM BigInsights plus Netezza," he told Information Week.

Some Things Change, Others Stay the Same

Cloudera has backpedalled since then: it now plays nice with vendors it thought it might make obsolete just two short years ago. Today it counts Teradata, Microsoft, EMC and IBM as partners, among others.

Not only that, but from the looks of its website, Cloudera continues to identify itself more as a Hadoop provider than as an Enterprise Data Hub.

“No one knows Apache Hadoop like Cloudera,” it states. (We know at least one vendor who would beg to differ. Consider that Hortonworks writes on its site, “Hortonworks employs the largest group of committers under one roof; more than double any other company.”)

But forget that; the Hadoop wars are, at least for now, pretty much over. Three independent vendors dominate (Cloudera, Hortonworks and MapR), each with its own strategy and business model. We know that Hortonworks is growing by leaps and bounds because its records have been audited and are publicly available. Because neither Cloudera nor MapR has disclosed such information, we can’t comment on their growth.

The Return of the Hub

Though the Enterprise Data Hub is understated in Cloudera’s messaging of late (it wasn’t even mentioned in CMSWire’s briefing call), a close look at Cloudera Enterprise Hadoop Platform 5.5, which was announced last week, shows that the hub is very much alive.

The star of the announcement was a proprietary SaaS tool, “Cloudera Navigator Optimizer,” which Constellation Research analyst Doug Henschen told CMSWire is aimed at making it easier for companies to “optimize their data warehouses, saving money, by moving their data warehouse operations into Hadoop.” He noted that the four types of workloads involved are ETL (Extract, Transform and Load), BI reporting, ad hoc analysis and various other queries expressed in user-defined functions, stored procedures, and other ways of propagating SQL queries.

“The point of the optimizer,” he said, “is to analyze the SQL, spot redundancies, and provide tools to make it easier to move workloads to a Cloudera-based hub.”

A Growing Enterprise Footprint?

The SaaS solution is still in limited beta, but Anupam Singh, the head of data management at Cloudera, said that it’s already bringing big benefits to Cloudera’s “assessment” customers.

According to Singh, putting an optimized query on Hadoop pays off in three ways: it frees capacity on what might be a more expensive original system, the broken-down and simplified query becomes easier for engineers to maintain, and Hadoop itself becomes more productive.

Henschen noted that Optimizer will be of special benefit to Cloudera customers whose data warehouses are bogged down by a gamut of hard-to-maintain SQL code.

That being said, there may be political and “people” problems to overcome. Henschen pointed out that unless there is a “clear mandate from the CIO or central architecture team, database administrators and data integration professionals might not be too eager to give up the workloads that they manage to others; it’s not exactly a formula for job security.”

NoSQL Databases Grow Up

While Cloudera Navigator Optimizer represents a nice win for Cloudera customers, whether it brings them an advantage that Hortonworks Data Platform (HDP) or MapR users don’t have remains to be seen. While Singh insists that Navigator Optimizer is 24 months ahead of any other product on the market, Henschen noted that there are other tools to get the same job done. He pointed us to ETL vendors Informatica, Talend and SyncSort, for example.

Holger Mueller, also an analyst at Constellation Research, told CMSWire that Cloudera’s 5.5 release, along with Cloudera Navigator Optimizer, marks a point of growing up for NoSQL databases, which are gaining a capability that their ancestors, the relational databases, have had for a long time.

“Now we will have to see how well these new capabilities can increase throughput, response time and lower TCO of Hadoop deployments, but this is a milestone for the Hadoop database space.”

We should note that Cloudera Navigator Optimizer is a proprietary tool built specifically for Cloudera customers.

Cloudera Navigator Optimizer holds the potential to strengthen Cloudera’s Enterprise Data Hub play in a big way: it adds not only value but also a “sticky” factor to CDH.

Title image by avrene, Creative Commons Attribution 2.0 Generic License