If you listen to some big data pundits, you might get the impression that Hadoop can do almost anything. Heck, I let Hadoop babysit my kids last weekend. There are, however, a couple of tiny things the big data powerhouse has not been able to conquer: SQL syntax and real-time queries. Cloudera is trying to close that gap.
Big Data Fast
Hadoop has become popular for its ability to process massive amounts of data faster than traditional data management solutions. However, it is, and has always been, a batch-oriented tool, and modern business doesn’t happen at a batch pace. The need to make fast decisions based on answers hidden in large datasets has driven users and vendors to develop workarounds that make Hadoop faster and more like a relational database.
Most of those workarounds, however, still didn’t deliver what many users wanted: something with the feel of a relational database and the power of Hadoop. One of the earliest players in the Hadoop space, Cloudera, has spent the last two years developing a tool that runs beside MapReduce to query data stored in HBase or Hadoop’s Distributed File System (HDFS). Cloudera publicly launched the new framework, Project Impala, at Strata and Hadoop World. Impala works like another Hadoop component, Hive, but much faster. The new framework uses a SQL-like syntax, but executes in real time.
Impala, Hive Offer Comprehensive Solution for Data Queries
Charles Zedlewski, VP Products, Cloudera, discussed the need the company saw in the market for a new solution. Users were frustrated and often copied data between Hadoop and relational databases to get the answers they needed. Zedlewski says that will no longer be required with Impala. Cloudera is giving users access to large datasets in a way that is more intuitive and also offers a fast response time.
Users may be happy to get new features, but what does this really mean for the world of data? Businesses have relational databases, data warehouses and Big Data repositories, and now most of them support multiple styles of queries. There is bound to be a great deal of confusion about how to use each platform, and even whether businesses should continue to maintain separate data environments. Data warehouses have traditionally stored operational data and allowed users to ask simple questions about large volumes of transactional data. Analytics platforms often used the same transactional data, but were designed to answer more complex questions. Then came the Big Data repositories.
If Cloudera is successful, the days of stacked repositories and complex ETL processes may soon be over. With Impala, users could arguably perform both types of analysis in Hadoop. Traditional data warehouse functions would be handled by Impala and more complex queries by Hive.
Getting More Information
Cloudera is holding a free webinar on November 6 (registration required). If you want a more hands-on experience, Cloudera Impala is open source and is now available for anyone to try as part of the public beta program. Cloudera is also changing the pricing for Cloudera Manager, an administrative console for its commercial Hadoop distribution.