In June, Spark released a new version that included Spark MLlib, a machine learning pipeline for prepping and transforming data.
Then in November, Alan Choi, a software engineer at Cloudera working on the Impala project, described an approach for the "continuous loading of data into Impala via HDFS, illustrating Impala ability to query quickly-changing data in real time" in this blog post.
New Support for Presto, Spark SQL
Notice what these examples all have in common? They all involve Hadoop SQL dialects (okay that one was easy) and they all are key improvements that make it far easier for users to query Hadoop quickly. And they all rolled out last year.
"Most of these SQL dialects were written for batch jobs," Colin Zima, chief analytics officer of Looker, a big data business intelligence provider based in Santa Cruz, Calif., told CMSWire. "But they have been evolving very fast and have reached a tipping point."
A tipping point for Looker, that is, which has been waiting for some of the larger dialects in the Hadoop ecosystem to become mature enough to support its business intelligence (BI) platform.
That day is here. Looker has announced support for Presto and Spark SQL as well as updates to its support for Impala and Hive.
Compatibility With Amazon
This new modeling layer adds to Looker’s portfolio of supported data warehouses, which includes Amazon Redshift. It also means that Looker users have complete compatibility with the Amazon Elastic MapReduce suite of frameworks.
Looker's support of Presto and Spark SQL helps AWS customers access all their organizational data, said Anurag Gupta, Vice President of Database Services at Amazon Web Services.
It doesn’t matter whether that data resides in Amazon Relational Database Service (or Amazon RDS), Amazon Redshift, or, now, in an Amazon Simple Storage Service data lake accessed through one of the many SQL engines supported by Amazon EMR, Gupta said.
Storing Raw Data
Admittedly, there are companies and applications that help users jerry-rig around this problem, such as by pulling the data out into smaller databases and working from there, Zima said. But, he continued, those workarounds were not entirely satisfactory and didn’t fully leverage Hadoop's firepower.
Hadoop was originally used to store raw data and that is what the SQL dialects focused on — the batch processing of that data, Zima said. "There was little to no support for even simple analytical functions."
For analytical use the data needs to be optimized differently — and the query needs to be processed much, much faster.
All of the major SQL dialects are now fast enough for queries, Zima said.
Some Missing Pieces
Not that this evolution is complete, he added.
All of the SQL engines have bits and pieces missing. For example, Impala can't do COUNT(DISTINCT) on multiple columns; its developers recommend using NDV, which is a HyperLogLog estimate, Zima explained. Also, "several dialects are still missing a native hash function that is secure and, as of 1.6, Spark still requires setting the parallelization for joins and aggregates."
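To make the trade-off Zima describes concrete, here is a minimal Python sketch of what a HyperLogLog-style estimate looks like next to an exact distinct count. It is an illustration only, not Impala's NDV implementation: the function names, the choice of SHA-1 as the hash and the register count (`p=12`) are all assumptions made for the example.

```python
import hashlib
import math


def exact_distinct(rows):
    """Exact distinct count: has to remember every distinct value seen."""
    return len(set(rows))


def hll_estimate(rows, p=12):
    """Approximate distinct count with a basic HyperLogLog sketch.

    Illustrative only. Uses 2**p small registers instead of storing values,
    so memory stays fixed no matter how many rows are scanned.
    """
    m = 1 << p                       # number of registers
    registers = [0] * m
    tail_bits = 64 - p
    for row in rows:
        digest = hashlib.sha1(str(row).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        idx = h >> tail_bits                       # first p bits pick a register
        tail = h & ((1 << tail_bits) - 1)          # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the tail.
        rank = tail_bits - tail.bit_length() + 1
        if rank > registers[idx]:
            registers[idx] = rank
    alpha = 0.7213 / (1 + 1.079 / m)               # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small cardinalities: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    # Hypothetical data: 200,000 rows containing 50,000 distinct keys.
    rows = [i % 50000 for i in range(200000)]
    print("exact:   ", exact_distinct(rows))
    print("estimate:", round(hll_estimate(rows)))
```

The exact version has to hold every distinct value in memory, while the sketch keeps only a fixed array of small counters and accepts an error of a few percent. That is the kind of bargain an approximate function such as NDV offers.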
But the changes of the last year or so are enough to get Looker started.
A Unique Position
As it does so, the company is positioned in a unique spot, having avoided many of the jerry-rigged models of the past. "Users can just leave the data where it is in Hadoop and query from our platform," Zima said.
This new modeling layer Looker has built for the SQL Hadoop engines includes all of the necessary interactivity and logic needed to support business intelligence, he said. There is also a front end for the user to build charts and other visualizations.
For users, it is plug-and-play. "Once the database credentials are in Looker, you can start querying the database immediately," Zima said.