In an exclusive interview with CMSWire’s Virginia Backaitis last week, Microsoft Corporate Vice President T. K. “Ranga” Rengarajan threw down the gauntlet to Amazon. He challenged Amazon's Redshift cloud-based data warehouse to scale the way Microsoft’s new Azure SQL Data Warehouse does, with respect to both compute size and storage capacity.
Rengarajan’s challenge raised some interesting questions, including, “What are we talking about here?”
2 Levers Are Better Than 1
Scalability has been an Amazon trademark since the company launched AWS. With respect to Redshift, its capacity has been scalable since its public launch in February 2013.
In an interview this week at Microsoft’s Ignite conference in Chicago, Rengarajan stressed that the Azure data warehouse product is scalable in terms of both capacity and compute capability for large-scale jobs.
But Amazon's proponents have said the same for Redshift.
What’s more, there’s the question of what we even mean by “data warehouse” anymore, in the context of the cloud. In the 1990s, a data warehouse was a sophisticated system for preparing and processing data for maximum efficiency with the reporting structures of that day.
With cloud dynamics pooling together storage, memory and processing, even on-premises, there’s a fair argument to be made that a cloud data warehouse is a kind of halfway house for organizations migrating to big data.
“In the case of data warehousing, we’re separating our responsibility to store the data and compute – we’re separating the meters for the two,” Rengarajan told CMSWire yesterday. “And we’re saying, we’re going to enable you to elastically scale the compute as you wish.
“Amazon has taken the perspective that it’s almost like a box that they take and set it up in the sky. We can change it, but essentially what they’re doing behind the back is take all this data and pull it to a bigger box that they’ve hosted somewhere else.”
It isn’t really scalability, he argued, when you’re moving data from one box to another. He said this makes changes happen in much smaller increments, including with processing capability. To make his point, he noted that a 20-way join involves exponentially greater underlying transactions than a 10-way join.
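Rengarajan’s join claim can be sketched numerically. One common way to see why multi-way joins blow up is the planner’s search space: an n-way join has on the order of n! candidate left-deep join orders. This is a general illustration of the combinatorics, not a description of how SQL Server’s or Redshift’s optimizer actually searches that space.

```python
from math import factorial

# Why a 20-way join is so much harder than a 10-way join: the number of
# possible left-deep join orders for n tables grows factorially (n!).
for n in (10, 20):
    print(f"{n}-way join: {factorial(n):,} possible left-deep join orders")

# The jump from 10 to 20 tables multiplies the search space by a factor
# of 20!/10! -- hundreds of billions -- which is the kind of "exponentially
# greater underlying transactions" Rengarajan is gesturing at.
print(factorial(20) // factorial(10))
```

Real optimizers prune this space aggressively (dynamic programming, heuristics), but the raw growth rate is why join processing at scale demands serious engineering.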
“You only pay when you use compute,” Rengarajan said. “So if you park a petabyte with us and you’re not doing any queries with us, you’re just paying for the parking fees.”
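The “separate meters” model he describes can be reduced to a toy calculation: storage and compute are billed independently, so a warehouse that runs no queries accrues only storage (“parking”) charges. The rates below are hypothetical placeholders for illustration, not Azure’s actual pricing.

```python
# Hypothetical rates -- placeholders only, not real Azure pricing.
STORAGE_RATE_PER_TB_HOUR = 0.05
COMPUTE_RATE_PER_UNIT_HOUR = 1.50

def monthly_bill(stored_tb, compute_units, compute_hours, hours_in_month=730):
    # Two independent meters: storage runs all month, compute only when used.
    storage = stored_tb * STORAGE_RATE_PER_TB_HOUR * hours_in_month
    compute = compute_units * COMPUTE_RATE_PER_UNIT_HOUR * compute_hours
    return storage + compute

# Park a petabyte (1,000 TB) and run zero queries: only the storage
# meter runs -- Rengarajan's "parking fees."
print(monthly_bill(stored_tb=1000, compute_units=10, compute_hours=0))
```

The design point is that the compute term drops to zero when the warehouse is idle, which a bundled storage-plus-compute box cannot do.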
The core engine of Azure’s data warehouse remains SQL Server, but this time refactored for massively parallel processing (MPP).
“Every query runs in parallel, as opposed to some of the open source systems you may be aware of,” he said. “Running one query in parallel across multiple cores is not child’s play. There’s a reason it’s not casually built in open source.”
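The property Rengarajan is describing, one logical query parallelized across many cores, can be sketched in miniature: partition the data, aggregate each partition on its own worker, then combine the partials. This is an illustration of intra-query parallelism in general, not of how SQL Server’s MPP engine is implemented.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker computes the aggregate over its own partition.
    return sum(chunk)

def parallel_sum(values, workers=4):
    # Split the "table" into roughly equal partitions, one per worker.
    size = -(-len(values) // workers)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with Pool(workers) as pool:
        partials = pool.map(partial_sum, chunks)
    # Final step: combine the partial aggregates, as a coordinator would.
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Same answer as sum(data), but the work is spread across cores.
    print(parallel_sum(data))
```

Even this toy version hints at why it is “not child’s play”: real engines must also repartition data between operators, handle skewed partitions, and keep the combine step from becoming a bottleneck.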
The nature of parallel data processing runs somewhat contrary to the highly tuned synchronicity of pre-cloud data warehouses. Since even single queries are not child’s play, we put a question to Rengarajan: "Don’t existing data warehouse schemes have to be re-engineered, and processes completely reconsidered?"
Put another way, isn’t the old system of data management (Online Transaction Processing or OLTP) like hiring a chain of ants to carry water in buckets, while the new system replaces them with a fire hose?
Rengarajan answered that a great many reporting applications that were put through OLTP over the years are actually simple queries or simple analytical processes, such as producing pie charts and comparing pie charts. Businesses were dipping their toes into the ocean, and many may not have gotten all that wet in the process.
On the other end of the scale, data warehouses and data marts were “huge data processing systems.”
“The question is, do you drag this thing by a million ants or 10,000 Iditarod sled dogs or one elephant? Does it make sense for one of these units to be a single-threaded operation? It doesn’t make sense.”
He borrowed an analogy from Intel, whose latest Xeon E7 v3 processors may have up to 18 cores apiece. On a two-socket server, that’s 36.
“If you had a system on a unit like that in the cloud, it’s stupid to take that unit and then chop it into 36 little pieces and create a VM (virtual machine) for each one of them and run your dinky little ant on that thing. That’s stupid. That’s where SQL Server comes in. It’s able to take a single query and happily use the [whole] machine to do that.
Size is very important. You want a single system that’s cost-optimized for the cloud to be something that a single unit can fully consume without going into unnatural acts of chopping a computer into 36 and putting it back together.”
It’s the long way around the argument, avoiding the obvious, “Yes, some things must definitely change” statement. But in summary, his point is that for cloud-based data warehousing to remain cost effective over time, the hardware needs to be incorporated into the service, rather than the other way around.
It’s Ranga Rengarajan’s way of distinguishing Azure SQL Data Warehouse from Redshift: he characterizes Amazon’s cloud control as a subdivision of available hardware, compared to Azure’s system of incorporating resources and capacity as necessary. He says it’s this kind of cost optimization that could enable organizations to build a kind of “Moore’s Law” into their long-term strategies, keeping costs low by scaling up compute capacity on their own schedule.