With a typical hybrid cloud architecture, relational databases that span storage volumes across premises into the public cloud are facilitated — rather delicately — by a cloud management platform (CMP).
But as OpenStack’s own documentation concedes, “there is no single CMP that can address all needs in all scenarios.”
Hadoop’s approach to stretching data across volumes is via a completely separate operating system using a cloud-capable, cross-volume file system (HDFS). But that’s for unstructured data. If relational databases were managed so that individual tables could span volumes, maybe some data warehouses wouldn’t need an entirely separate file system.
That’s the goal Microsoft has set for itself, with the upcoming release this summer of a public preview of SQL Server 2016.
The End of Archiving
In an interview with CMSWire last week at Microsoft’s Ignite conference in Chicago (the first part of which appeared last week), Corporate Vice President T.K. “Ranga” Rengarajan explained this new feature’s basic premise, enabling the lengths of tables within schemas to extend from on-premise to Azure storage.
“Any table that is cumulative, that you used to throw away after a certain period of time, now you never have to throw away,” said Rengarajan.
“You no longer have to discard the past. You can keep all of it conveniently accessible, and it transparently flows into the cloud. Applications don’t need to change; applications don’t even know that some of the data resides locally, and some of it resides in the cloud.”
Rengarajan painted a picture of a new kind of infrastructure layer for SQL Server databases that incorporates local storage with Azure storage without altering the schema.
Nothing about the way databases are constructed today would need to change, in order for tables to be stretched beyond the limits of on-premise storage volumes.
In this new infrastructure layer, all data is encrypted by default. This way, tables do not need to undergo a separate encryption process, in order to be stretched into Azure territory.
“We think this is a gentle introduction [to Azure] for our on-premise customers,” he said, “because we feel both the opportunity to support them and the obligation to support them in getting to the cloud in responsible ways.”
But that neatly painted picture may be overlooking one very critical truth about data warehouses: They’re already partitioned, as the most common way of dividing data among multiple volumes.
The End of Heat Maps
Archiving is one typical methodology for determining how data should be partitioned within data warehouses (DW). In situations where old data may be just as relevant as new, DWs sometimes produce “heat maps” that divide the most relevant data from the least accessed.
When database vendors first began coping with the fact that Hadoop clusters were everyday facts with which they’d have to deal, way back in the dark ages of computing (2009), they suggested that Hadoop analytics could be best put to use dividing hot data from cold, and preparing the warmest varieties for extract/transform/load (ETL) processing — cleansing data for inclusion in typical DWs.
The reason: DWs were already partitioned, and vendors suggested that database architects should not ignore the partitions they’d gone through so much trouble to create.
Here is how Microsoft’s own SQL Server documentation explains the process, using words that the company’s own editors presumably looked over and approved for publication: “Table and index partitioning in SQL Server 2005 facilitates better management of very large databases through the piecemeal management of partitions.
“One drawback of using partitioned views,” the documentation goes on, “is that when you do, database operations have to be performed on individual objects as opposed to on the view itself. For example, if the existing index has to be dropped and a new index created, the operation has to be performed on each underlying table.”
The End of “Chains of Ants”
Most relational database architecture, right on through the big data era, has been forced to take into account the probability that tables and indexes would need to be partitioned. For that reason, operations such as the fetching of records takes place in small batches, cognizant that the source of subsequent batches in a fetch may be redirected to a different location.
The analogy I’ve used before is this: Compare a chain of ants passing tiny buckets of water to fight a fire, to one fellow with a hydrant and a hose.
The SQL Server innovation to which Rengarajan refers flies in the face of partitioning. It enables new database architects to assume that any method that gathers records from a single volume, regardless of size, can be applied to a table that spans continents.
It’s a magnificent dream. But it does imply a change at the fundamental level: specifically, a radical simplification to dispense with the workarounds that architects have had to implement up to now.
“You’re absolutely right that this is some serious engineering at the lowest levels,” Microsoft’s Rengarajan told me. “We have worked really, really hard to make it, quote-unquote, ‘transparent’ to the application developer.
“Our theory is the following: You have an on-premises database, an on-premises application. Let’s assume that you’re happy with it,” he continued.
“Then let’s look at this data that forever accumulates, and ask, when do you discard? You use that as the criteria to stretch to the cloud. So if you did exactly the same thing as you did before — never touch the data that’s older than a certain amount — you should see zero difference, [besides] the additional overhead of moving the data from here to there.”
As Rengarajan confirmed, this does mean queries that span the length of a table, addressing both new and old data without the use of archive partitions, will yield predictable, positive results. Maybe there’s a small amount of overhead during the transition period, but such overhead would be more than compensated for, theoretically, by newer and simpler queries and fetch processes.
Cloud dynamics enabled big data to alter our concept of databases from the outside. Stretchable tables is the latest instance of cloud dynamics changing it from the inside.