The ideal of containerization is that workloads may be distributed safely across multiple platforms and orchestrated through a common portal. “Big data” is arguably the biggest workload of all, with some organizations now measuring their data warehouses in petabytes.
The big question is not really, “How do we containerize big data?” but actually, “Is there a real need to run Hadoop in containers?”
Beyond the convenience it may offer certain software vendors, and the intrigue it may generate in the IT department, is there a real-world payoff worth pursuing?
Tom Phelan is chief architect for a company called BlueData, which produces a platform called EPIC that enables Hadoop, and all the various components of the Hadoop platform, to run in a containerized environment such as Docker.
One of BlueData’s immediate benefits over Hadoop on bare metal or on virtual machines (VMs) is that it adds an abstraction layer between Hadoop and existing data storage.
This lets data in existing volumes be recognized by Hadoop processes in containers as though they had already been transferred to Hadoop’s HDFS multi-volume file storage, without the need to make that transfer explicitly.
VMware does have a method for virtualizing storage for Hadoop, by generating virtual machines that replicate the contents of existing storage clusters. Its method requires that name nodes and data nodes run inside virtual clusters.
“That would be dramatically different in a container world,” Phelan told CMSWire, “because containers can go out and reach storage directly. They don’t have to build this HDFS within the container.”
What’s more, he added, Hadoop name and data nodes do not need to be virtualized within containers to be addressable by them, which improves performance.
Phelan believes the true value of containerization in big data could be realized at the development level.
The architecture of containerized environments, he argues, enables software developers and DevOps professionals to build on each other’s work in an easier, less constrained way, changing the workflow of the creative side of data centers.
Experiments that just last year would have required a meeting, an agreement, an itinerary, a pilot run and an integration phase may now be boiled down to a test and a rejection, all within a few seconds. Ideas that don’t work in practice can now be eliminated long before anyone even notices they were created.
In a Hadoop environment, Phelan believes, this significantly changes how even the largest businesses can be moved from ancient data warehouse infrastructures to modern, fluid, adaptable systems.
“Once that software is done, I’m now transitioning it to an IT organization at a Fortune 5000 company, and they’re going to run their business on these solutions,” Phelan told CMSWire. “They don’t really care if it’s Docker or VMs. What they care about is to get that cheapest solution, and get the security they want.”
That last point is a definite maybe.
In a company blog post last month, Phelan acknowledged that container security remains a question mark in some circumstances, primarily because containers share the same underlying operating system — unlike the case with VMs, where operating systems are isolated, and an attack on one OS cannot bring down the rest.
Another company attacking the containerized data problem, from a somewhat different perspective, is Portworx. It produces an elastic storage infrastructure platform called PWX that does replace businesses’ existing storage with a system that is both completely contiguous and container-aware.
But Portworx’s goal is to present single storage volumes to existing database applications, such as the document store MongoDB and the more traditional relational database system PostgreSQL. The idea here is to containerize the applications without rewriting or re-architecting them.
“An ideal data center, for us or for anybody, looks very agile, fluid, and Google-like,” explained Portworx CTO Goutham Rao.
He believes Docker is key to realizing the complete vision of the software-defined data center, because it enables a single package format to accompany an application throughout its lifecycle, from development, through staging, and into production.
“Second, we envision a commodity data center,” Rao continued, describing one composed of common, easily replaceable components. “Nothing is custom; it’s all commodity hardware. Now you need a software infrastructure fabric that’s capable of adopting these containerized applications, figuring out what types of resources they need, and running them in the data center.”
Both Portworx’s Rao and BlueData’s Phelan are aware that, for data centers to adopt containerization completely, some part of the ecosystem that runs applications will need to be redesigned.
Both are trying to limit this renovation job to what corporations can reasonably afford, leveraging some existing components without rendering them “legacy.”
BlueData seeks to leverage existing storage infrastructure, enabling database applications to move to a Hadoop environment more readily. Portworx, meanwhile, leverages existing applications as a means of expediting the replacement of infrastructure with newer and more adaptable volumes.
That’s because both companies know that data center technology has never changed, and perhaps never will change, in one tremendous wave like the shift to touchscreen smartphones.
“People aren’t going to forklift-upgrade off of VMware and move to Docker,” Rao told CMSWire. “You need to start phasing Docker into your environment, especially as your DevOps teams start adopting Docker.
“And the operations guys who are managing the data center will, at least over the next four to five years, still want a VM-centric approach to managing their data centers. That’s what they’re used to.”
Title image by Leon Fishman.