Big data is here, and it’s getting bigger by the day. With more than 80 million Internet of Things (IoT) devices set to enter the market by 2020, brands are busy planning new customer experiences and preparing for an unprecedented onslaught of data.
Data lakes and data warehouses are two ways a brand can collect and manage all that data, but what’s the difference between them? We talked to practitioners to see how they distinguish the two.
What Is a Data Lake?
A data lake acts as a centralized repository where you can store all of your structured and unstructured data at any scale. In a data lake, you can store your data “as is,” without first structuring it, and still run different types of analytics on it.
“A data lake is generally created without a specific purpose in mind. It includes all source data, [both] unstructured or semi-structured, from a wide variety of data sources, which makes it much more flexible in its potential use cases. Data lakes are usually built on low-cost commodity hardware, which makes it economically viable to store terabytes or even petabytes of data,” said Ashish Thusoo, co-founder and CEO of Qubole.
What Is a Data Warehouse?
A data warehouse, also known as an enterprise data warehouse, is a data storage system that aggregates structured data from various sources for comparison and analysis in business intelligence.
“A data warehouse is a repository of many kinds of data and is highly modeled. In other words, any data you find in a data warehouse is going to be carefully related to all of the other data in the [data] warehouse. In addition, data in a warehouse tends to be highly standardized and cleansed,” said Jake Freivald, vice president at Information Builders.
What’s the Difference?
A data lake can be considered a vast pool of raw data where the purpose is not defined. A data warehouse is a repository for structured and defined data that has already been processed for a particular purpose.
“The greatest difference between data lakes and data warehouses is the varying structure of raw [and] processed data. Data lakes primarily store raw, unprocessed data, while data warehouses store processed and refined data,” said Isabelle Nuage, director of product marketing at Talend.
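The raw-versus-processed distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample events, the `readings` table and the function names are hypothetical, a plain list stands in for a data lake, and an in-memory SQLite table stands in for a data warehouse.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"device": "thermostat-1", "temp_c": "21.5", "ts": "2019-01-01T00:00:00"}',
    '{"device": "camera-7", "motion": true}',
    'free-text log line: sensor rebooted',
]


def store_in_lake(lake, record):
    """Lake-style write: keep the record as is, with no validation."""
    lake.append(record)


def load_into_warehouse(conn, record):
    """Warehouse-style write: parse, validate and type the record first.

    Returns True if the row was loaded, False if it did not fit the schema.
    """
    try:
        data = json.loads(record)
        row = (str(data["device"]), float(data["temp_c"]), str(data["ts"]))
    except (ValueError, KeyError, TypeError):
        return False
    conn.execute(
        "INSERT INTO readings (device, temp_c, ts) VALUES (?, ?, ?)", row
    )
    return True


def build_stores(events):
    """Send every event to both stores and count the warehouse rows loaded."""
    lake = []
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings ("
        "device TEXT NOT NULL, temp_c REAL NOT NULL, ts TEXT NOT NULL)"
    )
    loaded = 0
    for event in events:
        store_in_lake(lake, event)
        if load_into_warehouse(conn, event):
            loaded += 1
    return lake, conn, loaded
```

In this sketch the lake accepts all three records, while the warehouse accepts only the one that matches its schema. The other two stay available in the lake for uses nobody has defined yet.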
What Are the Pros and Cons of Each?
Since data lakes primarily store raw, unprocessed data, the stored data can be used for any purpose, which makes them well suited to artificial intelligence (AI), machine learning and data science. However, unprocessed data requires a large storage capacity, and data governance becomes an issue.
“The biggest pro of a data lake is that it was designed as cheap storage. However, as cheap raw storage, the cons fall into the handling of the data. How do you handle metadata, security, governance in a data lake? This is where costs could rise,” said Bill Peterson, VP of industry solutions at MapR.
Thusoo added, “Data lakes can yield results more quickly because so much data is already there. However, data lakes place more responsibility on the user to explore the data and find the use cases.”
As for data warehouses, because the stored data is structured and already processed, it is easier for enterprises to find and understand, Freivald explained. “Data warehouses are great for exploring data relationships across the enterprise. For example, if customer, product and facility information are all in the data warehouse, the [data] warehouse makes it relatively easy to see [how] the customer satisfaction and returns ratings are related to the different facilities at which those products are made,” Freivald said.
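As a rough sketch of the kind of cross-enterprise question Freivald describes, the following Python example relates satisfaction and returns to facilities with a single query. The table layout, names and sample rows are all invented for illustration, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3


def build_warehouse():
    """Create a tiny, hypothetical warehouse with related, typed tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE facilities (
            facility_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE products (
            product_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            facility_id INTEGER NOT NULL REFERENCES facilities(facility_id)
        );
        CREATE TABLE feedback (
            customer_id INTEGER NOT NULL,
            product_id INTEGER NOT NULL REFERENCES products(product_id),
            satisfaction INTEGER NOT NULL,  -- 1 (poor) to 5 (great)
            returned INTEGER NOT NULL       -- 1 if the product was returned
        );
        INSERT INTO facilities VALUES (1, 'Plant A'), (2, 'Plant B');
        INSERT INTO products VALUES (10, 'Widget', 1), (11, 'Gadget', 2);
        INSERT INTO feedback VALUES
            (100, 10, 5, 0), (101, 10, 4, 0),
            (102, 11, 2, 1), (103, 11, 3, 1);
        """
    )
    return conn


def satisfaction_by_facility(conn):
    """Average satisfaction and return rate per manufacturing facility."""
    return conn.execute(
        """
        SELECT f.name, AVG(fb.satisfaction), AVG(fb.returned)
        FROM feedback AS fb
        JOIN products AS p ON p.product_id = fb.product_id
        JOIN facilities AS f ON f.facility_id = p.facility_id
        GROUP BY f.name
        ORDER BY f.name
        """
    ).fetchall()
```

Because every table is already keyed to the others, the question reduces to two joins. The same modeling is what makes changes expensive, since any new table has to be fitted into these relationships.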
But this advantage comes at the cost of flexibility and requires a great deal of labor. “Data warehouses take serious effort to build and maintain, with minor changes taking a long time to implement, because when new data is added it has to be reconciled in relation to all of the other data in the warehouse,” Freivald said.
Freivald noted that in data lakes, adding data is relatively straightforward since the data does not need to be “reconciled with other data.”
What Are the Best Use Cases of Both Data Lakes and Data Warehouses?
As mentioned, data lakes provide a good source of raw data that enables users to utilize the data within their chosen context. Freivald explained how data lakes are frequently used by data scientists and also for AI purposes. “Data scientists often use data lakes to pull together information that hasn’t previously been considered in context with each other. [For example], testing the recurrence of one disease after a patient took a certain drug for a different disease,” Freivald said. “[Also], AI frequently looks at raw data in a data lake to discover patterns that we didn’t know existed.”
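The exploratory style described above can be sketched as applying structure only when the data is read. In this Python example, two made-up raw sources (a CSV export and a JSON-lines feed) are combined on the fly. The drug and condition names, the records and the function name are all hypothetical and carry no medical meaning.

```python
import csv
import io
import json

# Hypothetical raw lake contents: two sources kept in their original formats.
RAW_PRESCRIPTIONS_CSV = "patient_id,drug\n1,drug_x\n2,drug_x\n3,drug_y\n"
RAW_DIAGNOSES_JSONL = (
    '{"patient_id": 1, "condition": "condition_b", "recurred": true}\n'
    '{"patient_id": 2, "condition": "condition_b", "recurred": false}\n'
    '{"patient_id": 3, "condition": "condition_b", "recurred": true}\n'
)


def recurrence_rate_by_drug(prescriptions_csv, diagnoses_jsonl):
    """Apply structure at read time and combine the two raw sources."""
    drug_by_patient = {
        int(row["patient_id"]): row["drug"]
        for row in csv.DictReader(io.StringIO(prescriptions_csv))
    }
    totals = {}  # drug -> (patients seen, patients with recurrence)
    for line in diagnoses_jsonl.splitlines():
        record = json.loads(line)
        drug = drug_by_patient.get(record["patient_id"])
        if drug is None:
            continue  # no prescription on file for this patient
        seen, recurred = totals.get(drug, (0, 0))
        totals[drug] = (seen + 1, recurred + int(record["recurred"]))
    return {drug: recurred / seen for drug, (seen, recurred) in totals.items()}
```

Nothing had to be modeled in advance to ask this question. The trade-off is that the parsing and matching logic lives in the analyst's code rather than in a shared, governed schema.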
On the other hand, data warehouses are fit for providing enterprises with consistent data for repeatable processes. “Data warehouses are the premier source of data for business reporting and dashboards,” shared Freivald.
“Business people value the consistency and clarity that warehoused data brings. Analysts often use data from warehouses for ongoing analyses of trends, such as year-over-year analyses.”
Data Lakes and Data Warehouses: Do You Need Both?
Even though data lakes and data warehouses are distinct, “most companies need both,” said Kelly Stirman, VP of strategy and data analytics at Dremio. “Companies can first consolidate data from many sources into a [data lake] where they can perform a variety of workloads including preparing data for the data warehouse, running batch analytical workloads, running machine learning workloads and more,” Stirman continued.