Stop using the terms Data Lake and Data Warehouse interchangeably!

Pradnya Asolkar
3 min read · Jul 9, 2021

Forbes has aptly described this era as the “Age of (too much) data”. As a result, it is becoming imperative for organizations and analytical service providers to set up huge data repositories to store this data and make it accessible for advanced visualization, storytelling, and modeling. In this blog, let’s briefly discuss these repository models and the key differences between Data Lakes and Data Warehouses.

“Hiding within those mounds of data is the knowledge that could change the life of a patient, or change the world.” (Atul Butte, Stanford)

Let’s understand each of them a little better, shall we?

1. Data Lake

A data lake is a huge pool of data collected from all possible types of transactions, whether internal or external. Let’s talk about some relatable examples. Organizations such as P&G and Unilever sell many of the products we use every day, spanning beauty care, soaps and dish care, laundry care, baby care, packaged foods, and other categories. Just imagine the data they generate on a day-to-day basis for a country like the US or a continent like Europe. It’s huge, right?! And though some of it might be syndicated data, just imagine the data generated through text on Twitter and Facebook, images from Instagram, or videos from YouTube.

There is a whole new world of unstructured data (data that is not in a neat, structured format of rows and columns) which, when processed through sophisticated algorithms, can teach you a great deal about your business. So, organizations need to find a cost-efficient way to store these large quantities of data. The data lake model tends to ingest data very quickly in its raw form and prepare it later, on the fly, as people access it. Some very good use cases for Data Lakes can be found in this article by Arcadia Data. They revolve around:
  • Oil and Natural Gas
  • Smart City Initiatives
  • Life Sciences
  • Cybersecurity
  • Marketing and Customer Data Platforms
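
To make the "ingest quickly, prepare later" idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the in-memory list `raw_lake`, the helper names `ingest` and `read_sales`, and the sample records are all made up for this example, and a real data lake would sit on cheap object or file storage rather than a Python list.

```python
import json

# Hypothetical "lake": raw records are kept exactly as they arrive.
raw_lake = []

def ingest(raw_record: str) -> None:
    """Store the record untouched; no validation or cleansing at write time."""
    raw_lake.append(raw_record)

def read_sales():
    """Prepare data on the fly: parse each record and keep only sales-like ones."""
    rows = []
    for line in raw_lake:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # free text stays in the lake, it is just skipped for this read
        if isinstance(record, dict) and "sku" in record and "amount" in record:
            rows.append({"sku": str(record["sku"]), "amount": float(record["amount"])})
    return rows

# Mixed, messy input (all values are invented for illustration).
ingest('{"sku": "soap-01", "amount": "2.50"}')
ingest('{"tweet": "love this detergent!"}')
ingest('not even json')
ingest('{"sku": "dish-07", "amount": 4}')

sales = read_sales()
print(len(raw_lake), "records stored,", len(sales), "usable as sales rows")
```

Notice that everything gets stored, including the tweet and the free text, and the structure is imposed only when someone reads the data for a specific purpose.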

2. Data Warehouse

A data warehouse gathers data from different sources, whether internal or external, and cleanses and partially processes it so that different business teams can retrieve it for their analysis. The data is usually structured, often from relational databases, but it can be unstructured too. The main difference from a traditional database is that a database is primarily a storage repository for data, whereas a data warehouse is an information system built from multiple data sources and used to analyze data.
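
As a rough sketch of the "cleanse first, then store, then analyze" flow described above, the Python snippet below uses the standard-library `sqlite3` module as a stand-in for a warehouse. The table name `sales`, the `cleanse` helper, and the sample records are assumptions made up for this illustration; real warehouses use dedicated platforms and far richer pipelines.

```python
import sqlite3

# In-memory SQLite database standing in for a warehouse (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sku TEXT NOT NULL, region TEXT NOT NULL, amount REAL NOT NULL)"
)

def cleanse(record):
    """Return a cleaned (sku, region, amount) tuple, or None if the record is unusable."""
    sku = str(record.get("sku", "")).strip().lower()
    region = str(record.get("region", "")).strip().upper()
    try:
        amount = float(record.get("amount"))
    except (TypeError, ValueError):
        return None  # missing or non-numeric amount
    if not sku or not region:
        return None  # required fields are empty
    return (sku, region, amount)

# Records from different hypothetical sources, with inconsistent formatting.
source_records = [
    {"sku": " Soap-01 ", "region": "us", "amount": "2.50"},
    {"sku": "dish-07", "region": "EU", "amount": 4},
    {"sku": "", "region": "us", "amount": 1},
    {"sku": "soap-01", "region": "eu", "amount": "n/a"},
    {"sku": "soap-01", "region": "EU", "amount": 3.0},
]

# Only cleansed rows are allowed into the warehouse table.
clean_rows = [row for row in map(cleanse, source_records) if row is not None]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)

# Business teams can now run analysis directly on structured data.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The trade-off compared with a data lake is visible here: two of the five records never make it into storage, but whatever is stored is immediately ready for analysis.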

Before wrapping up, some other buzzwords in the data domain are Enterprise Data Warehouse (EDW), Data Mart, and Data Swamp. Let’s quickly go over them as well.

a) Enterprise Data Warehouse (EDW): This is a data warehouse that is set up for and used by an entire enterprise.

b) Data mart: This can be considered a subset of a data warehouse. While a data warehouse is a versatile storage unit for multiple use cases, a data mart is designed and built specifically for a particular department or business function.

c) Data swamp: When your data lake gets messy and unmanageable, it becomes a data swamp! Data swamps lack curation: there is little to no active management throughout the data life cycle, and little to no contextual metadata or data governance.
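
Going back to item b), one simple way to picture a data mart as a department-specific subset of a warehouse is a database view. The sketch below is only an analogy under stated assumptions: the table `warehouse_sales`, the view `marketing_mart`, and the rows are invented for illustration, and real data marts are often separate, purpose-built stores rather than views.

```python
import sqlite3

# In-memory SQLite database standing in for the enterprise warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (department TEXT, sku TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", "soap-01", 2.5),
        ("finance", "dish-07", 4.0),
        ("marketing", "baby-03", 6.0),
    ],
)

# The "mart" exposes only the slice that one department cares about.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT sku, amount FROM warehouse_sales WHERE department = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT sku, amount FROM marketing_mart ORDER BY sku"
).fetchall()
print(mart_rows)
```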

Conclusion: To each its own! Depending on the business problem you are trying to solve and the cost of building the solution, you need to make a calculated decision about which storage model suits you best. It is not a one-month or even a one-quarter journey; it takes years to establish a viable and systematic data “safe” (pun intended). Data that goes into databases and data warehouses needs to be cleansed and prepared before it gets stored. And with today’s unstructured data, that can be a long and arduous process when you’re not even completely sure that the data is going to be used.

Thanks,

Pradnya
