Operational AI WIKI


Data Lake

A data lake is a large storage repository that holds vast amounts of raw (not yet cleansed) data in its original format, from a wide variety of sources, until it is needed. Data stored in a data lake typically does not yet have a defined purpose. Data lakes are a playground for data scientists and address the biggest limitations of data warehouses, namely limited flexibility and scalability, because the data does not have to fit a predefined schema.

Data lakes are storage structures that impose no hierarchy or structure on the data they hold. A data lake can ingest data from disparate sources and holds it in its native format, whatever the source or type (structured, semi-structured, or unstructured), until it is ready for use. Data lakes were built on the premise of aggregating data in one central location to avoid data silos.
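To make the idea of native-format storage concrete, here is a minimal sketch in Python. It assumes a local temporary directory stands in for the lake; the `ingest` and `read_records` helpers, the source names, and the sample payloads are all hypothetical and chosen only for illustration. Files are written byte-for-byte with no schema check, and structure is applied only when the data is read.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def ingest(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store the payload unchanged under a per-source folder; no schema is enforced."""
    target = lake_root / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_records(path: Path) -> list[dict]:
    """Apply structure only at read time, based on the file type."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        # Semi-structured: one JSON object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if path.suffix == ".csv":
        # Structured: header row plus data rows.
        return list(csv.DictReader(io.StringIO(text)))
    # Unstructured: returned as a single free-text record.
    return [{"text": text}]


# A temporary directory plays the role of the lake (an assumption for this sketch).
lake = Path(tempfile.mkdtemp(prefix="lake_"))
ingest(lake, "crm", "customers.csv", b"id,name\n1,Ada\n2,Grace\n")
ingest(lake, "web", "events.json", b'{"user": 1, "action": "login"}\n')
ingest(lake, "support", "ticket.txt", b"Customer cannot reset password.")

records = {p.name: read_records(p) for p in sorted(lake.rglob("*")) if p.is_file()}
```

In a production setting the directory would typically be an object store or distributed file system, but the principle is the same: ingestion never rejects data for its shape, and interpretation is deferred to the reader.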

A data lake is an effective solution for companies that need to collect and store a lot of data but do not need to process and analyze it right away. Because data lakes accept data in any format, they are a great tool for aggregating data. However, this also means that data lakes fill up with large volumes of many different data types, which results in poor direct query performance compared to other solutions. To create measurable value, another analytics infrastructure tool or a performant layer on top of the data lake is almost always needed. Additionally, traditional data lakes often lack data governance and security controls.

To work around these limitations, users often end up extracting subsets of data from the data lake and replicating them in a data warehouse. This process typically requires IT assistance, lengthens the time to insight, adds cost, and, in the end, undermines the benefit the data lake was intended to bring to the business.

Data lakes and data warehouses complement each other in a data workflow. Ingested company data is stored immediately in a data lake. When a specific business question comes up, the portion of the data deemed relevant is extracted from the lake, cleaned, and exported into a data warehouse for analytics use cases and business decisions.
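The extract, clean, and export workflow described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: an in-memory SQLite database stands in for the data warehouse, a list of JSON lines stands in for raw files in the lake, and the `orders` table, field names, and cleaning rules are hypothetical.

```python
import json
import sqlite3

# Raw records as they might sit in the lake: mixed event types,
# inconsistent casing, numbers stored as strings, missing values.
raw_lake_lines = [
    '{"order_id": 1, "region": " EU ", "amount": "19.90"}',
    '{"order_id": 2, "region": "us", "amount": "5.00"}',
    '{"order_id": 3, "region": "eu", "amount": null}',
    '{"page": "/home", "user": 7}',
]


def extract_orders(lines):
    """Keep only the records relevant to the business question (orders)."""
    for line in lines:
        record = json.loads(line)
        if "order_id" in record:
            yield record


def clean(record):
    """Normalize types and values; return None for rows that cannot be repaired."""
    if record.get("amount") is None:
        return None
    return (
        int(record["order_id"]),
        record["region"].strip().upper(),
        float(record["amount"]),
    )


# The warehouse enforces a fixed schema, unlike the lake.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
)

rows = [row for row in map(clean, extract_orders(raw_lake_lines)) if row is not None]
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# The business question: revenue per region.
revenue_by_region = dict(
    warehouse.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
)
```

The page-view record and the order with a missing amount never reach the warehouse, which shows the division of labor: the lake keeps everything as it arrived, while the warehouse holds only the cleaned subset needed to answer the question.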