How data and schema interact with a data lake and data warehouse
If you have spent years building enterprise data warehouses, a data lake architecture may at first glance look much like a data warehouse. After all, both are repositories that store data for wide-ranging purposes, so the confusion is understandable. Some aspects really are similar: both have data storage systems, query engines, and user management. Other aspects look similar but work quite differently.
At the heart of the matter is how data warehouses and data lakes use their data catalogs. In a data warehouse architecture, the catalog controls how data is loaded. This is why warehouses are considered less flexible but faster: data is given a structure first and is then written in proprietary, optimized formats. In a data lake, the catalog defines where existing data can be found and in what format (typically an open source format). Data lakes have traditionally been thought of as flexible but slower, because data is written in any format and structured later.
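To make the contrast concrete, here is a minimal sketch of the two approaches in plain Python. It is an illustration only: `WAREHOUSE_SCHEMA`, `load_into_warehouse`, and `read_from_lake` are hypothetical names invented for this example, not part of any real warehouse or lake product.

```python
import json

# Hypothetical warehouse-style catalog: the schema is fixed before any load.
WAREHOUSE_SCHEMA = {"customer_id": int, "name": str}


def load_into_warehouse(record: dict) -> tuple:
    """Structure first, then write: reject anything that does not fit the schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"fields {sorted(record)} do not match the schema")
    # Coerce each field to its declared type, in schema order.
    return tuple(WAREHOUSE_SCHEMA[col](record[col]) for col in WAREHOUSE_SCHEMA)


def read_from_lake(raw_line: str, wanted: dict) -> dict:
    """Write first, structure later: the raw line is interpreted at query time."""
    data = json.loads(raw_line)
    return {col: cast(data[col]) for col, cast in wanted.items() if col in data}


# The warehouse path validates and types the record at load time.
warehouse_row = load_into_warehouse({"customer_id": "42", "name": "Ada"})

# The lake stored this line untouched, including a field nobody planned for.
lake_line = '{"customer_id": "42", "name": "Ada", "loyalty_tier": "gold"}'
lake_row = read_from_lake(lake_line, {"customer_id": int, "loyalty_tier": str})
```

In this sketch the warehouse path would refuse the record carrying `loyalty_tier` until the schema is changed, while the lake path stores it as-is and lets each reader decide which fields to pull out.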
Data Warehouse Architecture: based on a predefined and optimized schema
Common Data Warehouse Flow
An enterprise data warehouse defines clear structures for how enterprise-wide data is collected, organized, and queried. The core of the warehouse is the catalog, which includes the set of schemas for each table in the warehouse. As data is collected from various sources, it needs to be transformed into the structure that the warehouse expects for each data type.
For example, a retail data warehouse might specify the attributes for a customer record, including an identifier that is then included as an attribute in the transaction table. Users find information about customer transactions by running a query that joins these tables, and the data warehouse query engine is tuned to answer exactly those queries.
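The retail example can be sketched with Python's built-in `sqlite3` module standing in for a warehouse. The table names, column names, and sample rows below are assumptions made up for illustration; a real warehouse would have its own schema and a far more capable engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front, before any data is loaded.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id    INTEGER NOT NULL REFERENCES customers(customer_id),
        amount         REAL NOT NULL
    );
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 15.5), (12, 2, 40.0)])

# Join the two tables on the shared customer identifier.
totals = conn.execute("""
    SELECT c.name, SUM(t.amount)
    FROM customers AS c
    JOIN transactions AS t ON t.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
""").fetchall()

for name, total in totals:
    print(name, total)
```

The customer identifier appears in both tables, which is what lets the join line up each transaction with its customer.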
But if a group wants to load new data or change attributes for a data type, the changes need to be coordinated through updates to the central catalog before anything can be collected. Similarly, users are limited to querying data based on how it is structured in the catalog.
Data Lake Architecture: the catalog drives flexibility and agility
Data Lake Architecture Flow
By contrast, a data lake can accept any kind of data, in any format. Instead of routinely coordinating how data is stored, the catalog in a data lake helps users identify the types of data that get added to the data lake. This flexibility lets organizations run the lake as strictly or as loosely as the business demands.
A data lake catalog can be used to verify data as it is loaded and enforce the data types, or the catalog can be updated in response to data changes, mapping and combining old and new formats. As in a data warehouse, users might be required to use the central catalog as it’s defined, or they can update the catalog to mirror their own view of the data that’s been loaded.
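The two catalog behaviors described above, enforcing types on load or updating the catalog when the data changes, can be sketched as follows. This is a toy model under stated assumptions: the `catalog` dictionary, the `register` function, its `mode` argument, and the storage path are all hypothetical and do not correspond to any particular catalog service.

```python
# Hypothetical lake catalog entry: where the data lives, its format, its columns.
catalog = {
    "customers": {
        "location": "s3://example-bucket/customers/",  # placeholder path
        "format": "parquet",
        "columns": {"customer_id": "int", "name": "str"},
    }
}

# Only these Python types are recognized in this sketch.
TYPE_NAMES = {int: "int", float: "float", str: "str"}


def register(table: str, record: dict, mode: str) -> None:
    """mode='enforce' rejects unknown columns; mode='evolve' adds them to the catalog."""
    known = catalog[table]["columns"]
    incoming = {col: TYPE_NAMES[type(value)] for col, value in record.items()}

    # A type change on an existing column is rejected in both modes.
    for col, type_name in incoming.items():
        if col in known and known[col] != type_name:
            raise TypeError(f"{col}: expected {known[col]}, got {type_name}")

    new_cols = {col: t for col, t in incoming.items() if col not in known}
    if new_cols and mode == "enforce":
        raise ValueError(f"unknown columns: {sorted(new_cols)}")
    known.update(new_cols)  # in 'evolve' mode the catalog follows the data


new_record = {"customer_id": 7, "name": "Grace", "email": "grace@example.com"}

try:
    register("customers", new_record, mode="enforce")
except ValueError as err:
    print("rejected:", err)

# The same record is accepted once the catalog is allowed to evolve.
register("customers", new_record, mode="evolve")
```

Which mode to prefer is a governance decision: enforcing keeps every reader on one agreed view of the data, while evolving lets new attributes land without central coordination.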
Data lakes did not initially offer the same query performance as data warehouses, and they are still often combined with them. Modern data lakes, however, have query engines that are just as performant as data warehouses, especially when cost is taken into account, and they add the benefit of data flexibility that comes from the data lake catalog.
A different approach: data warehouse analytics capabilities, without the data warehouse
Even with a clearer idea of the difference between a data warehouse and a data lake, in practice many organizations run both. Examining the data for insights remains a challenge, as does duplicate data, which is not ideal for analytics or for good data governance. And although the data lake approach decoupled compute and storage, it also created data swamps and hurt the performance organizations need for competitive advantage and strategic innovation.
For a new take on modern data management, it’s worth considering live, interactive queries directly on your data lake storage. Starburst is the fastest data lake query engine. It works directly on data lake storage and gives business users easy, stable access to their data in open file formats, which reduces data management costs and shortens the time to insight for critical data-driven business decisions.