A data lakehouse as a hybrid solution
Data lakehouses, also known as modern data lakes, are a hybrid solution that combines the functionality of data lakes, data warehouses, and databases into a single technology.
This is highly appealing for organizations that employ multiple solutions for different use-cases, as they can often replace multiple systems with a single lakehouse. At the same time, a lakehouse is not a monolithic, single source of truth and works well with other technologies using Starburst Galaxy. This extended functionality combined with the option to mix and match with other technologies as needed presents many upsides.
The sections below cover databases, data warehouses, and traditional data lakes in turn.
Databases Definition
Databases represent the traditional foundation of data technology. Although they can take many forms, all databases facilitate the structured collection of data. This data is organized and stored in a systematic way to allow for the efficient retrieval, management, and manipulation of information.
Database technologies are often categorized by their functionality, purpose, and capabilities, including the database types listed below.
Database Management Systems (DBMS)
DBMS represent the backbone of many data solutions. They are typically used to perform Create, Read, Update and Delete (CRUD) operations in support of numerous business applications.
Online Transactional Processing (OLTP)
OLTP systems are designed for transactional data, such as records of financial transactions and other event logs.
ACID compliance
ACID stands for Atomicity, Consistency, Isolation, and Durability. It is a set of design standards that guarantee the reliable processing of transactions.
In ACID compliant systems, transactions either complete fully or fail fully. This ensures the accuracy and consistency of critical data.
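The all-or-nothing behavior described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production pattern: the `accounts` table, the `transfer` helper, and the sample balances are assumptions made for the example.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
# The CHECK constraint forbids negative balances.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move funds in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


print(transfer(conn, "alice", "bob", 30))   # True: both updates are committed
print(transfer(conn, "alice", "bob", 500))  # False: the credit to bob is rolled back
print(balances(conn))                       # {'alice': 70, 'bob': 80}
```

The second transfer credits bob first and then fails when alice's balance would go negative. Because both statements run inside one transaction, the earlier credit is undone as well.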
Data Warehouse Definition
Data warehouses are a traditional data technology built around a centralized, structured, and integrated repository designed specifically for querying, reporting, and analysis. Originally, all data warehouses were hosted on-prem. In recent years, a shift towards cloud data warehouses has begun. However, whether on-prem or cloud, data warehouses share many of the same advantages and disadvantages.
Structured data
Data warehouses require structured data to operate. Importantly, the data must conform to a schema before it enters the warehouse. This approach is often known as schema on write.
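As a rough sketch of schema on write, the toy table below validates each record against a schema before storing it. The class name `ToyWarehouseTable`, the `ORDERS_SCHEMA` columns, and the sample records are all hypothetical. Real warehouses enforce schemas through their own DDL and load mechanisms.

```python
# Hypothetical schema for an "orders" table: column name -> required Python type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "amount": float}


class ToyWarehouseTable:
    """Toy table that enforces its schema at write time (schema on write)."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        # Reject records with missing or unexpected columns.
        if set(record) != set(self.schema):
            raise ValueError(
                f"columns {sorted(record)} do not match schema {sorted(self.schema)}"
            )
        # Reject records whose values have the wrong type.
        for column, expected_type in self.schema.items():
            if not isinstance(record[column], expected_type):
                raise TypeError(f"{column!r} must be {expected_type.__name__}")
        self.rows.append(dict(record))


orders = ToyWarehouseTable(ORDERS_SCHEMA)
orders.write({"order_id": 1, "customer": "ana", "amount": 19.99})  # accepted

try:
    # order_id is a string here, so the record never enters the table.
    orders.write({"order_id": "2", "customer": "raj", "amount": 5.0})
except TypeError as err:
    print("rejected:", err)
```

The point of the sketch is where the check happens: invalid data is stopped at the door, so everything already in the table is known to match the schema.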
High performance
For the analysis of structured data, data warehouses are highly performant. Because of this, for certain types of workloads, they remain the default solution for many business applications.
Not ideal for unstructured data
Because data warehouses require data to be structured before it enters the warehouse, they are not ideal for unstructured data. A data lake is far superior for unstructured and semi-structured data.
High cost
Data warehouses are a resource-intensive approach to data analysis, both in terms of institutional and technological resources. This often results in high costs compared to alternatives, like data lakes.
Data Lake Definition
A data lake is a modern storage technology designed to house large amounts of data in a raw state for analysis. Data lakes are often used in Machine Learning (ML) and Artificial Intelligence (AI) applications. Unlike in a data warehouse, data can be structured, semi-structured, or unstructured when it enters the lake. Transformation is performed when the data is used, following a schema on read process. For this reason, data lakes excel at the analysis of unstructured data, often at a fraction of the cost of traditional data warehouses.
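Schema on read can be sketched in a few lines of Python. In this illustration, raw lines are stored as they arrive, and a schema is applied only when a reader asks for the data. The `RAW_EVENTS` sample and the `read_with_schema` helper are assumptions made for the example, not part of any particular data lake product.

```python
import json

# Raw data lands in the lake unchanged, whatever its shape.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "device": "mobile"}',
    '{"user": "raj", "amount": 5}',
    "not json at all",
]


def read_with_schema(raw_lines, schema):
    """Yield rows shaped by `schema`, which is applied only at read time."""
    for line in raw_lines:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw line stays in storage; this reader skips it
        if not isinstance(doc, dict):
            continue
        row = {}
        for column, cast in schema.items():
            value = doc.get(column)
            # Missing fields become None instead of blocking the read.
            row[column] = cast(value) if value is not None else None
        yield row


# Two readers can apply two different schemas to the same raw data.
sales = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
devices = list(read_with_schema(RAW_EVENTS, {"user": str, "device": str}))
print(sales)
print(devices)
```

Nothing is rejected at write time here. The cost of interpreting and cleaning the data is paid by each reader, which is what makes the approach flexible for mixed or evolving data.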
Multiple data formats
Data lakes are capable of storing multiple data formats without requiring them to be pre-structured or schematized in advance. This includes data types that are:
- Structured
- Semi-structured
- Unstructured
Inexpensive cloud object storage
The underlying technology used in most data lakes is very inexpensive compared to other solutions. This makes a lake particularly suitable for storing large amounts of raw data.
Optimized for Machine Learning (ML) and AI
Data lakes store a large amount of raw data. This makes them a natural data source for Machine Learning frameworks, which are often driven by models built using Python, Scala, or R. They are also a key technology used in the training of emerging Artificial Intelligence (AI) models.
Main factors that drive data lake adoption
There are a number of factors driving data lake adoption in recent years. Many of these stem from the inherent advantages of the technology itself, but some also indicate its positive impact at the organizational level.
Ultimately, a combination of scalability, cost, and flexibility is driving data lake adoption at many organizations. Let’s look at these factors in a bit more depth.
Scalability
Data lakes provide businesses with an inexpensive, scalable storage system capable of ingesting multiple types of data in a raw format. Importantly, they also separate compute and storage, allowing each to be increased or decreased independently. In this way, lakes allow for targeted resource allocation responsive to changing needs.
For example, if an organization needs additional storage, it can purchase this quickly and efficiently without also having to purchase compute resources that may not be needed. Compute and storage are each scaled individually, in real time, saving significant time, effort, and money.
Cost
The cost of cloud object storage is also a major factor in the adoption of data lakes. Compared to the alternatives, cloud object storage is very inexpensive. At the same time, as data lakes have matured, their performance has improved to the point where it can approach that of traditional data warehouses for many workloads.
A system that is scalable, inexpensive, and efficient at the same time is a clear win for many organizations.
The difference between a data lake and data lakehouse
In recent years, data lakes have matured to the point where modern data lakes include many features that earlier iterations lacked. These modern data lakes are defined by open table formats like Iceberg, Delta Lake, or Hudi. They include improved transactional support with ACID guarantees, and features like schema evolution, partition evolution, and time travel. This has further closed the gap between data lakes, data warehouses, and databases. This convergence has given rise to the term data lakehouse to describe modern data lakes, a name that reflects their mixture of data lake and data warehouse characteristics.
Data lakehouses include several key features that differentiate them from traditional data lakes. These include:
Enhanced governance layer and metadata collection
Compared to a traditional data lake, a data lakehouse handles data governance in a superior way. This is the result of a more advanced approach to metadata collection, which keeps an accurate record of the state of a table, and of the changes made to it, at any given time. This is the key feature that allows for enhanced capabilities and for the adoption of functionality that blurs the distinction between a data lake, data warehouse, and database.
Traditional data lake
Data lakes, especially those using cloud object storage, typically make use of an immutable storage layer. This might be built using Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage, or another similar system.
In these systems, data can be read and written, but individual records typically cannot be updated or deleted in place. This problem often impacts Hive users, who might believe that user data has been deleted when, in fact, it has persisted.
Modern table formats
All of the features and benefits of modern data lakes and data lakehouses are made possible through the adoption of one of the modern lakehouse table formats:
- Iceberg
- Delta Lake
- Hudi
These formats add a layer of metadata on top of the tables in a data lakehouse, which enables data warehouse-style features.
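To make the idea of a metadata layer more concrete, here is a deliberately simplified toy model in Python. It is not the actual design or API of Iceberg, Delta Lake, or Hudi. The `ToyTable` class and its method names are hypothetical, and the sketch only shows how a list of snapshots over immutable data files can support features such as time travel.

```python
class ToyTable:
    """Toy metadata layer: each snapshot lists the immutable files in the table."""

    def __init__(self):
        self.files = {}        # file name -> tuple of rows, never modified once written
        self.snapshots = [()]  # snapshot id -> tuple of file names; snapshot 0 is empty

    def append(self, rows):
        """Write a new data file and commit a new snapshot that includes it."""
        name = f"data-{len(self.files)}"
        self.files[name] = tuple(rows)
        self.snapshots.append(self.snapshots[-1] + (name,))
        return len(self.snapshots) - 1  # id of the new snapshot

    def scan(self, snapshot_id=None):
        """Read the table as of a snapshot; defaults to the latest one."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return [
            row
            for name in self.snapshots[snapshot_id]
            for row in self.files[name]
        ]


table = ToyTable()
first = table.append([{"id": 1}, {"id": 2}])
second = table.append([{"id": 3}])

print(table.scan())       # latest state: all three rows
print(table.scan(first))  # time travel: the table as of the first commit
```

Because data files are never changed, every earlier snapshot remains readable. The table's current state is simply whichever file list the newest snapshot points to.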
Use of modern file formats
It is worth noting that data lakehouses also make use of modern file formats in the way that you would expect from a data lake. These include popular, modern file formats such as:
- Parquet
- ORC
- Avro
Improvements with a lakehouse
In a data lakehouse, enhancements to the governance layer improve the collection of metadata, allowing data to be fully updated and deleted as needed.
This opens up new use-cases previously closed to traditional data lakes. For example, it allows regulatory requirements, such as GDPR, to be fully implemented using data lake technology. GDPR compliance often requires the ability to delete user data upon request.
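One common way to delete rows on top of immutable storage is copy-on-write: affected files are rewritten without the unwanted rows, and the table metadata is pointed at the rewritten files. The Python sketch below illustrates that idea only. The `delete_user` and `expire` helpers, the file names, and the `user_id` column are hypothetical, and real table formats provide their own delete and cleanup mechanisms.

```python
def delete_user(files, snapshot, user_id):
    """Return (new_files, new_snapshot) with the user's rows removed.

    `files` maps file name -> list of row dicts; `snapshot` is the list of
    file names that make up the current table state.
    """
    new_files = dict(files)
    new_snapshot = []
    for name in snapshot:
        rows = files[name]
        kept = [row for row in rows if row["user_id"] != user_id]
        if len(kept) == len(rows):
            new_snapshot.append(name)  # untouched file is reused as-is
        elif kept:
            rewritten = name + ".rewritten"
            new_files[rewritten] = kept  # new file without the user's rows
            new_snapshot.append(rewritten)
        # A file holding only this user's rows is dropped from the snapshot.
    return new_files, new_snapshot


def expire(files, retained_snapshots):
    """Physically remove files that no retained snapshot references."""
    referenced = {name for snap in retained_snapshots for name in snap}
    return {name: rows for name, rows in files.items() if name in referenced}


files = {
    "f1": [{"user_id": 1, "v": "a"}, {"user_id": 2, "v": "b"}],
    "f2": [{"user_id": 2, "v": "c"}],
    "f3": [{"user_id": 1, "v": "d"}],
}
snapshot = ["f1", "f2", "f3"]

files_after, snapshot_after = delete_user(files, snapshot, user_id=1)
files_clean = expire(files_after, [snapshot_after])
print(snapshot_after)       # ['f1.rewritten', 'f2']
print(sorted(files_clean))  # ['f1.rewritten', 'f2']
```

Note the two steps. After `delete_user`, the old files still exist and are only unreferenced by the new snapshot. The data is physically gone only after old snapshots are expired, which is the step that matters for a deletion request.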
Beyond Hive
Understanding data lakehouses comes down to understanding their contrast with traditional, Hive-based data lakes. In fact, you can think of the lakehouse as a set of technologies specifically designed to overcome the limitations of Hive.
The video below talks about this evolution and how the dividing line between traditional data lake and modern data lake comes down to table format.
Data lakehouse as an evolution of the data lake, not a replacement
Overall, you should think of the data lakehouse as an evolution of the data lake, not a replacement.
Because of this, adoption of a data lakehouse can be gradual. This is not a technology where users have to rethink everything that they are doing, or do away with established processes. Instead, a lakehouse can fit the natural evolution of an organization’s data usage pattern, while providing many enhancements along the way.