What is the value of a data lakehouse?

Lakehouses offer significant performance gains over traditional data lakes, along with many additional features.

Strategy
December 1, 2023

Evan Smith
Technical Content Manager
Starburst Data

In this post, we look at the advantages of data lakehouses in four areas: (1) performance, (2) cost, (3) flexibility, and (4) compliance.

Increased performance

Benefits without sacrifice

Importantly, these benefits come without a trade-off. Organizations that adopt a lakehouse, or modern data lake, keep the same cloud-based object storage or HDFS as before, and gain significant performance improvements and additional features on top of it.

Open table formats

How does this happen? Lakehouses employ a more modern architecture, which enables organizations to rebuild their workflows to be more efficient. Often, this lets them move away from older technologies altogether, particularly Hive. Although Hive was advanced for its time, modern open table formats like Iceberg and Delta Lake offer better performance and a host of features that Hive does not.

More features, better performance

The two differences, added features and better performance, work together. Often, a lakehouse table format performs better because its newer features allow workloads to be processed in more efficient ways.

For instance, Iceberg and Delta Lake allow users to insert, update, or delete data on a record-by-record basis. This ensures that only the necessary changes are made, which makes workflows more efficient. To achieve the same results with Hive, changes would often have to be made at the partition level.

This is a good example of Hive’s architecture creating performance drawbacks that more modern lakehouse architecture solves.
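The difference between partition-level and record-level changes can be illustrated with a small toy model. The sketch below is not how Hive, Iceberg, or Delta Lake actually write data; the function names, partition keys, and table sizes are all hypothetical, and the model simply counts how many rows have to be rewritten to change one record under each approach.

```python
# Toy model: count how many rows must be rewritten to change one record.
# This is NOT the real write path of Hive, Iceberg, or Delta Lake;
# all names and sizes here are hypothetical.

def make_table(partitions, rows_per_partition):
    """Build {partition_key: [row, ...]} with unique integer ids."""
    table = {}
    next_id = 0
    for p in range(partitions):
        rows = []
        for _ in range(rows_per_partition):
            rows.append({"id": next_id, "status": "active"})
            next_id += 1
        table[f"day={p}"] = rows
    return table


def partition_level_update(table, record_id, new_status):
    """Partition-level change: rebuild the whole partition holding the record."""
    for key, rows in table.items():
        if any(r["id"] == record_id for r in rows):
            table[key] = [
                {**r, "status": new_status} if r["id"] == record_id else dict(r)
                for r in rows
            ]
            return len(rows)  # every row in the partition is rewritten
    return 0


def record_level_update(table, record_id, new_status):
    """Record-level change: touch only the matching record."""
    for rows in table.values():
        for r in rows:
            if r["id"] == record_id:
                r["status"] = new_status
                return 1  # only one row is written
    return 0


hive_like = make_table(partitions=3, rows_per_partition=1000)
lakehouse_like = make_table(partitions=3, rows_per_partition=1000)

rows_rewritten_partition = partition_level_update(hive_like, 1500, "closed")
rows_rewritten_record = record_level_update(lakehouse_like, 1500, "closed")
print(rows_rewritten_partition, rows_rewritten_record)  # 1000 1
```

Both tables end up in the same state, but the partition-level path rewrites every row in the affected partition to get there.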

Reduced costs

Lakehouses allow businesses to reduce costs and improve query performance simultaneously. This is achieved in a number of ways.

Hive to Iceberg and other table formats

Typically, users migrating to Iceberg or Delta Lake will be moving from Hive. Hive’s architecture is outdated and lacks features such as record-level updates. This slows performance, which hurts productivity and also increases spending on cloud resources.

Slow queries increase costs

Longer query times mean more money spent on compute resources. In this way, modern lakehouse architecture not only increases efficiency but also decreases costs.
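The link between query time and cost can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes a simple pricing model in which compute is billed per node-hour; the runtimes, node count, and price are hypothetical figures chosen only for illustration.

```python
# Back-of-the-envelope compute cost for a single query run.
# Assumes billing per node-hour; every figure below is hypothetical.

def query_cost(runtime_minutes, nodes, price_per_node_hour):
    """Cost of one query run: node-hours consumed times the hourly price."""
    return (runtime_minutes / 60) * nodes * price_per_node_hour


slow_cost = query_cost(runtime_minutes=30, nodes=10, price_per_node_hour=2.0)
fast_cost = query_cost(runtime_minutes=12, nodes=10, price_per_node_hour=2.0)
savings = slow_cost - fast_cost

print(f"slow: ${slow_cost:.2f}, fast: ${fast_cost:.2f}, saved per run: ${savings:.2f}")
```

Under this model, cost scales linearly with runtime, so any reduction in query time carries through directly to the compute bill.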

Greater flexibility

Lakehouses are more flexible. Whereas traditional data lakes built on cloud object storage offer only limited abilities to update or delete records, lakehouses provide full create, read, update, and delete (CRUD) capabilities. The result is a more database-like experience on the same cloud object storage infrastructure, enabling scenarios that are either impossible or impractical in a data lake.

Advantages include:

Improved row-level updates
ACID compliance
Enhanced support for transactional systems

Meeting compliance requirements

Data lakehouses offer better governance and compliance when compared to traditional data lakes. There are a number of reasons for this.

Overcoming immutability problems

Traditional data lakes are built on cloud object storage, in which stored objects are typically immutable. This means that records cannot easily be updated or deleted, which can become a governance issue, as many jurisdictions require the ability to delete data on request.

Complying with data protection legislation

This can put data lakes in a difficult position when attempting to comply with certain legal requirements. These include the General Data Protection Regulation (GDPR) in the European Union (EU) and the California Consumer Privacy Act (CCPA) in California.

Leveraging record-level details

Data lakehouses solve this issue by including metadata transaction logs and snapshot files detailing all of the changes made to a table. With this log in place, record-level deletions become practical, along with the ability to roll back a table to a previous state or query it as of a particular point in time. All of these features give data lakehouses governance control over the data inside them, helping organizations remain compliant with both GDPR and CCPA.
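To make the idea of a snapshot log more tangible, here is a minimal in-memory sketch. It is a toy model rather than the actual metadata layout of Iceberg or Delta Lake, and the class and method names (SnapshotTable, as_of, rollback_to) are hypothetical. It shows how keeping a log of table states makes record-level deletes, point-in-time reads, and rollbacks straightforward.

```python
# Toy in-memory table with a snapshot log.
# Not a real table-format implementation; all names are hypothetical.

class SnapshotTable:
    def __init__(self):
        # Each snapshot is a full view of the table: {record_id: record}.
        # Snapshot 0 is the empty table.
        self.snapshots = [{}]

    def current(self):
        """Return the latest snapshot."""
        return self.snapshots[-1]

    def _commit(self, new_state):
        """Append a new snapshot and return its id (its index in the log)."""
        self.snapshots.append(new_state)
        return len(self.snapshots) - 1

    def upsert(self, record_id, record):
        """Insert or update one record in a new snapshot."""
        state = dict(self.current())
        state[record_id] = record
        return self._commit(state)

    def delete(self, record_id):
        """Record-level delete: the new snapshot omits only this record."""
        state = {k: v for k, v in self.current().items() if k != record_id}
        return self._commit(state)

    def as_of(self, snapshot_id):
        """Point-in-time read: the table as it was at an earlier snapshot."""
        return self.snapshots[snapshot_id]

    def rollback_to(self, snapshot_id):
        """Rollback: commit a copy of an earlier snapshot as the current state."""
        return self._commit(dict(self.snapshots[snapshot_id]))


table = SnapshotTable()
s1 = table.upsert("user-1", {"email": "a@example.com"})
s2 = table.upsert("user-2", {"email": "b@example.com"})
s3 = table.delete("user-1")          # record-level deletion

after_delete = sorted(table.current())   # ['user-2']
before_delete = sorted(table.as_of(s2))  # ['user-1', 'user-2']

s4 = table.rollback_to(s1)           # restore the state after the first write
after_rollback = sorted(table.current())  # ['user-1']
```

In this toy, older snapshots still contain the deleted record, so an erasure workflow built on a similar design would also need to expire those snapshots before the data is truly gone.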