Apache Iceberg vs Delta Lake: What are the differences?

The cloud data lakehouse is gaining momentum, driven by the evolution of table formats like Apache Iceberg, Delta Lake, and Hudi. With improved transactional support, ACID compliance, and a host of additional features, data lakehouses look set to take center stage in replacing data warehouses and data lakes alike. But which table format do you choose? How do you compare Apache Iceberg vs Delta Lake?

To answer this question, you have to look at two forces: competition and convergence. Competition between data lakehouse table formats is driving convergence in features, which means that once-clear distinctions between these formats and the ecosystems that serve them are blurring. That makes the choice between Apache Iceberg and Delta Lake more complex than ever.

The emerging data lakehouse era 

For Starburst, supporting all modern table formats has long been a key focus. Both Starburst Galaxy and Starburst Enterprise support Apache Iceberg and Delta Lake, along with Hudi and Hive. For others, like Snowflake and Databricks, embracing Iceberg is a more recent phenomenon. In the case of Databricks, Unity Catalog brings Iceberg directly alongside Delta Lake, creating even more convergence.

This article focuses on comparing the two leading table formats:

  • Apache Iceberg
  • Delta Lake

By the end, you’ll understand the differences between these two table formats and be able to identify the best match for your organization’s ecosystem.

Apache Iceberg vs Delta Lake: Converging technologies

The first thing to note is that comparisons between Apache Iceberg and Delta Lake have shifted considerably in recent months. In the past, the two technologies stood further apart, with more meaningful differences in their feature offerings, but over time the two have converged. 

How did we get here? Let’s look at the two table formats side-by-side. 

What is Apache Iceberg?

Apache Iceberg was originally designed at Netflix in 2017 as a replacement for Hive data lakes, running on Trino clusters. It was engineered to handle large datasets that changed too frequently for Hive. To do this, Iceberg uses the same object storage that Hive used, but collects additional metadata about the tables stored there. This metadata management is the key to Iceberg’s architecture, and it allows for data warehouse-like functionality on top of cloud object storage. Since its origins at Netflix, Iceberg has been governed by the Apache Software Foundation and is used as an open source table format in many data lakehouse solutions.
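To make this concrete, here is a minimal sketch of what working with an Iceberg table looks like from PySpark. It assumes a Spark session launched with the Iceberg Spark runtime package available; the catalog name, warehouse path, and table names are illustrative assumptions, not anything prescribed by Iceberg.

```python
# Minimal Iceberg-on-object-storage sketch. Assumes the iceberg-spark-runtime
# package is on the Spark classpath; names and paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    # Register an Iceberg catalog called "demo" backed by a filesystem warehouse.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")

# The data files are ordinary Parquet in the warehouse path; Iceberg's
# metadata layer is what provides the warehouse-like table semantics.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM demo.db.events").show()
```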

What is Delta Lake?

Delta Lake was originally developed by Databricks in 2017 as their data lakehouse table format. In some ways, it can be considered the first data lakehouse. The reasons for its development and the problems it was designed to solve were very similar to those behind Iceberg, in that it prioritized:

  • Ability to update and delete records quickly
  • ACID compliance
  • Schema evolution
  • Time travel

Databricks open sourced Delta Lake and donated it to the Linux Foundation in 2019. Following this change, it is no longer a proprietary technology and is now compatible with a variety of data ecosystems, including Starburst.
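As a rough illustration of those four priorities, here is a hedged PySpark sketch using the open source delta-spark package. It assumes Spark is launched with the Delta Lake packages available; the path and schema are illustrative.

```python
# Delta Lake feature sketch. Assumes the delta-spark package is available;
# the table path and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# ACID write: this commit becomes version 0 of the table.
spark.range(100).write.format("delta").save("/tmp/users")

# Fast updates and deletes as single transactional statements.
spark.sql("DELETE FROM delta.`/tmp/users` WHERE id < 10")

# Schema evolution: append data with a new column by merging schemas.
spark.range(5).withColumn("flag", lit(True)) \
    .write.format("delta").mode("append") \
    .option("mergeSchema", "true").save("/tmp/users")

# Time travel: read the table as it was before the delete.
print(spark.read.format("delta").option("versionAsOf", 0).load("/tmp/users").count())  # 100
```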

Apache Iceberg vs Delta Lake

The table below compares Apache Iceberg and Delta Lake directly, feature by feature.

| Feature | Apache Iceberg | Delta Lake |
|---|---|---|
| Transaction support (ACID) | Yes | Yes |
| File format | Parquet, ORC, Avro | Parquet |
| Schema evolution | Full | Partial |
| Partition evolution | Yes | No |
| Merge on read | Yes | No |
| Data versioning | Yes | Yes |
| Time travel queries | Yes | Yes |
| Concurrency control | Optimistic locking | Optimistic locking |
| Object store cost optimization | Yes | Yes |
| Community and ecosystem | Apache Foundation & growing | Linux Foundation & growing |

Apache Iceberg vs Delta Lake: Different approaches, similar results

Although Apache Iceberg and Delta Lake are converging on a similar end point, they are not identical technologies. You can think of this as an example of convergent evolution: different approaches to problem solving that result in similar outcomes. Although both achieve efficient storage optimization, data consistency, and performance improvements, the underlying mechanisms they use to achieve this differ.  

Let’s look at the key differences in how Apache Iceberg and Delta Lake handle metadata. 

Manifest files

Iceberg collects metadata about the datasets stored inside it. This metadata is stored in manifest files. Each manifest file points to the data files that make up the table, along with the partition data needed to retrieve them efficiently. Together, these files create a historical record of the changes made to the data, stored as a tree structure of Avro files.

Because they contain a history of changes made to the state of the dataset, manifest files make all of Iceberg’s unique features possible. As the sketch after this list illustrates, manifest files allow Iceberg to handle:

  • Updates
  • Deletes
  • ACID transactions 
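One convenient way to see this history is through the metadata tables Iceberg exposes alongside each table in Spark. A sketch, reusing the illustrative demo.db.events table from earlier:

```python
# Inspect Iceberg's change history via its metadata tables
# (the table name reuses the illustrative example above).
spark.sql("""
    SELECT snapshot_id, parent_id, operation, committed_at
    FROM demo.db.events.snapshots
""").show()

# Each manifest lists the data files (and partition info) behind a snapshot.
spark.sql("""
    SELECT path, added_data_files_count, existing_data_files_count
    FROM demo.db.events.manifests
""").show()
```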

Delta Log

Delta Lake also collects enhanced metadata, just like Iceberg. However, instead of manifest files, it stores this information in a directory known as the Delta Log. The log is an ordered series of JSON commit files, periodically compacted into Parquet checkpoint files, which means the history is structured as a flat list of changes spread across many files. Each entry in the log tracks a corresponding change to the dataset.
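You can observe this structure directly: the JSON commit files sit in a _delta_log/ directory next to the data files, and Delta’s DESCRIBE HISTORY command reads the same record back as a table. A sketch, reusing the illustrative /tmp/users table from earlier:

```python
import os

# The transaction log sits beside the data files.
print(os.listdir("/tmp/users/_delta_log"))
# e.g. ['00000000000000000000.json', '00000000000000000001.json', ...]

# The same history, surfaced through SQL.
spark.sql("DESCRIBE HISTORY delta.`/tmp/users`") \
    .select("version", "operation", "timestamp").show()
```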

Iceberg Manifest files vs Delta Lake Delta Log

Although both Delta Lake and Apache Iceberg maintain a comprehensive history of changes, their approaches differ. Delta Lake uses a Delta Log with JSON files and periodic Parquet checkpoints to track changes. This can make historical data retrieval efficient, but the performance depends on checkpoint frequency. In contrast, Iceberg uses snapshots with manifest files listing the data files and partitions. Both approaches optimize historical data access, though their specific methods and efficiencies vary.
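In practice, both histories are queried in similar ways, though Iceberg addresses snapshots by ID while Delta uses sequential version numbers. A sketch, again reusing the illustrative tables from above:

```python
# Iceberg: time travel to a snapshot id taken from the snapshots metadata table.
snap = spark.sql(
    "SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at"
).first()[0]
spark.sql(f"SELECT * FROM demo.db.events FOR VERSION AS OF {snap}").show()

# Delta: time travel to a sequential version number via a reader option.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/users").show()
```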

Why your organization’s toolset is now the primary deciding factor when considering Apache Iceberg vs Delta Lake

Given the convergence of Iceberg and Delta Lake on features, the best way to compare the two technologies is on ecosystem and toolset integration. 

As a result, the choice of Apache Iceberg vs Delta Lake often comes down to choosing the data ecosystem that best fits your organization’s tech stack. There is no one-size-fits-all solution here. Instead, you should focus on a few key differentiators and determine where the needs of your project, organization, and other factors land.

Let’s look at a few situations and decide who wins in each case.

Spark integration 

Delta Lake integrates deeply with Databricks and Spark, making it a strong candidate for organizations heavily invested in those technologies. Because these technologies have been developed within the same ecosystem, they enjoy tight, native integration.  

Winner: Delta Lake

Compute engine integration

Apache Iceberg and Delta Lake each integrate with a different set of compute engines. Like other ecosystem integrations, this choice depends on your toolset and objectives. If you are a Spark user, Delta Lake may be the better choice; if you are building on an open data stack, Apache Iceberg is probably still best. In either case, you should compare your actual toolset, including your compute engine, against each table format to see the differences.

Winner: Depends on ecosystem 

Cloud integration and optimization 

Your choice of cloud provider has a direct impact on a table format’s performance. Importantly, there is no clear winner here. Instead, different ecosystems support different table formats and compute engines differently. To get optimal results, you will need to compare Apache Iceberg vs Delta Lake on the cloud platform you intend to use, whether that is AWS, Azure, or GCP. We recommend extensive testing and benchmarking to ensure that the system you use is properly optimized for the cloud. 

Winner: Depends on ecosystem

Internal tables vs external table support

Different compute engines integrate differently with Iceberg vs Delta Lake depending on the ecosystem being used. This means that some engines use internal tables, whereas others rely on external tables. When comparing the two, this difference is consequential, as it may create different performance outcomes. 

Winner: Depends on ecosystem

Multiple Engine Support 

Apache Iceberg is designed to be engine-agnostic and works well with various query engines such as Starburst, Trino, and AWS Athena. This flexibility makes Iceberg particularly attractive for environments that use multiple processing engines or have a mixed analytics strategy.
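As an illustration, the same Iceberg table can be queried from Trino with no Spark in the picture at all. A sketch using the trino Python client; the host, user, and table names are assumptions:

```python
import trino  # pip install trino

# Connect to a Trino (or Starburst) coordinator; coordinates are illustrative.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="db",
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM events")
print(cur.fetchone()[0])
```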

Delta Lake also supports other engines, though the integration is more beneficial for Spark users. 

Winner: Apache Iceberg

Catalog choices

Selecting a table format and compute engine is only part of the job; you also need a catalog. Catalog support used to be a clear win for Apache Iceberg, and it is still the default open choice. However, with recent changes to Unity Catalog, Delta Lake is more open than ever. Unity Catalog’s goal is to improve interoperability across the data ecosystem. It does this by providing a centralized governance layer for managing metadata, data access, and security across multiple data sources, including Delta Lake and Iceberg.

By unifying data access policies across different tools and engines, Unity helps streamline the management of permissions, which can make it easier for external engines to interact with Delta Lake data securely. While this approach enhances how external engines access Delta Lake data, the deep, native functionality remains strongest with Databricks and Spark. 

Winner: Depends on ecosystem

Why an Icehouse Architecture benefits most organizations when comparing Apache Iceberg vs Delta Lake

So where does this comparison of Apache Iceberg vs Delta Lake lead? For organizations looking to take the leap into a data lakehouse, there have never been more options available, and the two leading table formats have never been so close together in the features they offer. This means that you have one important thing: options.

Compare all your options

What will you do with those options? In general, I’d suggest looking holistically at Apache Iceberg vs Delta Lake and determining whether your data ecosystem benefits one or the other based on toolset, cloud integrations, or compute engines. 

Treat it like an experiment. Run benchmarks. Review data. Repeat until you have clarity. 

Iceberg is likely to continue leading on openness

What do you do if the results don’t show a clear difference? In this case, you could use either. But in this scenario, I think there’s a case to be made for choosing Apache Iceberg, and specifically an Icehouse architecture based on Trino. 

There are a few reasons for this, and most come down to history. Delta Lake is currently becoming a lot more like Iceberg on openness, and this is something to be welcomed and applauded, but Iceberg was there first. This means that there is a clear track record of the Apache Iceberg project itself pushing forward the possibilities of the open data stack, and there’s every reason to believe that this will continue. Databricks and Snowflake have both made pivots towards Apache Iceberg, so the momentum is there on all sides. 

Starburst Galaxy supports both Apache Iceberg and Delta Lake

For our part, Starburst Galaxy and Starburst Enterprise have supported both table formats for a long time, but if we had to choose, we still think Apache Iceberg is the right choice for most people. And that’s very much the point. Once upon a time, these table formats were quite different; now, the differences are far less pronounced. But at the end of the day, you still need to choose a table format, and all signs in the industry point to momentum, openness, and convergence around the paradigm that Apache Iceberg started. That’s reason enough to choose Apache Iceberg over Delta Lake.

Icehouse architecture suits most users

If you choose Iceberg, then Trino is a clear next step. Iceberg was first designed to run on Trino clusters at Netflix, and that integration carries on. If you’re going to be using Trino, and don’t have the resources to spin up a large deployment on your own, then Starburst is the next logical choice, offering a managed platform. And that’s really the benefit of the Icehouse architecture, an easy, open data lakehouse approach that lets you benefit from all that this new era promises. 

Want to learn more about Icehouse architecture? We have a variety of resources to help you learn, test, and employ this approach using your own Starburst Galaxy cluster.

What is an Icehouse?

An Icehouse is a data lakehouse built on top of Apache Iceberg and Trino.
