Apache Iceberg vs Delta Lake: What are the differences?

Why competition and convergence make your data ecosystem more important than ever

Strategy

September 16, 2024

Evan Smith

Technical Content Manager

Starburst Data


The cloud data lakehouse is gaining momentum, driven by the evolution of table formats like Apache Iceberg, Delta Lake, and Hudi. With improved transactional support, ACID compliance, and a range of additional features, data lakehouses appear poised to take center stage in replacing both data warehouses and data lakes. But which table format do you choose? How do you compare Apache Iceberg vs Delta Lake?

To answer this question, you have to look at two factors: competition and convergence. In this sense, competition between data lakehouse table formats is driving convergence in features. This means that once-clear distinctions between these formats and the ecosystems that serve them are blurring, making the choice between Apache Iceberg and Delta Lake more complex than ever.

The emerging data lakehouse era

For Starburst, supporting all modern table formats has been a key focus for a long time. Both Starburst Galaxy and Starburst Enterprise support Apache Iceberg and Delta Lake, along with Hudi and Hive. For others, like Snowflake and Databricks, embracing Iceberg is a more recent development. In the case of Databricks, this brings Iceberg directly alongside Delta Lake using Unity Catalog, creating even more convergence.

This article will focus on comparing the seemingly incomparable, the two leading table formats:

Apache Iceberg
Delta Lake

By the end, you’ll understand the difference between these two table formats and identify the best match with your organization’s ecosystem.

Apache Iceberg vs Delta Lake: Converging technologies

The first thing to note is that comparisons between Apache Iceberg and Delta Lake have shifted considerably in recent months. In the past, the two technologies stood further apart, with more meaningful differences in their feature offerings, but over time, the two have converged.

How did we get here? Let’s look at the two table formats side-by-side.

What is Apache Iceberg?

Apache Iceberg was initially designed at Netflix in 2017 as a replacement for Hive data lakes. From its earliest days, Iceberg ran on Trino clusters. It was engineered to handle larger datasets that changed too frequently for Hive to manage. To achieve this, Iceberg utilizes the same object storage that Hive uses, but collects additional metadata. Iceberg’s metadata management is the key to its architecture, allowing for data warehouse-like functionality using cloud object storage. Netflix later donated Iceberg to the Apache Software Foundation, which governs the project today, and it is used as an open-source table format in many data lakehouse solutions.

What is Delta Lake?

Delta Lake was originally developed by Databricks in 2017 as its data lakehouse table format. In some ways, it can be considered the first data lakehouse. The reasons for its development and the problems it was designed to solve were very similar to those behind Iceberg, in that it prioritized:

Ability to update and delete records quickly
ACID compliance
Schema Evolution
Time Travel

Databricks open-sourced Delta Lake in 2019 under the Apache 2.0 license, and the project is now hosted by the Linux Foundation. Following this change, it is no longer a proprietary technology and is now compatible with a variety of data ecosystems, including Starburst.

Apache Iceberg vs Delta Lake

The table below outlines how Apache Iceberg and Delta Lake features compare.

Feature	Apache Iceberg	Delta Lake
Transaction support (ACID)	Yes	Yes
File format	Parquet, ORC, Avro	Parquet
Schema evolution	Full	Partial
Partition evolution	Yes	No
Merge on read	Yes	No
Data versioning	Yes	Yes
Time travel queries	Yes	Yes
Concurrency control	Optimistic concurrency	Optimistic concurrency
Object store cost optimization	Yes	Yes
Community and ecosystem	Apache Software Foundation, growing	Linux Foundation, growing
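
The "Concurrency control" row can be made more concrete with a small sketch. This is a toy model only, assuming a hypothetical in-memory `ToyTable` class and `CommitConflict` exception; neither is part of Iceberg or Delta Lake. It shows the general optimistic pattern both formats rely on: writers do not take locks up front, and a commit succeeds only if the table has not changed since the writer read it.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    """Hypothetical in-memory table; not Iceberg's or Delta Lake's real API."""

    def __init__(self):
        self.version = 0
        self.files = []

    def commit(self, based_on_version, new_files):
        # The commit only succeeds if nobody else has committed since we read.
        if based_on_version != self.version:
            raise CommitConflict(
                f"expected version {based_on_version}, found {self.version}"
            )
        self.files = self.files + list(new_files)
        self.version += 1
        return self.version


table = ToyTable()

# Two writers read the same starting version without taking a lock.
base_a = table.version
base_b = table.version

table.commit(base_a, ["a.parquet"])        # writer A wins the race

try:
    table.commit(base_b, ["b.parquet"])    # writer B is now out of date
except CommitConflict:
    # Writer B re-reads the latest version and retries its commit.
    table.commit(table.version, ["b.parquet"])

print(table.version, table.files)  # 2 ['a.parquet', 'b.parquet']
```

Real engines also check whether the two writes actually touch conflicting data before retrying; the sketch leaves that out.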

Apache Iceberg vs Delta Lake: Different approaches, similar results

Although Apache Iceberg and Delta Lake are converging on a similar endpoint, they are not identical technologies. You can think of this as an example of convergent evolution, where different approaches to problem-solving result in similar outcomes. Although both achieve efficient storage optimization, data consistency, and performance improvements, the underlying mechanisms they use to achieve this differ.

Let’s look at the key differences in how Apache Iceberg and Delta Lake handle metadata.

Manifest files

Iceberg collects metadata about the datasets stored in it. This metadata is stored in manifest files. Each manifest file points to the data files used in the table, along with the partition data needed to retrieve the data efficiently. This creates a historical record of changes made to the data. The record itself is organized as a tree structure and stored in Avro files.

Because they contain a history of changes made to the state of the data set, manifest files enable all the unique features of Iceberg. Manifest files allow Iceberg to handle:

Updates
Deletes
ACID transactions
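
The role manifest files play can be sketched in a few lines of Python. This is a toy model, not Iceberg's implementation or the PyIceberg API: the class names (`DataFile`, `Manifest`, `Snapshot`), the `plan_scan` helper, and the file paths are all hypothetical, and real manifests are Avro files rather than in-memory objects. It only illustrates how per-file partition data lets an engine pick the files it needs, and how keeping older table states around preserves history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str      # hypothetical single partition column, e.g. an event date
    record_count: int


@dataclass
class Manifest:
    """Toy stand-in for a manifest file: data files plus their partition info."""
    files: list


@dataclass
class Snapshot:
    """Toy stand-in for one table state: the manifests that make it up."""
    snapshot_id: int
    manifests: list


def plan_scan(snapshot, partition):
    """Return the data files a query on one partition would need to read."""
    return [
        f.path
        for m in snapshot.manifests
        for f in m.files
        if f.partition == partition
    ]


m1 = Manifest([DataFile("data/a.parquet", "2024-09-15", 100),
               DataFile("data/b.parquet", "2024-09-16", 250)])
m2 = Manifest([DataFile("data/c.parquet", "2024-09-16", 75)])

# Each write produces a new table state; older states remain readable.
snap1 = Snapshot(1, [m1])
snap2 = Snapshot(2, [m1, m2])

print(plan_scan(snap2, "2024-09-16"))  # ['data/b.parquet', 'data/c.parquet']
print(plan_scan(snap1, "2024-09-16"))  # ['data/b.parquet']
```

Querying `snap1` instead of `snap2` is the toy equivalent of a time travel query: the older state is still described by its own metadata.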

Delta Log

Delta Lake also collects enhanced metadata, just like Iceberg. However, instead of manifest files, it stores this metadata in a directory known as the Delta Log. The Delta Log is recorded as a sequence of JSON files, with periodic checkpoints written as Parquet files. This means that the log is structured as a flat, ordered list of entries spread across many files. Each entry in the log corresponds to a change in the dataset.
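
The idea that each log entry corresponds to a change can be sketched as a simple replay. The example below is a heavily simplified illustration, not the Delta Lake protocol in full: the `replay` and `commit_name` helpers and the file paths are hypothetical, and real commit files carry many more fields than the `add` and `remove` actions shown here.

```python
import json


def commit_name(version):
    # Commit files are named by zero-padded version number.
    return f"{version:020d}.json"


# Hypothetical, simplified commits: one JSON action per line.
commits = {
    commit_name(0): [
        '{"add": {"path": "part-000.parquet"}}',
        '{"add": {"path": "part-001.parquet"}}',
    ],
    commit_name(1): [
        '{"remove": {"path": "part-000.parquet"}}',
        '{"add": {"path": "part-002.parquet"}}',
    ],
}


def replay(commits, up_to_version=None):
    """Rebuild the set of live data files by applying commits in order."""
    live = set()
    for name in sorted(commits):
        version = int(name.split(".")[0])
        if up_to_version is not None and version > up_to_version:
            break
        for line in commits[name]:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


print(replay(commits))                   # ['part-001.parquet', 'part-002.parquet']
print(replay(commits, up_to_version=0))  # ['part-000.parquet', 'part-001.parquet']
```

Stopping the replay at an earlier version is what makes time travel possible in this model.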

Iceberg Manifest files vs Delta Lake Delta Log

Although both Delta Lake and Apache Iceberg maintain a comprehensive history of changes, their approaches differ. Delta Lake uses a Delta Log with JSON files and periodic Parquet checkpoints to track changes. This can make historical data retrieval efficient, but the performance depends on checkpoint frequency. In contrast, Iceberg uses snapshots with manifest files listing the data files and partitions. Both approaches optimize historical data access, though their specific methods and efficiencies vary.
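
To see why checkpoint frequency matters, consider a rough sketch of how many log files a reader must open to rebuild a given table version. The `files_to_read` helper is hypothetical, and it assumes a checkpoint exists at every positive multiple of `checkpoint_interval`; actual checkpoint behavior is configurable and may differ.

```python
def files_to_read(target_version, checkpoint_interval):
    """Return (checkpoint_version, json_commit_versions) needed to rebuild a version.

    Assumption: a Parquet checkpoint exists at every positive multiple of
    checkpoint_interval, and every commit also has a JSON file.
    """
    checkpoint = (target_version // checkpoint_interval) * checkpoint_interval
    if checkpoint == 0:
        # No checkpoint yet: replay every JSON commit from version 0.
        return None, list(range(0, target_version + 1))
    # Start from the checkpoint, then replay only the commits after it.
    return checkpoint, list(range(checkpoint + 1, target_version + 1))


for interval in (10, 100):
    checkpoint, commits_needed = files_to_read(27, interval)
    print(interval, checkpoint, len(commits_needed))
# 10 20 7
# 100 None 28
```

With frequent checkpoints the reader opens one Parquet file and a handful of JSON commits; with infrequent ones it replays far more of the log.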

Why your organization’s toolset is now the primary deciding factor when considering Apache Iceberg vs Delta Lake

Given the convergence of Iceberg and Delta Lake in terms of features, the most effective way to compare the two technologies is by examining their distinct ecosystems.

With this, the choice between Apache Iceberg and Delta Lake often comes down to selecting the data ecosystem that best fits your organization’s technical stack. For this, there is no one-size-fits-all solution. Instead, focus on a few key differentiators and determine where the needs of your project, organization, and other factors align.

Let’s look at a few situations and decide which format wins in each case.

Spark integration

Delta Lake integrates deeply with Databricks and Spark, making it a strong candidate for organizations heavily invested in those technologies. Because these technologies have been developed within the same ecosystem, they enjoy tight, native integration.

Winner: Delta Lake

Compute engine integration

Apache Iceberg and Delta Lake each work with different compute engines. Like other ecosystem integrations, this one depends on your toolset and objectives. If you are a Spark user, Delta Lake may be the better choice; if you are using an open data stack, Apache Iceberg is still probably best. In either case, you should compare your actual toolset, including compute engine, with each table format to see the differences.

Winner: Depends on the ecosystem

Cloud integration and optimization

Your choice of cloud provider has a direct impact on the performance of the table format. Importantly, there is no clear winner in this case. Instead, different ecosystems support different table formats and compute engines in different ways. To achieve optimal results, compare Apache Iceberg and Delta Lake on the cloud platform you intend to use, such as AWS, Azure, or GCP. We recommend extensive testing and benchmarking to ensure that the system you use is properly optimized for the cloud.

Winner: Depends on the ecosystem

Internal tables vs external table support

Different compute engines integrate differently with Iceberg vs Delta Lake, depending on the ecosystem being used. This means that some engines use internal tables, whereas others rely on external tables. When comparing the two, this difference is consequential, as it may create different performance outcomes.

Winner: Depends on the ecosystem

Multiple Engine Support

Apache Iceberg is designed to be engine-agnostic and works well with various query engines, such as Starburst, Trino, and Amazon Athena. This flexibility makes Iceberg particularly attractive for environments that use multiple processing engines or have a mixed analytics strategy.

Delta Lake also supports other engines, though its integration is strongest for Spark users.

Winner: Apache Iceberg

Catalog choices

Selecting a table format and compute engine is only part of the job; you also need a catalog to keep track of your tables. Catalog choice used to be a clear win for Apache Iceberg, and Iceberg is still the default open choice. However, with recent changes in Unity Catalog, Delta Lake is more open than ever. Unity Catalog’s goal is to improve interoperability across the data ecosystem. It does this by providing a centralized governance layer for managing metadata, data access, and security across multiple data sources, including Delta Lake and Iceberg.

By unifying data access policies across different tools and engines, Unity Catalog helps streamline the management of permissions, which can make it easier for external engines to interact with Delta Lake data securely. While this approach enhances how external engines access Delta Lake data, the deep, native functionality remains strongest with Databricks and Spark.

Winner: Depends on the ecosystem

Why Starburst Icehouse Architecture wins when comparing Apache Iceberg vs Delta Lake

What’s the bottom line on Iceberg vs. Delta Lake? For organizations looking to take the leap into a data lakehouse, there have never been more options available, and the two leading table formats have never been so closely aligned in their feature offerings. This means that you have one crucial thing: options.

Compare all your options

What will you do with those options? In general, I’d suggest looking holistically at Apache Iceberg vs Delta Lake and determining whether your data ecosystem benefits one or the other based on toolset, cloud integrations, or compute engines.

Treat it like an experiment. Run benchmarks. Review data. Repeat until you have clarity.
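
As a starting point for that kind of experiment, here is a minimal timing harness. It is a sketch under stated assumptions: `benchmark` is a hypothetical helper, and the two `fake_*_query` functions are placeholders that stand in for whatever call runs the same query against an Iceberg table and a Delta Lake table in your own environment.

```python
import statistics
import time


def benchmark(run_query, repeats=5, warmups=1):
    """Time a zero-argument callable and summarize the results in seconds."""
    for _ in range(warmups):
        run_query()                      # warm caches before measuring
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "min_s": min(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


# Placeholders only: replace these with real queries against each table format.
def fake_iceberg_query():
    return sum(range(100_000))


def fake_delta_query():
    return sum(range(120_000))


results = {
    "iceberg": benchmark(fake_iceberg_query),
    "delta": benchmark(fake_delta_query),
}

for name, stats in results.items():
    print(f"{name}: median {stats['median_s']:.6f}s over {stats['runs']} runs")
```

Reporting the median rather than a single run helps smooth out noise from caching and cluster startup, which is why the harness includes warmup runs.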

Iceberg is likely to continue leading on openness

What do you do if the results don’t show a clear difference? In this case, you could use either. But in this scenario, I think there’s a case to be made for choosing Apache Iceberg, and specifically an Icehouse architecture based on Trino.

There are a few reasons for this, and most come down to history. Delta Lake is currently becoming a lot more like Iceberg on openness, and this is something to be welcomed and applauded, but Iceberg was there first. This means that there is a clear track record of the Apache Iceberg project itself pushing forward the possibilities of the open data stack, and there’s every reason to believe that this will continue. Databricks and Snowflake have both made pivots towards Apache Iceberg, so the momentum is there on all sides.

Starburst Galaxy supports both Apache Iceberg and Delta Lake

For our part, Starburst Galaxy and Starburst Enterprise have supported both table formats for a long time; however, we still believe Apache Iceberg is the preferred choice for most people. And that’s very much the point. Once upon a time, these table formats were quite different; now, the differences are much less pronounced. But at the end of the day, you still need to choose a table format, and all signs in the industry point to momentum, openness, and convergence around the paradigm that Apache Iceberg started. That’s reason enough to choose Apache Iceberg over Delta Lake.

Icehouse architecture suits most users

If you choose Iceberg, then Trino is a clear next step. Iceberg was first designed to run on Trino clusters at Netflix, and that integration carries on. If you’re going to be using Trino and don’t have the resources to spin up a large deployment on your own, then Starburst is the next logical choice, offering a managed platform. And that’s really the benefit of the Icehouse architecture, a straightforward, open data lakehouse approach that lets you benefit from all that this new era promises.

Want to learn more about Icehouse architecture? We offer a range of resources to help you learn, test, and apply this approach using your own Starburst Galaxy cluster.

Why choose Apache Iceberg over Databricks Delta Lake


What is an Icehouse?

An Icehouse is a data lakehouse built on top of Apache Iceberg and Trino
