The difference between Hudi and Iceberg
Cindy Ng
Sr. Manager, Content
Starburst
Open table formats like Apache Hudi and Apache Iceberg address the issues companies increasingly experience with their legacy platforms running Hadoop and Hive. This article will discuss the differences between Hudi and Iceberg and explain how Iceberg is becoming the cornerstone of modern data lakehouse analytics.
What is Apache Hudi?
The Apache Hudi project got its start in 2016 at Uber. The ride-sharing company had built a data lake on Hadoop and Hive, but its batch processing pipelines took hours to complete. Traditional streaming stacks excelled at processing row-based data but could not handle the lake’s columnar data. Uber’s solution became Hudi, an incremental processing stack that reduces ingestion latency from hours to minutes.
What is Apache Iceberg?
Around the same time Uber was struggling with Hive, Netflix faced a different set of issues. Hive did not handle changes well, so the streaming service needed a new table format that supported ACID (atomicity, consistency, isolation, durability) transactions. Since Iceberg became an open-source project, its tables have increasingly been the preferred choice for data lakes thanks to benefits like:
- Scalability and performance
- ACID transactions
- Schema evolution
- Time travel
Iceberg also provides significant optionality. It supports open file formats like Avro, ORC, and Parquet. The table format also lets users simultaneously use different query engines, such as Flink, Apache Spark, and Trino.
Comparing open table formats: Hudi vs Iceberg vs Delta Lake
Modern open table formats are essential to maximizing the potential of data lakes by supporting a data warehouse’s processing and analytics capabilities on commodity-priced cloud object storage. Organizations can use Iceberg and Hudi with Hadoop or any other distributed file system. Another open table format, Delta Lake, is also an option but tends to be used within Databricks platforms.
How does Apache Iceberg handle schema evolution compared to Apache Hudi?
Shifting business priorities and a dynamic data environment frequently require changes to table schema. Older formats like Apache Hive impose significant performance penalties by rewriting entire files in ways that impact existing queries. Schema evolution is one of the key features enabled by modern table formats.
Hudi tables, depending on their configurations, use one of two approaches to schema evolution. Copy On Write (COW) uses columnar formats like Parquet files to store data and performs updates by rewriting the file to a new version. COW is the default approach, proven at scale with high-performance query engines like Trino.
Hudi’s experimental Merge on Read (MOR) approach combines columnar data files with row-based files like Avro to log changes for later compaction. MOR provides greater flexibility, especially for changes to nested columns.
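As a rough illustration, the PySpark sketch below writes a small Hudi table and shows where the table type is selected. This is a hedged example, not a reference implementation: the table name, field names, and S3 path are placeholders, and it assumes the Hudi Spark bundle is already on the session’s classpath.

```python
# Minimal sketch: writing a Hudi table and choosing Copy On Write vs. Merge On Read.
# Table name, field names, and path are hypothetical; adjust to your environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-table-type-example")
    # Kryo serialization is commonly recommended for Hudi workloads.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("r1", "2024-01-01", 42)], ["record_id", "event_date", "value"]
)

hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "record_id",
    "hoodie.datasource.write.precombine.field": "event_date",
    # Switch between the two strategies discussed above:
    # "COPY_ON_WRITE" (the default) or "MERGE_ON_READ".
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://example-bucket/hudi/rides"
)
```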
Iceberg uses in-place schema evolution to add, remove, rename, update, and reorder columns without table rewrites. Data files don’t have to be touched because changes are recorded within the table’s metadata. In effect, Iceberg provides a transaction log for each table that includes snapshots of the included data files, statistics to improve query performance, and any changes from previous versions.
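For illustration, the sketch below issues Iceberg schema-evolution DDL through Spark SQL. The catalog and table names (demo.sales.orders) and the columns are hypothetical, and the example assumes an Iceberg catalog is already configured for the Spark session; each statement updates table metadata only, so no data files are rewritten.

```python
# Sketch: in-place schema evolution on an Iceberg table via Spark SQL DDL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-schema-evolution").getOrCreate()

# Add a new column; existing data files are untouched.
spark.sql("ALTER TABLE demo.sales.orders ADD COLUMN discount_pct double")

# Rename a column; Iceberg tracks columns by ID, so this is metadata-only.
spark.sql("ALTER TABLE demo.sales.orders RENAME COLUMN amount TO order_amount")

# Widen a column type (int to bigint is a safe promotion).
spark.sql("ALTER TABLE demo.sales.orders ALTER COLUMN quantity TYPE bigint")

# Drop a column that is no longer needed.
spark.sql("ALTER TABLE demo.sales.orders DROP COLUMN legacy_flag")
```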
Iceberg tables and time travel
Iceberg’s metadata-based approach to documenting change enables time travel, the ability for queries to access historical data. Every change results in a new snapshot that captures the current table’s state, but Iceberg tables keep their old snapshots. Queries can access the table’s list of snapshots to return results from older versions. Rollbacks are common use cases for Iceberg’s time travel functionality, allowing a table to be restored to a previous state after a mistaken change.
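A minimal sketch of these ideas, again with hypothetical catalog and table names and an Iceberg-enabled Spark session, might look like this:

```python
# Sketch: inspecting snapshots, time traveling, and rolling back an Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# List the table's snapshots from its metadata tables.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM demo.sales.orders.snapshots"
).show()

# Query the table as of a specific snapshot (time travel; Spark 3.3+ syntax).
spark.sql(
    "SELECT count(*) FROM demo.sales.orders VERSION AS OF 1234567890123456789"
).show()

# Restore the table to that snapshot after a mistaken change (rollback).
spark.sql(
    "CALL demo.system.rollback_to_snapshot('sales.orders', 1234567890123456789)"
)
```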
Hudi vs. Iceberg
| Feature | Apache Hudi | Apache Iceberg |
| --- | --- | --- |
| Transaction Support (ACID Compliance) | Full | Full |
| File Format | Parquet and ORC | Parquet, ORC, Avro |
| Schema Evolution | Full | Full |
| Partition Evolution | No | Yes |
| Versioning | Yes | Yes |
| Time Travel | Yes | Yes |
| Materialized Views | No | Yes |
| Community & Ecosystem | Growing | Growing |
| Integrations | Limited (read-only support for Trino and Hive) | Interoperable |
| Use Cases | Streaming data or near real-time ingestion | General-purpose data lakehouses |
Which table format should data engineers choose for a data lake?
Each table format brings its own advantages and disadvantages, which data engineering teams need to factor into designing their data architectures. Hudi’s origins as a solution to Uber’s data ingestion challenges make it a good choice when you need to optimize data processing pipelines. In contrast, Netflix developed Iceberg to simplify the big data management issues of the Hadoop and Hive ecosystem. As such, migrating to Iceberg tables is ideal for storing large datasets in a data lake.
Iceberg, the Trino MPP SQL query engine, and Apache Spark
As mentioned earlier, Iceberg lets different query engines access tables concurrently, allowing data teams to use the most appropriate engine for the job. Trino, a fork of Presto, is a massively parallel processing SQL query engine that uses connectors to query large datasets distributed across different sources.
Trino’s Iceberg connector provides full access to Iceberg tables by simply configuring access to a catalog like the Hive Metastore, AWS Glue, a JDBC catalog, or a REST catalog. Trino will connect to Azure Storage, Google Cloud Storage, Amazon S3, or legacy Hadoop platforms.
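For example, once such a catalog is configured, an Iceberg table can be queried from Python with the Trino client. The hostname, user, schema, and table names below are placeholders, and the sketch assumes a Trino catalog named iceberg already uses the Iceberg connector.

```python
# Sketch: querying an Iceberg table through Trino's Python client.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",   # placeholder coordinator hostname
    port=443,
    user="analyst",
    http_scheme="https",
    catalog="iceberg",          # Trino catalog configured with the Iceberg connector
    schema="sales",
)

cur = conn.cursor()
cur.execute(
    "SELECT order_date, sum(order_amount) AS total FROM orders GROUP BY order_date"
)
for row in cur.fetchall():
    print(row)
```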
Amazon S3, AWS and Iceberg
AWS services like Athena, EMR, and Glue support Iceberg tables to various degrees. Athena, for example, requires Iceberg tables to store data in Parquet files and will only work with Glue catalogs.
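As a hedged sketch, creating an Iceberg table from Athena with boto3 might look like the following; the region, workgroup defaults, bucket, database, and column names are hypothetical.

```python
# Sketch: creating an Iceberg table from Athena. Athena stores Iceberg data as
# Parquet and registers the table in a Glue catalog.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE TABLE analytics.orders_iceberg (
  order_id string,
  order_date date,
  order_amount double
)
LOCATION 's3://example-bucket/iceberg/orders/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

response = athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```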
What is Iceberg in Snowflake?
Snowflake is a proprietary cloud-based data warehouse solution. Recently, the company began developing support for Iceberg, which is now in public preview. Snowflake users can either configure Snowflake to act as the Iceberg metadata catalog or have it pull snapshots from a Glue catalog or directly from an object store.
Getting the most out of Snowflake’s implementation, however, requires integrating Iceberg’s metadata into Snowflake at the risk of greater vendor lock-in.
Start building your open data lakehouse powered by Iceberg table formats
Starburst Galaxy is a modern data lakehouse analytics platform built by the creators of Trino. With features like federation, near-real-time ingestion, accelerated SQL analytics, and more than fifty connectors, Galaxy unifies enterprise data within a single point of access. Big data becomes easier to manage across a globally distributed architecture, improving compliance with GDPR and other data regulations. At the same time, Galaxy makes data more accessible, since data consumers can use ANSI-standard SQL or business intelligence tools to query data anywhere in the organization.
Performance of a data warehouse
Starburst builds upon Trino’s massively parallel processing query engine to give data lakehouses the analytics performance of proprietary data warehouses.
A cost-based optimizer takes SQL queries and evaluates the performance and cost implications of different execution plans, choosing the ideal option to meet data teams’ business criteria.
Starburst’s Cached Views create snapshots of frequently requested query results to reduce costs and apparent latency. From the user’s perspective, the materialized views are indistinguishable from a fresh query run. And with incremental updates, the cached data remains current.
Additional performance features like pushdown queries and dynamic filtering complete queries faster while also reducing network traffic.
Scale of a data lake
Starburst Galaxy fully enables the scalability enterprises need from their data architectures. A data lake’s object storage provides a central repository for ad hoc, interactive, and advanced analytics. However, it can never hold all the data relevant to insight generation.
By federating data lakes and other data sources within a unified access layer, Starburst Galaxy turns the organization’s entire architecture into a distributed data lakehouse.
Starburst Gravity is the platform’s universal discovery, governance, and sharing layer. Gravity’s automatic cataloging system consolidates metadata from every source, turning Starburst into a central access hub across clouds, regions, and sources.
Gravity provides role-based and attribute-based access controls to streamline governance and support fine-grained access policies down to the row and column level.
Combining the strengths of Starburst’s analytics platform with the benefits of the Iceberg table format is so advantageous that Iceberg is the default table format when creating tables in Starburst Galaxy.