Apache Iceberg

One of the Apache Software Foundation’s open-source projects, the Iceberg table format enables more efficient petabyte-scale data processing by creating an abstracted metadata layer that describes the files in a data lake’s object storage.

12 Apache Iceberg table benefits & features:

Choosing Apache Iceberg as the table format for an open lakehouse provides several benefits:

1. Reduced metastore reliance

  • Iceberg’s catalog avoids Hive Metastore limitations
  • Only used to store a single pointer to the latest snapshot

2. Time travel & Rollbacks

  • Queries are reproducible and can access previous table states
  • Query older “versions” of the table

3. Optimistic concurrency

  • Ensures updates despite multiple concurrent users.
  • Allows multiple writes to occur simultaneously, only checking for conflicts before a final commit.

4. Hidden partitioning

  • Keeps users from writing slow or inaccurate queries.

5. Snapshots

  • Iceberg creates new table versions with every change.

6. Snapshot expiration

  • Sets how long to keep table versions before deletion.

7. Schema evolution

  • Low-impact incorporation of new columns and other changes.
  • Tracks columns by id
  • Add/remove/modify columns
  • Supports nested evolution

8. Performance

  • Table metadata lets queries prune irrelevant metadata and data files.
  • No file listing – just start reading files after planning
  • Data skipping using stats on files & partitioning
  • File metadata includes partition information (partition pruning)

9. Sorted tables

  • Sorting data at write improves filtering and reads
  • Allow tables to be restricted to being sorted by N columns improving performance via data skipping

10. Table optimization

  • Formats support compacting many small files into more efficient large files

11.DML

  • Update
  • Insert
  • Delete

12. Optimize

  • Compact data and combine small files into larger files

Apache Iceberg | Building an open data lakehouse architecture

An open data lakehouse comprises four elements: commodity storage, open file formats, open table formats, and high-performance query engines.

Commodity storage: You can build a lakehouse on storage platforms like Amazon S3. These efficient cloud services provide efficient ways to store different data types at scale. Related reading: Using Apache Iceberg, AWS S3, and AWS Glue to manage a data lakehouse architecture

Open file formats: Various open file formats like Avro, Parquet, and ORC let you optimize how you collect and store data in your lake.

Open table formats: Iceberg is the open table format of the data lakehouse architecture. Its rich metadata files and analytics-optimized structure allow query engines to run more efficiently.

Query engines: High-performance query engines like Apache’s Spark and Trino are optimized for big data analytics.

Iceberg’s table structure

Iceberg tables use metadata, snapshots, and manifests to track individual data files. Any changes to the table are made to these components rather than the data itself. This approach gives Iceberg more robust functionality than predecessors like Apache Hive.

Like a database’s SQL tables, Metadata files describe the table’s schema, partitioning, and other information. They also contain snapshots of the table’s data files. Iceberg generates a new snapshot anytime the table’s state changes. As a result, these tables retain a complete history of state changes.

A manifest file describes the table’s data files. A snapshot may contain multiple manifest files it documents in a manifest list. Iceberg’s approach to manifest files and lists reduces overhead and makes queries more efficient.

What is Apache Iceberg used for?

Companies with petabyte-scale data ecosystems are the primary users of Iceberg. Large datasets place enormous demands on information architectures. Iceberg’s use cases simplify big data management on a data lake.

Simpler data architectures

Data lakes can store unstructured and structured data, making them better suited to modern data analytics demands. However, data lakes cannot match a data warehouse platform’s full suite of analytics capabilities. As a result, companies often layer data warehouses on top of the data lakes. Besides the added complexity, this approach increases costs and the risks associated with data moves and duplication.

Iceberg’s open table format adds rich metadata and query engine compatibility to blend a warehouse’s analytics capabilities with a lake’s storage efficiency. This simpler architecture eliminates data warehouses and turns a lake into the enterprise’s central analytics resource.

Manage complex data processing

Once generated, data takes time to settle. For example, data associated with customer orders can change anywhere from their initial creation to the end of a return window. Regulated personal data must be purged at intervals set by compliance policies. Frequent, small changes to large datasets place enormous workloads on data systems.

Iceberg’s design allows these granular changes to occur without imposing performance penalties.

Concurrent data usage

Enterprise applications and users often need access to the same data simultaneously. However, allowing concurrent access can be risky. When users of a dataset read and write at the same time, the resulting inconsistencies may contaminate downstream analysis.

Iceberg isolates the lake’s raw data through metadata abstraction, instead giving users access to unique snapshots of the data table. Changes result in a new snapshot, but the users can continue using the original snapshot to preserve consistency and repeatability.

What is hidden partitioning in Iceberg?

Many table formats can group data by common properties. This partitioning lets queries skip irrelevant data and return results faster at a lower cost. However, formats like Hive force users to have a deep understanding of table structure and partitioning in order to prevent errors or inaccurate query results.

Iceberg hides aspects of partitioning from users by, for example, automating the creation of partition values and avoiding irrelevant partitions. Hidden partitioning lets Iceberg partitions evolve without affecting queries.

Related reading: Iceberg Partitioning and Performance Optimizations in Trino

Iceberg vs Delta Lake vs Hudi vs Hive

Hive

Hudi

Delta Lake

Iceberg

Original table format Created for time/event series data Open source doesn’t support concurrent writers Hidden Partitions
Supports ORC, Parquet, JSON, etc Great for streaming use cases Only supports Parquet Metadata tree is more performant using AVRO
Partition columns must be part of table Copy on write & on read Can’t change partitioning Partition and table evolution
Relies heavily on the metastore Table evolution, compaction, etc 10 checkpoints every commit; so every 10th write is slower Full DML
Full DML Associated with Databricks Associated with Trino

Advantages of Iceberg over other table formats

Data lakehouses are still relatively recent developments, with solutions based on Apache’s Iceberg or Hudi projects or Databricks’s Delta Table format. These three options have similar functionality, but the devil is in the details.

  • All three table formats are open source. Iceberg and Hudi are Apache projects, although Iceberg has the larger developer community.
  • The Delta Table format, while nominally open source, is primarily supported by Databricks, the corporation that first developed it.
  • Amazon’s AWS, Microsoft Azure, Google Cloud, and other data platforms support all three to varying degrees.
  • Ultimately, the right choice depends upon an enterprise’s existing infrastructure and data use cases.

Related reading:

Hive vs Iceberg

Feature

Apache Iceberg

Apache Hive

Transaction support (ACID) Yes Limited 
File format Parquet, ORC, Avro Parquet, ORC, Avro, and more
Schema evolution Full Partial 
Partition evolution Yes No
Data versioning Yes No
Time travel queries Yes No
Concurrency control Optimistic locking Pessimistic locking
Object store cost optimization Yes No
Community and ecosystem Growing  Established

Developers at Netflix created Iceberg to address the challenges of using Apache Hive on the streaming service’s extensive data infrastructure.

Hive uses a subsystem called a metastore that points to a table’s data. However, it only points to the folder containing the relevant data file. That may be acceptable in a structured environment like a Hadoop-based data warehouse. Using this approach with object storage imposes stiff performance penalties.

Another performance hit comes from how Hive interacts with Hadoop, which relies on Java-based MapReduce jobs to interact with data. Few data consumers have the specialized skills in Java+MapReduce needed to query Hadoop data stores. Hive implements HiveQL to create an SQL-like approach for generating Hadoop queries. However, the Hive approach means every query command requires a translation step between HiveQL and Java.

Netflix’s developers set out to create a new table format that addressed these and other issues Hive creates when analyzing petabytes of data. Eventually, Netflix handed the project over to the Apache Software Foundation, where it has flourished.

Iceberg and Hive query large datasets, but deciding what to use depends on your use case.

Related reading: Hive vs Iceberg: How to migrate your Hive tables to Iceberg

Apache Iceberg vs Apache Parquet

Although both Apache Iceberg and Parquet are open source projects, they address different aspects of the data lakehouse architecture. Whereas Iceberg is an open table format, Parquet is an open file format for creating column-oriented data files on a data lake. This structure compresses more efficiently than a row-oriented format like Avro, which reduces overall storage costs. In addition, Parquet files help speed queries by, for example, providing metadata that queries can use to skip irrelevant data.

Build your open data lakehouse with Starburst with Apache Iceberg Open Table Format

Iceberg’s open table format lets you connect your data lakehouse to any query engine. Starburst’s modern data lake analytics platform lets you connect to any data source. Using Starburst to power the analytics of your Iceberg-based data lake makes data more manageable, optimizes compute and storage investments, and speeds time to insight for more effective decision-making.

Starburst is based on the Trino open source project’s massively parallel query engine but with optimizations designed to maximize the features of Iceberg data tables, including schema evolution, time travel, and partitioning.

Since some workloads work best with different table formats, we created Great Lakes, a connectivity feature of Starburst Galaxy. Great Lakes abstracts the details of a data lake’s table formats and file formats to simplify accessing tables, whether based on Iceberg, Hudi, or Delta Table. Starburst’s Great Lakes lets data teams optimize their data lake architectures for different use cases. Data consumers can run SQL queries without having to know the details of each table’s format.

Demo: Iceberg and Trino

In this exciting exploration, we’re delving into the powerful combination of Apache Iceberg and Trino, two dynamic tools reshaping the landscape of big data. To do this, we’ll use Starburst Galaxy and compare its use to AWS Athena.
Iceberg and Trino