Apache Iceberg
12 Apache Iceberg table benefits & features:
Choosing Apache Iceberg as the table format for an open lakehouse provides several benefits:
1. Reduced metastore reliance
- Iceberg’s catalog avoids Hive Metastore limitations
- The catalog only stores a single pointer to the table’s current metadata, and through it the latest snapshot
2. Time travel & Rollbacks
- Queries are reproducible and can access previous table states
- Query older “versions” of the table
3. Optimistic concurrency
- Keeps updates consistent even when multiple users write concurrently.
- Allows multiple writes to occur simultaneously, only checking for conflicts before a final commit.
4. Hidden partitioning
- Keeps users from writing slow or inaccurate queries.
5. Snapshots
- Iceberg creates new table versions with every change.
6. Snapshot expiration
- Sets how long to keep table versions before deletion.
7. Schema evolution
- Low-impact incorporation of new columns and other changes.
- Tracks columns by id
- Add/remove/modify columns
- Supports nested evolution
8. Performance
- Table metadata lets queries prune irrelevant metadata and data files.
- No file listing – just start reading files after planning
- Data skipping using stats on files & partitioning
- File metadata includes partition information (partition pruning)
9. Sorted tables
- Sorting data at write improves filtering and reads
- Tables can declare a sort order on N columns, which improves performance via data skipping
10. Table optimization
- Supports compacting many small files into fewer, larger, more efficient files
11. DML
- Update
- Insert
- Delete
12. Optimize
- Compact data and combine small files into larger files
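The compaction idea in the table optimization items can be illustrated with a small planning sketch. This is not Iceberg's actual rewrite procedure: the `DataFile` class, the `plan_compaction` function, and the 128 MB target are hypothetical names and values chosen only for illustration. It groups small files into bins that could each be rewritten as one larger file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    size_mb: int


def plan_compaction(files, target_mb=128):
    """Group small files into bins of at most target_mb (first-fit, largest first)."""
    small = sorted(
        (f for f in files if f.size_mb < target_mb),
        key=lambda f: f.size_mb,
        reverse=True,
    )
    bins = []  # each bin is a list of files to rewrite as one larger file
    for f in small:
        for b in bins:
            if sum(x.size_mb for x in b) + f.size_mb <= target_mb:
                b.append(f)
                break
        else:
            bins.append([f])
    # A bin holding a single file gains nothing from a rewrite.
    return [b for b in bins if len(b) > 1]


# Hypothetical file sizes; b.parquet is already large and is left alone.
files = [
    DataFile("a.parquet", 10),
    DataFile("b.parquet", 200),
    DataFile("c.parquet", 60),
    DataFile("d.parquet", 50),
    DataFile("e.parquet", 100),
]
groups = plan_compaction(files)
for g in groups:
    print([f.path for f in g], sum(f.size_mb for f in g), "MB")
```

A real engine would also weigh partition boundaries, sort order, and delete files when choosing what to rewrite; the sketch only shows the size-based grouping.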
Apache Iceberg | Building an open data lakehouse architecture
An open data lakehouse comprises four elements: commodity storage, open file formats, open table formats, and high-performance query engines.
Commodity storage: You can build a lakehouse on storage platforms like Amazon S3. These cloud services provide efficient ways to store different data types at scale. Related reading: Using Apache Iceberg, AWS S3, and AWS Glue to manage a data lakehouse architecture
Open file formats: Various open file formats like Avro, Parquet, and ORC let you optimize how you collect and store data in your lake.
Open table formats: Iceberg is the open table format of the data lakehouse architecture. Its rich metadata files and analytics-optimized structure allow query engines to run more efficiently.
Query engines: High-performance query engines like Apache Spark and Trino are optimized for big data analytics.
Iceberg’s table structure
Iceberg tables use metadata, snapshots, and manifests to track individual data files. Any changes to the table are made to these components rather than the data itself. This approach gives Iceberg more robust functionality than predecessors like Apache Hive.
Much like the table definitions in a SQL database, metadata files describe the table’s schema, partitioning, and other information. They also track the snapshots of the table’s data files. Iceberg generates a new snapshot any time the table’s state changes. As a result, these tables retain a history of state changes until old snapshots are expired.
A manifest file describes a set of the table’s data files. A snapshot may reference multiple manifest files, which it records in a manifest list. Iceberg’s approach to manifest files and lists reduces overhead and makes queries more efficient.
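The relationship between metadata, snapshots, manifest lists, and manifests can be sketched as a toy in-memory model. The class and method names here (`TableMetadata`, `commit`, `files_at`) are hypothetical and do not mirror Iceberg's on-disk format or any library API; the sketch only shows how each change yields a new snapshot while earlier ones stay readable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Manifest:
    data_files: tuple  # paths of the data files this manifest describes


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: tuple  # the manifests that make up this table state


@dataclass
class TableMetadata:
    schema: dict
    snapshots: list = field(default_factory=list)

    def commit(self, manifests):
        """Record a new snapshot; earlier snapshots stay untouched."""
        snap = Snapshot(len(self.snapshots) + 1, tuple(manifests))
        self.snapshots.append(snap)
        return snap

    def files_at(self, snapshot_id=None):
        """List data files for a snapshot (default: the latest one)."""
        if snapshot_id is None:
            snap = self.snapshots[-1]
        else:
            snap = next(s for s in self.snapshots if s.snapshot_id == snapshot_id)
        return [f for m in snap.manifest_list for f in m.data_files]


table = TableMetadata(schema={"order_id": "long", "ts": "timestamp"})
m1 = Manifest(("data/f1.parquet", "data/f2.parquet"))
table.commit([m1])
m2 = Manifest(("data/f3.parquet",))
table.commit([m1, m2])  # an append reuses m1 and adds m2

print(table.files_at())   # latest table state
print(table.files_at(1))  # "time travel" to the first snapshot
```

Because the second snapshot reuses `m1` rather than copying it, the append only writes metadata for the new file, which is one reason changes stay cheap.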
What is Apache Iceberg used for?
Companies with petabyte-scale data ecosystems are the primary users of Iceberg. Large datasets place enormous demands on information architectures. Iceberg’s main use cases all center on simplifying big data management on a data lake.
Simpler data architectures
Data lakes can store unstructured and structured data, making them better suited to modern data analytics demands. However, data lakes cannot match a data warehouse platform’s full suite of analytics capabilities. As a result, companies often layer data warehouses on top of the data lakes. Besides the added complexity, this approach increases costs and the risks associated with data moves and duplication.
Iceberg’s open table format adds rich metadata and query engine compatibility to blend a warehouse’s analytics capabilities with a lake’s storage efficiency. This simpler architecture removes the need for a separate data warehouse and turns the lake into the enterprise’s central analytics resource.
Manage complex data processing
Once generated, data takes time to settle. For example, data associated with a customer order can change at any point between the order’s creation and the end of its return window. Regulated personal data must be purged at intervals set by compliance policies. Frequent, small changes to large datasets place enormous workloads on data systems.
Iceberg’s design allows these granular changes to occur without imposing performance penalties.
Concurrent data usage
Enterprise applications and users often need access to the same data simultaneously. However, allowing concurrent access can be risky. When users of a dataset read and write at the same time, the resulting inconsistencies may contaminate downstream analysis.
Iceberg isolates the lake’s raw data through metadata abstraction, instead giving users access to unique snapshots of the data table. Changes result in a new snapshot, but the users can continue using the original snapshot to preserve consistency and repeatability.
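The commit behavior behind this isolation can be sketched as a compare-and-swap on the catalog's pointer. This is a simplified illustration, not Iceberg's implementation: the `Catalog` class, `compare_and_swap`, and `commit_with_retry` are hypothetical names, and a real commit would write new metadata files and check for conflicting changes before retrying.

```python
import threading


class Catalog:
    """Holds one pointer per table: here, the current snapshot id."""

    def __init__(self):
        self._current = 0
        self._lock = threading.Lock()

    def current(self):
        return self._current

    def compare_and_swap(self, expected, new):
        """Atomically move the pointer only if nobody else moved it first."""
        with self._lock:
            if self._current != expected:
                return False
            self._current = new
            return True


def commit_with_retry(catalog, max_retries=5):
    for attempt in range(1, max_retries + 1):
        base = catalog.current()   # snapshot this write is based on
        new_snapshot = base + 1    # stand-in for writing new metadata
        if catalog.compare_and_swap(base, new_snapshot):
            return new_snapshot, attempt
        # Conflict: another writer committed first, so re-read and retry.
    raise RuntimeError("too many conflicting commits")


catalog = Catalog()
base = catalog.current()                             # writer A reads snapshot 0
commit_with_retry(catalog)                           # writer B commits first
stale_ok = catalog.compare_and_swap(base, base + 1)  # A's stale commit is rejected
retried, attempts = commit_with_retry(catalog)       # A retries on the new base
print(stale_ok, retried, catalog.current())
```

Readers never take part in this exchange: they keep using whichever snapshot they started with, so a rejected or retried write does not disturb them.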
What is hidden partitioning in Iceberg?
Many table formats can group data by common properties. This partitioning lets queries skip irrelevant data and return results faster at a lower cost. However, formats like Hive force users to have a deep understanding of table structure and partitioning in order to prevent errors or inaccurate query results.
Iceberg hides aspects of partitioning from users by, for example, automating the creation of partition values and avoiding irrelevant partitions. Hidden partitioning lets Iceberg partitions evolve without affecting queries.
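As a rough sketch of the idea, the toy table below derives a daily partition value from a timestamp column on write and prunes partitions on read, so the caller only ever filters on the timestamp itself. The `Table` class and `day_transform` function are hypothetical and stand in for Iceberg's partition transforms; they are not its API.

```python
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Partition transform: derive the partition value from a timestamp."""
    return ts.date()


class Table:
    def __init__(self):
        self.partitions = {}  # partition value -> list of rows

    def insert(self, row):
        # The writer never supplies a partition column; it is derived.
        self.partitions.setdefault(day_transform(row["ts"]), []).append(row)

    def scan(self, ts_from: datetime, ts_to: datetime):
        """Filter on ts only; partition pruning happens behind the scenes."""
        lo, hi = day_transform(ts_from), day_transform(ts_to)
        scanned = [d for d in self.partitions if lo <= d <= hi]
        rows = [
            r
            for d in scanned
            for r in self.partitions[d]
            if ts_from <= r["ts"] <= ts_to
        ]
        return rows, scanned


t = Table()
t.insert({"id": 1, "ts": datetime(2024, 1, 1, 9, 30)})
t.insert({"id": 2, "ts": datetime(2024, 1, 2, 14, 0)})
t.insert({"id": 3, "ts": datetime(2024, 1, 3, 8, 15)})

rows, scanned = t.scan(datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 23, 59))
print([r["id"] for r in rows], scanned)
```

Since queries reference only `ts`, swapping `day_transform` for a different granularity would not require rewriting them, which is the property that lets partitioning evolve.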
Related reading: Iceberg Partitioning and Performance Optimizations in Trino
Iceberg vs Delta Lake vs Hudi vs Hive
Hive | Hudi | Delta Lake | Iceberg |
Original table format | Created for time/event series data | Open source doesn’t support concurrent writers | Hidden partitions |
Supports ORC, Parquet, JSON, etc. | Great for streaming use cases | Only supports Parquet | Metadata tree is more performant using Avro |
Partition columns must be part of table | Copy on write & merge on read | Can’t change partitioning | Partition and table evolution |
Relies heavily on the metastore | Table evolution, compaction, etc. | Checkpoints every 10 commits, so every 10th write is slower | Full DML |
 | Full DML | Associated with Databricks | Associated with Trino |
Advantages of Iceberg over other table formats
Data lakehouses are still relatively recent developments, with solutions based on Apache’s Iceberg or Hudi projects or Databricks’s Delta Lake format. These three options have similar functionality, but the devil is in the details.
- All three table formats are open source. Iceberg and Hudi are Apache projects, although Iceberg has the larger developer community.
- The Delta Lake format, while nominally open source, is primarily supported by Databricks, the corporation that first developed it.
- Amazon’s AWS, Microsoft Azure, Google Cloud, and other data platforms support all three to varying degrees.
- Ultimately, the right choice depends upon an enterprise’s existing infrastructure and data use cases.
Hive vs Iceberg
Feature | Apache Iceberg | Apache Hive |
Transaction support (ACID) | Yes | Limited |
File format | Parquet, ORC, Avro | Parquet, ORC, Avro, and more |
Schema evolution | Full | Partial |
Partition evolution | Yes | No |
Data versioning | Yes | No |
Time travel queries | Yes | No |
Concurrency control | Optimistic locking | Pessimistic locking |
Object store cost optimization | Yes | No |
Community and ecosystem | Growing | Established |
Developers at Netflix created Iceberg to address the challenges of using Apache Hive on the streaming service’s extensive data infrastructure.
Hive uses a subsystem called a metastore that points to a table’s data. However, it only points to the folder containing the relevant data files. That may be acceptable in a structured environment like a Hadoop-based data warehouse. Using this approach with object storage imposes stiff performance penalties, because the query engine must list the folder’s contents to find the files, and listing is slow on object stores.
Another performance hit comes from how Hive interacts with Hadoop, which relies on Java-based MapReduce jobs to process data. Few data consumers have the specialized Java and MapReduce skills needed to query Hadoop data stores. Hive provides HiveQL, an SQL-like language for generating Hadoop queries. However, this approach means every query requires a translation step from HiveQL into MapReduce jobs.
Netflix’s developers set out to create a new table format that addressed these and other issues Hive creates when analyzing petabytes of data. Eventually, Netflix handed the project over to the Apache Software Foundation, where it has flourished.
Both Iceberg and Hive tables can be used to query large datasets, but which one to choose depends on your use case.
Related reading: Hive vs Iceberg: How to migrate your Hive tables to Iceberg
Apache Iceberg vs Apache Parquet
Although both Apache Iceberg and Parquet are open source projects, they address different aspects of the data lakehouse architecture. Whereas Iceberg is an open table format, Parquet is an open file format for creating column-oriented data files on a data lake. This columnar structure compresses more efficiently than a row-oriented format like Avro, which reduces overall storage costs. In addition, Parquet files help speed queries by, for example, providing metadata that queries can use to skip irrelevant data.
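Skipping based on metadata can be sketched with per-file minimum and maximum values for a column. This is a simplified illustration: the `FileStats` class, the `amount` column, and the file names are hypothetical, and real Parquet statistics are kept per row group and per column inside each file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_amount: float
    max_amount: float


def files_to_read(stats, lower, upper):
    """Keep only files whose [min, max] range overlaps the filter range."""
    return [
        s.path
        for s in stats
        if s.max_amount >= lower and s.min_amount <= upper
    ]


# Hypothetical statistics for three files of an orders table.
stats = [
    FileStats("part-0.parquet", 0.0, 49.99),
    FileStats("part-1.parquet", 50.0, 199.99),
    FileStats("part-2.parquet", 200.0, 999.0),
]

# Equivalent of: WHERE amount BETWEEN 100 AND 150
selected = files_to_read(stats, 100, 150)
print(selected)
```

The ranges in this example do not overlap, which is what sorting data at write time tends to produce; with unsorted data the ranges overlap more and fewer files can be skipped.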
Build your open data lakehouse with Starburst and the Apache Iceberg open table format
Iceberg’s open table format lets you connect your data lakehouse to any query engine. Starburst’s modern data lake analytics platform lets you connect to any data source. Using Starburst to power the analytics of your Iceberg-based data lake makes data more manageable, optimizes compute and storage investments, and speeds time to insight for more effective decision-making.
Starburst is based on the Trino open source project’s massively parallel query engine but with optimizations designed to maximize the features of Iceberg data tables, including schema evolution, time travel, and partitioning.
Since different workloads work best with different table formats, we created Great Lakes, a connectivity feature of Starburst Galaxy. Great Lakes abstracts the details of a data lake’s table formats and file formats to simplify accessing tables, whether based on Iceberg, Hudi, or Delta Lake. Starburst’s Great Lakes lets data teams optimize their data lake architectures for different use cases. Data consumers can run SQL queries without having to know the details of each table’s format.
Demo: Iceberg and Trino
In this demo, we explore the combination of Apache Iceberg and Trino, two tools reshaping the landscape of big data. To do this, we’ll use Starburst Galaxy and compare it with Amazon Athena.