This post is part of the Apache Iceberg blog series. Read the entire series:
- Introduction to Apache Iceberg in Trino
- Iceberg Partitioning and Performance Optimizations in Trino
- Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
- Apache Iceberg Schema Evolution in Trino
- Apache Iceberg Time Travel & Rollbacks in Trino
- Automated maintenance for Apache Iceberg tables in Starburst Galaxy
- Improving performance with Iceberg sorted tables
- Hive vs. Iceberg: Choosing the best table format for your analytics workload
TL;DR
Apache Iceberg is an open source table format that brings high-performance database functionality to object storage such as AWS S3, Azure ADLS, Google Cloud Storage, and MinIO. This allows an organization to take advantage of low-cost, high-performing cloud storage while providing data warehouse features and a familiar experience to its end users, without being locked into a single vendor.
What is Apache Iceberg?
Apache Iceberg is an open table format, originally created at Netflix and now governed by the Apache Software Foundation, that provides database-style functionality on top of object stores such as Amazon S3. Iceberg allows organizations to finally build true data lakehouses, with reliable ACID transactions, in an open architecture, avoiding vendor and technology lock-in.
The excitement around Iceberg began last year and has grown considerably in 2022. Most of the customers and prospects I speak with each week are either considering migrating their existing Apache Hive tables to Iceberg or have already started. They are excited that a truly open source table format has emerged, with many engines, both open source and proprietary, jumping on board.
Advantages of the Iceberg Table Format
One of the best things about Iceberg is its broad adoption by many different engines. In the diagram below, you can see that many different technologies can work on the same set of data as long as they use the open source Iceberg API. The breadth of engines and the work each has invested are a strong indicator of the usefulness this exciting technology brings.
With more and more technologies jumping on board, Iceberg isn't a passing fad. It has been growing in popularity not only because of how useful it is, but also because it is a truly open source table format: many companies have contributed to and helped improve the specification, making it a real community-based effort.
Here is a list of the many features Iceberg provides:
| Feature | Description |
| --- | --- |
| Choose your engine | As you can see from the diagram above, many engines support Iceberg. This offers the ultimate flexibility to own your own datasets and choose the engine that fits your use cases. |
| Avoid data lock-in | The data that Iceberg and these engines work on is YOUR data in YOUR account, which avoids data lock-in. |
| Avoid vendor lock-out | Iceberg metadata is always available to all engines, so consistency is guaranteed even with multiple writers. |
| DML (modifying table data) | Modifying data in Hadoop was a huge challenge. With Iceberg, data can easily be modified to satisfy use cases and compliance requirements such as GDPR. |
| Schema evolution | Much like a database, Iceberg supports full schema evolution, including columns and even partitions. |
| Performance | Since Iceberg stores a table's state in a snapshot, the engine simply reads the metadata in that snapshot and then retrieves only the required data from storage, saving valuable time and reducing cloud object store retrieval costs. |
| Database feel | Partitioning can be performed on any column, and end users query Iceberg tables just like they would a database. |
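To make the DML and schema evolution rows above concrete, here is a minimal sketch of the corresponding Trino SQL (the orders_iceberg table and its columns are hypothetical):

-- modify rows in place (DML)
UPDATE orders_iceberg SET order_status = 'SHIPPED' WHERE order_id = 1001;
DELETE FROM orders_iceberg WHERE create_date < DATE '2015-01-01';

-- evolve the schema without rewriting existing data
ALTER TABLE orders_iceberg ADD COLUMN discount_pct DOUBLE;
ALTER TABLE orders_iceberg RENAME COLUMN order_status TO status;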
Apache Iceberg Architecture
Iceberg is a layer of metadata over your object storage. It provides a transaction log per table, very similar to a traditional database. This log keeps track of the current state of the table, including any modifications. It also keeps a current "snapshot" of the files that belong to the table, along with statistics about them, in order to greatly reduce the amount of data that needs to be read during queries, improving performance.
Snapshots
Every time a modification to an Iceberg table is performed (insert, update, delete, etc.), a new snapshot of the table is created. When an Iceberg client (the Trino query engine, let's say) wants to query a table, the latest snapshot is read and the files that "belong" to that snapshot are retrieved. This enables a very powerful feature called time travel: because the table retains a set of snapshots over time, any previous state can be queried with the proper syntax.
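In Trino, that syntax is the FOR VERSION AS OF and FOR TIMESTAMP AS OF clauses. A minimal sketch, assuming a table named orders_iceberg (the snapshot ID is an example value taken from the table's snapshot history):

-- list the snapshots recorded for the table
SELECT snapshot_id, committed_at FROM "orders_iceberg$snapshots";

-- query the table as of a specific snapshot ID
SELECT count(*) FROM orders_iceberg FOR VERSION AS OF 8954597067493422955;

-- query the table as it existed at a point in time
SELECT count(*) FROM orders_iceberg FOR TIMESTAMP AS OF TIMESTAMP '2022-10-01 00:00:00 UTC';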
Under the covers, Iceberg uses a set of Avro-based files to keep track of this metadata. A Hive-compatible metastore is used to "point" to the latest metadata file that has the current state of the table. All engines that want to interact with the table first get the latest "pointer" from the metastore, then start interacting with the Iceberg metadata files from there.
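You can peek at this metadata directly from Trino through the connector's hidden metadata tables; a small sketch, again using the hypothetical orders_iceberg table:

-- the data files that belong to the current snapshot, with per-file statistics
SELECT file_path, record_count, file_size_in_bytes FROM "orders_iceberg$files";

-- the history of snapshots that have been the table's current state
SELECT * FROM "orders_iceberg$history";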
Here is a very basic diagram of the different files that are created during a CTAS (create table as select):
Metadata File Pointer (fp1) – This is an entry in a Hive-compatible metastore (AWS Glue, for example) that points to the current metadata file. This is the starting point for any query against an Iceberg table.
Metadata File (mf1) – A JSON file that contains the latest version of a table. Any change made to the table creates a new metadata file. The contents of this file are simply lists of manifest list files, along with some high-level metadata.
Manifest List (ml1) – A list of the manifest files that make up a snapshot. This also includes metadata such as partition bounds, used to skip files that do not need to be read for the query.
Manifest File (mf1) – Lists a set of data files along with metadata about them. This is the final step for a query: only the files that actually need to be read are determined using these manifests, saving valuable query time.
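For reference, a CTAS like the following produces the file layout shown below (a minimal sketch; the TPC-H catalog used as the source table is an assumption):

CREATE TABLE customer_iceberg AS
SELECT * FROM tpch.sf1.customer;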
Here is a sample table named customer_iceberg that was created on S3:
customer_iceberg-a0ae01bc83cb44c5ad068dc3289aa1b9/
data/
20221005_142356_18493_dnvqc-43a7f422-d402-41d8-aab3-38d88f9a8810.orc
20221005_142356_18493_dnvqc-548f81e0-b9c3-4015-99a7-d0f19416e39c.orc
metadata/
00000-8364ea6c-5e89-4b17-a4ea-4187725b8de6.metadata.json
54d59fe-8368-4f5e-810d-4331dd3ee243-m0.avro
snap-2223082798683567304-1-88c32199-6151-4fc7-97d9-ed7d9172d268.avro
Table directory – this is the name of the table with a unique UUID appended in order to support table renames
Data directory – this holds the data files in ORC, Parquet, or Avro format, and could contain subdirectories depending on whether the table is partitioned
Metadata directory – this directory holds the metadata, manifest list, and manifest files covered above
Again, this might be too nitty-gritty for the average user, but the point is that a tremendous amount of thought and work has been put into Iceberg to ensure it can handle many different types of analytical queries along with real-time ingestion. It was built to fill the gap between low-cost cloud object stores and demanding processing engines such as Trino (formerly Presto SQL) and Apache Spark.
Partitioning
Using partitions in Iceberg is just like working with a database. Most data you ingest into your data lake has a timestamp, and partitioning by that column is very easy.
Example – partition by month of a timestamp column and by region (note that each partition transform is its own array element; the source table orders is hypothetical):
CREATE TABLE orders_iceberg
WITH (partitioning = ARRAY['month(create_date)', 'region'])
AS SELECT * FROM orders;
Querying with a standard WHERE clause against the partitioned column results in partition pruning and much higher performance:
SELECT * FROM orders_iceberg
WHERE CAST(create_date AS date) BETWEEN DATE '1993-06-01' AND DATE '1993-11-30';
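To verify how the data was laid out, the connector's hidden $partitions table shows one row per partition; a small sketch:

-- per-partition row counts and file counts
SELECT partition, record_count, file_count FROM "orders_iceberg$partitions";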
Trino Iceberg Support
Trino has full support for Iceberg with a feature matrix listed below:
| Feature | Supported |
| --- | --- |
| Create table | ✔ |
| Modify table (update/delete/merge) | ✔ |
| Add/drop/modify table column | ✔ |
| Rename table | ✔ |
| Rollback to previous snapshot | ✔ |
| View support (includes AWS Glue) | ✔ |
| Time travel | ✔ |
| Maintenance (optimize/expire snapshots) | ✔ |
| Alter table partition | ✔ |
| Metadata queries | ✔ |
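A few of these features are worth seeing as SQL. A minimal sketch of the maintenance and rollback commands, assuming a catalog named iceberg and the hypothetical orders_iceberg table (the snapshot ID is an example value):

-- compact small files into larger ones
ALTER TABLE orders_iceberg EXECUTE optimize;

-- remove snapshots older than the retention threshold
ALTER TABLE orders_iceberg EXECUTE expire_snapshots(retention_threshold => '7d');

-- roll the table back to an earlier snapshot
CALL iceberg.system.rollback_to_snapshot('my_schema', 'orders_iceberg', 8954597067493422955);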
Using Iceberg in Trino is very easy. There is a dedicated connector page located here. If you are using Starburst Galaxy, a fully managed Trino platform, Iceberg is already included in the Great Lakes connector, so there isn't anything you need to add. It's even the default table format, which shows how excited and confident we are at Starburst about it.
We highly encourage our customers, prospects, and open source Trino users to explore and start using this technology. The need for a proprietary cloud data warehouse is greatly reduced with the addition of Iceberg and its robust ecosystem of engines. I believe we're at the "tip of the iceberg" in the wide adoption of this open source project.
Follow and take part in the quickly growing Trino Slack community and don’t forget to join the #iceberg channel as well.