Trino on Ice IV: Deep Dive Into Iceberg Internals

Welcome to the Trino on ice series, covering the details around how the Iceberg table format works with the Trino query engine. The examples build on each previous post, so it’s recommended to read the posts sequentially and reference them as needed later.

Share

Trino on ice I: A gentle introduction to Iceberg
Trino on ice II: In-place table evolution and cloud compatibility with Iceberg
Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec
Trino on ice IV: Deep dive into Iceberg internals

night sky with a view of an iceberg

Welcome back to the Trino on Ice blog series that has so far covered some very interesting high level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some of the implementation details of Iceberg by dissecting some of the files that result from various operations carried out using Trino.

To dissect you must use some surgical instrumentation, namely Trino, Avro tools, the MinIO client tool and Iceberg’s core library. It’s useful to dissect how these files work, not only to help understand how Iceberg works, but also to aid in troubleshooting issues, should you have any issues during ingestion or querying of your Iceberg table. I like to think of this type of debugging much like a fun game of operation and you’re looking to see what causes the red errors to fly by on your screen.

Understanding Iceberg metadata

Iceberg can use any compatible metastore, but for Trino, it only supports the Hive metastore and AWS Glue similar to the Hive connector. This is because there is already a vast amount of testing and support for using the Hive metastore in Trino. Likewise, many Trino use cases that currently use data lakes already use the Hive connector and therefore the Hive metastore. This makes it convenient to have as the leading supported use case as existing users can easily migrate between Hive to Iceberg tables. Since there is no indication of which connector is actually executed in the diagram of the Hive connector architecture, it serves as a diagram that can be used for both Hive and Iceberg. The only difference is the plugin used, but if you create a table in Hive, you will also be able to view the same table in Iceberg.

Iceberg metadata diagram of runtime, metastore and file storage

 

To recap the steps taken from the first three blogs; the first blog created an events table, while the first two blogs ran two insert statements. The first insert contained three records, while the second insert contained a single record.

 

Iceberg metadata diagram of inserting records

Up until this point, the state of the files in MinIO haven’t really been shown except some of the manifest list pointers from the snapshot in the third blog post. Using the MinIO client tool, list the files that Iceberg generated through all these operations and then try to understand what purpose they are serving.
f

% mc tree -f local/

local/

└─ iceberg

   └─ logging.db

      └─ events

         ├─ data

         │  ├─ event_time_day=2021-04-01

         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc

         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc

         │  └─ event_time_day=2021-04-02

         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc

         └─ metadata

            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json

            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json

            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json

            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro

            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro

            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro

            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro

            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro

There are a lot of files here, but here are a couple of patterns that you will see with these files.

First, the top two directories are named data and metadata

/<bucket>/<database>/<table>/data/
/<bucket>/<database>/<table>/metadata/

 

As you might expect, data contains the actual ORC files split by partition. This is akin to what you would see in a Hive table data directory. What is really of interest here is the metadata directory. There are specifically three patterns of files you’ll find here.

/<bucket>/<database>/<table>/metadata/<file-id>.avro
/<bucket>/<database>/<table>/metadata/snap-<snapshot-id>-<version>-<file-id>.avro
/<bucket>/<database>/<table>/metadata/<version>-<commit-UUID>.metadata.json

 

Iceberg has a persistent tree structure that manages various snapshots of the data that are created for every mutation of the data. This enables not only a concurrency model that supports serializable isolation, but also cool features like time travel across a linear progression of snapshots.

Iceberg tree structure

 

This tree structure contains two types of avro files, manifest lists and manifest files. Manifest list files contain pointers to various manifest files and the manifest files themselves point to various data files. This blog starts out by covering these manifest files, and later covers the table metadata files that are suffixed by .metadata.json

The last blog covered the comand in Trino that shows the snapshot information that is stored in the metastore. Here is that command and its output again for your review.

SELECT manifest_list 

FROM iceberg.logging."events$snapshots";

 

Result:

s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro

s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro

s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro

 

You’ll notice that the manifest list returns the Avro files prefixed with snap- are returned. These files are directly correlated with the snapshot record stored in the metastore. According to the diagram above, snapshots are records in the metastore that contain the url of the manifest list in the Avro file. Avro files are binary files and not something you can just open up in a text editor to read. Using the avro-tools.jar tool distributed by the Apache Avro project, you can actually inspect the contents of this file to get a better understanding of how it is used by Iceberg.

 

The first snapshot is generated on the creation of the events table. Upon inspecting this file, you will notice that the file is empty. The output is an empty line that the jq json command line utility removes on pretty printing the json that is returned, which is just a newline. This snapshot represents an empty state of the table upon creation. In order to investigate these snapshots, it should also be noted that you will need to download these files on your local filesystem. In this blog we’ll just move them to the home directory ~/.

 

% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .

 

The second snapshot is a little more interesting and actually shows us the contents of a manifest list.

% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .
{
  "manifest_path": "s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro",
  "manifest_length": 6114,
  "partition_spec_id": 0,
  "added_snapshot_id": {
    "long": 2720489016575682000
  },
  "added_data_files_count": {
    "int": 2
  },
  "existing_data_files_count": {
    "int": 0
  },
  "deleted_data_files_count": {
    "int": 0
  },
  "partitions": {
    "array": [
      {
        "contains_null": false,
        "lower_bound": {
          "bytes": "\u001eI\u0000\u0000"
        },
        "upper_bound": {
          "bytes": "\u001fI\u0000\u0000"
        }
      }
    ]
  },
  "added_rows_count": {
    "long": 3
  },
  "existing_rows_count": {
    "long": 0
  },
  "deleted_rows_count": {
    "long": 0
  }
}

 

To understand each of the values in each of these rows, you can refer to the Iceberg specification in the manifest list file section. Instead of covering these exhaustively, the focus will be on a few key fields. Below are the fields, and their definition according to the specification.

manifest_path – Location of the manifest file.

partition_spec_id – ID of a partition spec used to write the manifest; must be listed in table metadata partition-specs.

added_snapshot_id – ID of the snapshot where the manifest file was added.

partitions – A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec.

added_rows_count – Number of rows in all files in the manifest that have status ADDED, when null this is assumed to be non-zero.

As mentioned above, manifest lists hold references to various manifest files. These manifest paths are the pointers in the persistent tree that tells any client using Iceberg where to find all of the manifest files associated with a particular snapshot. To traverse this tree, you can look over the different manifest paths to find all the manifest files associated with the particular snapshot you want to traverse. Partition spec ids are helpful to know the current partition specification which are stored in the table metadata in the metastore. This references where to find the spec in the metastore. Added snapshot ids tells you which snapshot is associated with the manifest list. Partitions hold some high level partition bound information to make for faster querying. If a query is looking for a particular value, it will only traverse the manifest files where the query values fall within the range of the file values. Finally, you will get a few metrics like the number of changed rows and data files, one of which is the count of added rows. The first operation consisted of three rows inserts and the second operation was the insertion of one row. Using the row counts you can easily determine which manifest file belongs to which operation.

The following command shows the final snapshot after both operations executed and filters out only the fields pointed out above.

% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq '. | {manifest_path: .manifest_path, partition_spec_id: .partition_spec_id, added_snapshot_id: .added_snapshot_id, partitions: .partitions, added_rows_count: .added_rows_count }'
{
  "manifest_path": "s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro",