Last Updated: 2024-04-19

Background

A "table format" is an open-source mechanism that manages and tracks all the files and metadata that make up a table. Apache Hive is the first-generation table format which over time has been found to have many limitations.

To address the limitations of Hive, modern table formats such as Apache Iceberg, Delta Lake, and Apache Hudi were designed. These modern formats offer features like ACID transactions, schema evolution, time travel, and improved performance for certain types of operations.

Although Starburst supports all three modern table formats, we recommend Iceberg as our top choice.

Scope of tutorial

In this tutorial, you will learn about the Iceberg table format and its numerous benefits. You'll discover how to create Iceberg tables, add records, and query values using Iceberg. Additionally, you'll explore the metadata tables to understand the information stored for enabling versioning with snapshots.

Learning objectives

Once you've completed this tutorial, you will be able to:

Use Starburst Galaxy to create and populate Iceberg tables.
Use Starburst Galaxy to query Iceberg tables.
Explain what information is stored in the Iceberg metadata tables.

Prerequisites

You need a Starburst Galaxy account to complete this tutorial. Please see Starburst Galaxy: Getting started for instructions on setting up a free account.
You must complete the prerequisite tutorial titled Configure a Starburst Galaxy data lake catalog and schema before attempting this tutorial. It will guide you in setting up the required environment for this tutorial.

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

The following video walks through all the steps in this tutorial.

You can choose to watch the video and follow along using your own account. Alternatively, if you prefer, you can skip the video and proceed directly to the step-by-step instructions provided later in the tutorial.

Background

You're going to begin by signing in to Starburst Galaxy and setting your role.

This is a quick step, but an important one.

Step 1: Sign into Starburst Galaxy

Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.

Step 2: Set your role

Your current role is listed in the top right-hand corner of the screen.

Check your role, to ensure that it is set to accountadmin.
If it is set to anything else, use the drop-down menu to select the correct role.

Background

In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.

Step 1: Set catalog and schema

You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.

Hover over the left-hand navigation menu to expand it.
Select Query>>Query editor.
Select the aws-us-east-1 cluster, tmp_cat catalog, and tmp_first_last_postalcode schema.

Step 2: Create Iceberg table

Now you're going to use DDL, or data definition language, to create a new table called my_iceberg_tbl.

Copy and paste the following SQL into the Query editor window:

CREATE TABLE my_iceberg_tbl (
   id integer,
   name varchar(55),
   description varchar(255)
) WITH (
   TYPE = 'iceberg', FORMAT = 'parquet'
);

Note the table TYPE and file FORMAT specified in the WITH clause. If you set your default table format to Iceberg during catalog creation, you could safely omit the TYPE.
Click the Run (limit 1000) button to execute the SQL.

Step 3: Add records to table

It's time to populate the table. We're going to add three records for characters from the popular film series Dune.

Run the following SQL to add the records to your table:

INSERT INTO my_iceberg_tbl
 (id, name, description)
VALUES
 (101, 'Leto', 'Ruler of House Atreides'),
 (102, 'Jessica', 'Consort of the Duke'),
 (103, 'Paul', 'Son of Leto (aka Dale Cooper)');

Run a simple query to verify that the new rows are present:

SELECT * FROM my_iceberg_tbl;

Background

Iceberg stores table metadata such as schema, partitioning details, statistics, and versioning information on the data lake alongside the actual data files. The following diagram presents the richness of this architectural approach.

A detailed explanation of this architecture can be found in the Apache Iceberg Specification. The key takeaways are as follows:

Data and metadata live together inside the table's data lake table location.
The metadata subdirectory contains JSON and AVRO files to represent multiple versions of a table.
The data subdirectory contains the actual data files (supports ORC, Parquet, and AVRO) that span all the various versions identified in the metadata.
Snapshots (i.e. versions) are created anytime the structure or the data changes.

You can easily query the Iceberg metadata tables to avoid having to inspect the metadata files directly from the data lake.

Step 1: Query `$history` metadata table

The $history table provides a log of the metadata changes performed on the Iceberg table. Let's take a look at how the changes to my_iceberg_tbl were captured.

Run the following SQL in the Query editor window:

SELECT made_current_at,
       snapshot_id, parent_id
  FROM "my_iceberg_tbl$history"
 ORDER BY made_current_at;

The first snapshot_id in the list refers to the snapshot that was created when the table was created. The next one is from the INSERT statement execution. Notice that its parent_id value is the same as the snapshot_id from the row before it.

Step 2: Query `$files` metadata table

The $files metadata table provides a detailed overview of the data files in the current snapshot.

Run the following query to see that there is currently only one data file:

SELECT
 substring(file_path, position('/data/' IN file_path) + 6)
   AS file_path,
 record_count,
 value_counts,
 null_value_counts,
 lower_bounds,
 upper_bounds
FROM
 "my_iceberg_tbl$files";

Here is an explanation of some of the information that was returned by this query:

file_path: A specific file name. The remainder of the columns relate to this particular file.
record_count: The number of records in the file. Ours has three.
value_counts: The number of values in each column. Each of our columns has three values.
null_value_counts: The number of NULL values in each column. The count is zero for each of our columns.
lower_bounds: The lowest value in each column. For example, the id field has 101 as a lower bound.
upper_bounds: The highest value in each column. For example, the id field has 103 as an upper bound.

Step 3: Add additional records to table

You can see snapshots in action by adding records to the Iceberg table and then querying the metadata tables again.

Run the following SQL to add records to your Iceberg table:

INSERT INTO my_iceberg_tbl
 (id, name, description)
VALUES
 (104, 'Thufir', 'Mentat'),
 (201, 'Vladimir', 'Ruler of House Harkonnen'),
 (202, 'Rabban', 'Ruthless nephew of Vladimir'),
 (203, 'Feyd-Rautha', 'Savvy nephew of Vladimir (played by Sting)'),
 (301, 'Reverend Mother Gaius Helen Mohiam', null);

Verify that the new records were added:

SELECT * FROM my_iceberg_tbl

Step 5: Query `$files` metadata table again

It's time to query the $files metadata table again to see what has changed now that you've added five records to your Iceberg table.

Run the following query to see that a new file has been added:

SELECT
 substring(file_path, position('/data/' IN file_path) + 6)
   AS file_path,
 record_count,
 value_counts,
 null_value_counts,
 lower_bounds,
 upper_bounds
FROM
 "my_iceberg_tbl$files";

Here is a summary of the information returned by the new file:

record_count: The record count is now 5.
value_counts: Each column now has 5 values.
null_value_counts: One of the columns now has a NULL value.
lower_bounds: The id field now has a lower bound of 104.
upper_bounds: The id field now has an upper bound of 301.

Step 6: Verify additional snapshot was created

You can execute the query against the $history table again to verify that a new snapshot was created.

Run the following query:

SELECT made_current_at,
       snapshot_id, parent_id
  FROM "my_iceberg_tbl$history"
 ORDER BY made_current_at;

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You should now have a better understanding of how to create and populate Iceberg tables and some of the basic information available via the metadata tables.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!