Last Updated: 2024-04-19
A "table format" is an open-source mechanism that manages and tracks all the files and metadata that make up a table. Apache Hive is the first-generation table format which over time has been found to have many limitations.
To address the limitations of Hive, modern table formats such as Apache Iceberg, Delta Lake, and Apache Hudi were designed. These modern formats offer features like ACID transactions, schema evolution, time travel, and improved performance for certain types of operations.
Although Starburst supports all three modern table formats, we recommend Iceberg as our top choice.
In this tutorial, you will learn about the Iceberg table format and its numerous benefits. You'll discover how to create Iceberg tables, add records, and query values using Iceberg. Additionally, you'll explore the metadata tables to understand the information stored for enabling versioning with snapshots.
Once you've completed this tutorial, you will be able to:
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
The following video walks through all the steps in this tutorial.
You can choose to watch the video and follow along using your own account. Alternatively, if you prefer, you can skip the video and proceed directly to the step-by-step instructions provided later in the tutorial.
You're going to begin by signing in to Starburst Galaxy and setting your role.
This is a quick step, but an important one.
Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.
Your current role is listed in the top right-hand corner of the screen.
In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.
You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.
aws-us-east-1
cluster, tmp_cat
catalog, and tmp_first_last_postalcode
schema. Now you're going to use DDL, or data definition language, to create a new table called my_iceberg_tbl
.
CREATE TABLE my_iceberg_tbl (
id integer,
name varchar(55),
description varchar(255)
) WITH (
TYPE = 'iceberg', FORMAT = 'parquet'
);
TYPE
and file FORMAT
specified in the WITH
clause. If you set your default table format to Iceberg during catalog creation, you could safely omit the TYPE
. It's time to populate the table. We're going to add three records for characters from the popular film series Dune.
INSERT INTO my_iceberg_tbl
(id, name, description)
VALUES
(101, 'Leto', 'Ruler of House Atreides'),
(102, 'Jessica', 'Consort of the Duke'),
(103, 'Paul', 'Son of Leto (aka Dale Cooper)');
SELECT * FROM my_iceberg_tbl;
Iceberg stores table metadata such as schema, partitioning details, statistics, and versioning information on the data lake alongside the actual data files. The following diagram presents the richness of this architectural approach.
A detailed explanation of this architecture can be found in the Apache Iceberg Specification. The key takeaways are as follows:
You can easily query the Iceberg metadata tables to avoid having to inspect the metadata files directly from the data lake.
$history
metadata tableThe $history
table provides a log of the metadata changes performed on the Iceberg table. Let's take a look at how the changes to my_iceberg_tbl
were captured.
SELECT made_current_at,
snapshot_id, parent_id
FROM "my_iceberg_tbl$history"
ORDER BY made_current_at;
The first snapshot_id
in the list refers to the snapshot that was created when the table was created. The next one is from the INSERT
statement execution. Notice that its parent_id
value is the same as the snapshot_id
from the row before it.
$files
metadata tableThe $files
metadata table provides a detailed overview of the data files in the current snapshot.
SELECT
substring(file_path, position('/data/' IN file_path) + 6)
AS file_path,
record_count,
value_counts,
null_value_counts,
lower_bounds,
upper_bounds
FROM
"my_iceberg_tbl$files";
Here is an explanation of some of the information that was returned by this query:
file_path
: A specific file name. The remainder of the columns relate to this particular file.record_count
: The number of records in the file. Ours has three.value_counts
: The number of values in each column. Each of our columns has three values.null_value_counts
: The number of NULL
values in each column. The count is zero for each of our columns.lower_bounds
: The lowest value in each column. For example, the id
field has 101
as a lower bound.upper_bounds
: The highest value in each column. For example, the id
field has 103
as an upper bound.You can see snapshots in action by adding records to the Iceberg table and then querying the metadata tables again.
INSERT INTO my_iceberg_tbl
(id, name, description)
VALUES
(104, 'Thufir', 'Mentat'),
(201, 'Vladimir', 'Ruler of House Harkonnen'),
(202, 'Rabban', 'Ruthless nephew of Vladimir'),
(203, 'Feyd-Rautha', 'Savvy nephew of Vladimir (played by Sting)'),
(301, 'Reverend Mother Gaius Helen Mohiam', null);
SELECT * FROM my_iceberg_tbl
$files
metadata table againIt's time to query the $files
metadata table again to see what has changed now that you've added five records to your Iceberg table.
SELECT
substring(file_path, position('/data/' IN file_path) + 6)
AS file_path,
record_count,
value_counts,
null_value_counts,
lower_bounds,
upper_bounds
FROM
"my_iceberg_tbl$files";
Here is a summary of the information returned by the new file:
record_count
: The record count is now 5.value_counts
: Each column now has 5 values.null_value_counts
: One of the columns now has a NULL
value.lower_bounds
: The id
field now has a lower bound of 104
.upper_bounds
: The id
field now has an upper bound of 301
.You can execute the query against the $history
table again to verify that a new snapshot was created.
SELECT made_current_at,
snapshot_id, parent_id
FROM "my_iceberg_tbl$history"
ORDER BY made_current_at;
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You should now have a better understanding of how to create and populate Iceberg tables and some of the basic information available via the metadata tables.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!