Last Updated: 2024-05-03
Apache Iceberg is an open-source table format for huge analytic datasets. It offers several benefits for managing and querying large datasets efficiently, including, but not limited to:
In this tutorial, you will explore Iceberg's advanced features using an airplane dataset. You will learn how tables can evolve in specific ways, including renaming columns, adding additional columns, renaming partitions, and eliminating partitions. At each step, you will see what impact these changes have on the records in the table and investigate how these changes are tracked by Iceberg's metadata files. Finally, you will learn how to compact small files into fewer/larger ones.
Once you've completed this tutorial, you will be able to:
UPDATE TABLE
and ALTER
commands to modify Apache Iceberg tables.Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
The following video walks through all the steps in this tutorial.
You can choose to watch the video and follow along using your own account. Alternatively, if you prefer, you can skip the video and proceed directly to the step-by-step instructions provided later in the tutorial.
You're going to begin by signing in to Starburst Galaxy and setting your role.
This is a quick step, but an important one.
Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.
Your current role is listed in the top right-hand corner of the screen.
In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.
You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.
aws-us-east-1-free
cluster, tmp_cat
catalog, and tmp_first_last_postalcode
schema. Now you're going to use DDL, or data definition language, to create a new table called planes
.
CREATE TABLE planes (
tail_number varchar(15),
name varchar(150),
color varchar(15)
) WITH ( type = 'iceberg');
Let's add two records to the planes
table.
INSERT INTO planes (tail_number, name)
VALUES
('N707JT', 'John Travolta''s Boeing 707'),
('N1KE', 'Nike corp jet');
SELECT * FROM planes;
Schema evolution refers to the modification of tables as business rules and source systems are modified over time. Starburst Galaxy supports different modifications to tables including the table name itself, column, and partition changes.
You can use the ALTER TABLE
command to change the name of a column in an Iceberg table.
Notice that the name
column from the planes
table is really more of a "description". Let's change the name to reflect this.
name
column:ALTER TABLE planes RENAME COLUMN name TO description;
SELECT * FROM planes;
Add another plane to the table so that you have more data to work with.
INSERT INTO planes (tail_number, color, description)
VALUES
('N89TC', 'white',
'1975 Bombardier Learjet 35 w/Light Jet classification');
You can also use the ALTER TABLE
command to add additional columns to your table. Let's add four new columns.
ALTER TABLE planes ADD COLUMN class varchar(50);
ALTER TABLE planes ADD COLUMN year integer;
ALTER TABLE planes ADD COLUMN make varchar(100);
ALTER TABLE planes ADD COLUMN model varchar(100);
color
column, as it was determined to be of little importance:ALTER TABLE planes DROP COLUMN color;
In the previous step, you added and deleted columns, which introduced some NULL values to the existing records. You can use the UPDATE
command to modify the records accordingly.
UPDATE planes
SET class = 'Jet Airliner',
year = 1964,
make = 'Boeing',
model = '707-138B'
WHERE tail_number = 'N707JT';
UPDATE planes
SET class = 'Heavy Jet',
year = 2021,
make = 'Gulfstream',
model = 'G650'
WHERE tail_number = 'N1KE';
UPDATE planes
SET class = 'Light Jet',
year = 1975,
make = 'Bombardier',
model = 'Learjet 35',
description = null
WHERE tail_number = 'N89TC';
SELECT * FROM planes;
Partitioning is a way to organize and manage large volumes of data within a data lake. It involves dividing the data into logical segments based on certain criteria, such as date, location, or any other relevant attribute.
One of the big advantages of Iceberg is how it handles partitions, including allowing for partition evolution.
Imagine that your company has now decided to carry other types of aircraft, not just planes.
You can use ALTER TABLE
to rename the planes table to allow for additional types, such as helicopters, to be added. The new name for the table will be aircraft
.
ALTER TABLE planes RENAME TO aircraft;
ALTER TABLE aircraft
SET PROPERTIES partitioning = ARRAY['class'];
Add two new helicopter records to the updated table.
INSERT INTO aircraft
(tail_number, class, year, make, model, description)
VALUES
('N535NA', 'Helicopter', 1969, 'Sikorsky', 'UH-19D', 'NASA'),
('N611TV', 'Helicopter', 2022, 'Robinson', 'R66', null);
Let's check the $partitions
metadata table to ensure that the metadata for the new records is being collected properly. There should be two records in the new partition.
SELECT partition, record_count, file_count
FROM "aircraft$partitions";
The second row in the image above indicates that 6 rows were added when there was no partitioning defined. This proves that, as stated previously, changes to the partitioning strategy do not automatically repartition the data previously persisted.
Suppose that the needs of the business change. It is now determined that analysts are querying across specific years of manufacturing more than the class of the aircraft. You can change the partition strategy to account for this.
ALTER TABLE aircraft
SET PROPERTIES partitioning = ARRAY['year'];
SELECT partition, record_count, file_count
FROM "aircraft$partitions";
At this point, the aircraft table only has one aircraft for each year of manufacture. You can verify this with a quick query, then add some records to the table so that each year has more than one aircraft.
SELECT year, count() nbr_for_year
FROM aircraft
GROUP BY year ORDER BY year;
INSERT INTO aircraft
(tail_number, class, year, make, model, description)
VALUES
('dummy', 'unknown', 1964, 'acme', 'cool', null),
('dummy', 'unknown', 1969, 'acme', 'cool', null),
('dummy', 'unknown', 1975, 'acme', 'cool', null),
('dummy', 'unknown', 2021, 'acme', 'cool', null),
('dummy', 'unknown', 2022, 'acme', 'cool', null);
Because you added the new records after you updated the partition strategy, they should be persisted with the year
partition strategy. You can verify this by querying the $partitions
table.
SELECT partition, record_count, file_count, data
FROM "aircraft$partitions"
ORDER BY record_count;
Iceberg includes some file management features that help with performance. Traditional data lakes have scenarios where data is constantly being ingested. This data is written in small files because of the need to have it available to be queried immediately. This can hurt performance in any system that needs to read many small files, especially in cloud storage. Iceberg includes an optimize feature that combines small files into larger ones ensuring maximum performance when it comes to querying. This is known as compaction.
$files
metadata table The $files
metadata table references all files used by the current snapshot. Let's take a look at how many files are in the current snapshot for the aircraft
table.
SELECT file_path, record_count, file_size_in_bytes
FROM "aircraft$files";
The "merge on read" strategy leaves a sprawl of files as changes continue to be made. If the number of rows being updated or deleted is small for each operation, the size of the delete files and data files can be quite small as we see in the results above.
Even without running modification operations, the frequency and size of periodic, or streaming, ingestion can also create small files. This "small files problem" will ultimately affect performance and scalability. The solution to this is compaction, which is essentially reading multiple, smaller files and rebuilding and replacing them with fewer, larger ones.
Fortunately, we have a simple operation to trigger this maintenance that will eventually be needed. This operation uses a default value (100MB) to decide which files are considered small. Any file with a size below that threshold is included in the compaction operation.
Additionally, this compaction process is aware of the delete files and accounts for these by excluding the original row when rebuilding a file, eliminating the delete file, and rolling the net-new records that were created into larger files.
Let's perform compaction to optimize the size of the underlying files.
ALTER TABLE aircraft EXECUTE optimize;
$files
metadata table again:SELECT file_path, record_count, file_size_in_bytes
FROM "aircraft$files";
There are fewer files present now, as shown in the image below. These files are still very small in size due to the small number of records inserted into them. Nonetheless, the problem has been greatly reduced.
It might seem like fewer/larger files should have been written. Compaction created 5 files because the table's current partition strategy was utilized.
SELECT partition, record_count, file_count, data
FROM "aircraft$partitions"
ORDER BY record_count;
Did we just discover a relatively easy way to repartition pre-existing data after changing the partitioning strategy?
Hint: Yes we did!
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You should now have a better understanding of schema and partition evolution with Apache Iceberg. You've also begun exploring data file compaction with Apache Iceberg.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!