Explore advanced features of Apache Iceberg

1. Tutorial overview

Last Updated: 2024-05-03

Background

Apache Iceberg is an open-source table format for huge analytic datasets. It offers several benefits for managing and querying large datasets efficiently, including, but not limited to:

  • Schema Evolution: Iceberg supports schema evolution, allowing you to add, remove, or modify columns without rewriting the entire dataset. This feature is crucial for maintaining data integrity and compatibility with evolving data schemas.
  • Partitioning: Iceberg supports partitioning, which can significantly improve query performance by organizing data based on specified columns. This feature helps eliminate the need for full-table scans and improves query efficiency.
  • Performance Optimization: Iceberg is optimized for analytical workloads, offering features like columnar storage, data skipping, and metadata caching to improve query performance and reduce resource consumption.

Scope of tutorial

In this tutorial, you will explore Iceberg's advanced features using an airplane dataset. You will learn how tables can evolve in specific ways, including renaming columns, adding and removing columns, renaming the table, and changing the partitioning strategy. At each step, you will see what impact these changes have on the records in the table and investigate how these changes are tracked by Iceberg's metadata files. Finally, you will learn how to compact small files into fewer, larger ones.

Learning objectives

Once you've completed this tutorial, you will be able to:

  • Use UPDATE and ALTER TABLE commands to modify Apache Iceberg tables.
  • Explain how Iceberg handles partition evolution.
  • Use compaction to optimize an Iceberg table.

Prerequisites

Before starting, you should have completed the prerequisite tutorial that creates the catalog and schema used here.

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

2. Video: Explore advanced features of Apache Iceberg

The following video walks through all the steps in this tutorial.

You can choose to watch the video and follow along using your own account. Alternatively, if you prefer, you can skip the video and proceed directly to the step-by-step instructions provided later in the tutorial.

3. Sign into Starburst Galaxy and set Admin role

Background

You're going to begin by signing in to Starburst Galaxy and setting your role.

This is a quick step, but an important one.

Step 1: Sign into Starburst Galaxy

Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.

Step 2: Set your role

Your current role is listed in the top right-hand corner of the screen.

  • Check your role to ensure that it is set to accountadmin.
  • If it is set to anything else, use the drop-down menu to select the correct role.

4. Create and populate Iceberg table

Background

In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.

Step 1: Set catalog and schema

You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.

  • Hover over the left-hand navigation menu to expand it.
  • Select Query>>Query editor.
  • Select the aws-us-east-1-free cluster, tmp_cat catalog, and tmp_first_last_postalcode schema.
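
If you prefer SQL, you can also set the session catalog and schema with a USE statement. This is a minimal sketch that assumes the catalog and schema names from the prerequisite tutorial; adjust them if yours differ:
USE tmp_cat.tmp_first_last_postalcode;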

Step 2: Create Iceberg table

Now you're going to use DDL, or data definition language, to create a new table called planes.

  • Copy and paste the following SQL into the Query editor window:
CREATE TABLE planes (
   tail_number varchar(15),
   name varchar(150),
   color varchar(15)
) WITH ( type = 'iceberg');
  • Click the Run (limit 1000) button to execute the SQL.

Step 3: Add records to table

Let's add two records to the planes table.

  • Run the following SQL to insert the two records into the table:
INSERT INTO planes (tail_number, name)
VALUES
 ('N707JT', 'John Travolta''s Boeing 707'),
 ('N1KE', 'Nike corp jet');
  • Check that the records were inserted successfully:
SELECT * FROM planes;

5. Explore schema evolution with Iceberg

Background

Schema evolution refers to the modification of tables as business rules and source systems change over time. Starburst Galaxy supports a range of table modifications, including changes to the table name, columns, and partitioning.

Step 1: Change column name

You can use the ALTER TABLE command to change the name of a column in an Iceberg table.

Notice that the name column from the planes table is really more of a "description". Let's change the name to reflect this.

  • Run the following SQL to rename the name column:
ALTER TABLE planes RENAME COLUMN name TO description;
  • Now query the table to confirm that the column name was changed:
SELECT * FROM planes;

Step 2: Add new record

Add another plane to the table so that you have more data to work with.

  • Run the following SQL to add the record:
INSERT INTO planes (tail_number, color, description)
VALUES
 ('N89TC', 'white',
  '1975 Bombardier Learjet 35 w/Light Jet classification');
  • Be sure to confirm that the new record was added successfully.

Step 3: Add additional columns

You can also use the ALTER TABLE command to add additional columns to your table. Let's add four new columns.

  • Run the following four SQL statements to add the columns. Hint: you can paste them all into one window and choose to Run all (limit 1000).
ALTER TABLE planes ADD COLUMN class varchar(50);
ALTER TABLE planes ADD COLUMN year integer;
ALTER TABLE planes ADD COLUMN make varchar(100);
ALTER TABLE planes ADD COLUMN model varchar(100);
  • Next, remove the color column, as it was determined to be of little importance:
ALTER TABLE planes DROP COLUMN color;
  • Check that the columns were added and removed as desired. For example, you can run the DESCRIBE statement shown below. Note that the existing records now contain NULL values for the new columns and will need to be updated to fit the new layout.
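
One way to confirm the new layout is to list the table's columns and types. DESCRIBE is standard Trino SQL, so it should work as-is here:
DESCRIBE planes;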

Step 4: Modify existing records

In the previous step, you added and deleted columns, which introduced some NULL values to the existing records. You can use the UPDATE command to modify the records accordingly.

  • Run the following SQL to modify the existing records to leverage the new column layout.
UPDATE planes
   SET class = 'Jet Airliner',
       year = 1964,
       make = 'Boeing',
       model = '707-138B'
 WHERE tail_number = 'N707JT';

UPDATE planes
   SET class = 'Heavy Jet',
       year = 2021,
       make = 'Gulfstream',
       model = 'G650'
 WHERE tail_number = 'N1KE';

UPDATE planes
   SET class = 'Light Jet',
       year = 1975,
       make = 'Bombardier',
       model = 'Learjet 35',
       description = null
 WHERE tail_number = 'N89TC';
  • Verify that the reporting is now correct:
SELECT * FROM planes;
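
Each INSERT and UPDATE you have run so far created a new Iceberg snapshot, and you can inspect these through the table's metadata tables. The query below is a hedged example that assumes the Trino Iceberg connector's $snapshots metadata table; note that schema changes such as the column rename are recorded in metadata files rather than as snapshots:
SELECT committed_at, snapshot_id, operation
  FROM "planes$snapshots"
 ORDER BY committed_at;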

6. Explore partition evolution with Iceberg

Background

Partitioning is a way to organize and manage large volumes of data within a data lake. It involves dividing the data into logical segments based on certain criteria, such as date, location, or any other relevant attribute.

One of the big advantages of Iceberg is how it handles partitions, including allowing for partition evolution.
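
For context, partitioning can also be declared when an Iceberg table is first created. The statement below is a minimal sketch using a hypothetical table name; in this tutorial you will instead add partitioning to the existing table:
CREATE TABLE example_aircraft (
   tail_number varchar(15),
   class varchar(50)
) WITH (
   type = 'iceberg',
   partitioning = ARRAY['class']
);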

Step 1: Rename and partition table

Imagine that your company has now decided to carry other types of aircraft, not just planes.

You can use ALTER TABLE to rename the planes table so that additional types, such as helicopters, can be added. The new name for the table will be aircraft.

  • Run the following command to rename the table:
ALTER TABLE planes RENAME TO aircraft;
  • Now, add partitioning on the class column:
ALTER TABLE aircraft
SET PROPERTIES partitioning = ARRAY['class'];
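
If you want to confirm that the property was applied, one option is to inspect the table definition. SHOW CREATE TABLE is standard Trino SQL, and the partitioning array should appear in the WITH clause of the output:
SHOW CREATE TABLE aircraft;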

Step 2: Add new records to table

Add two new helicopter records to the updated table.

  • Run the following SQL to add the two new records:
INSERT INTO aircraft
 (tail_number, class, year, make, model, description)
VALUES
 ('N535NA', 'Helicopter', 1969, 'Sikorsky', 'UH-19D', 'NASA'),
 ('N611TV', 'Helicopter', 2022, 'Robinson', 'R66', null);
  • As always, verify that the records were added successfully.

Step 3: Check partition metadata

Let's check the $partitions metadata table to ensure that the metadata for the new records is being collected properly. There should be two records in the new partition.

  • Run the following query to check the partitions:
SELECT partition, record_count, file_count
  FROM "aircraft$partitions";

The results also include a row showing that 6 records still sit in data files written before any partitioning was defined. This demonstrates that changing the partitioning strategy does not automatically repartition data that was previously persisted.

Step 4: Change partition strategy

Suppose that the needs of the business change. It is now determined that analysts query by specific years of manufacture more often than by the class of the aircraft. You can change the partition strategy to account for this.

  • Run the following SQL to change the partition strategy:
ALTER TABLE aircraft
SET PROPERTIES partitioning = ARRAY['year'];
  • Verify that this did not repartition any existing data by running the query from the last step again:
SELECT partition, record_count, file_count
  FROM "aircraft$partitions";

Step 5: Add records to table

At this point, the aircraft table only has one aircraft for each year of manufacture. You can verify this with a quick query, then add some records to the table so that each year has more than one aircraft.

  • Begin by querying the table to verify how many aircraft there are per year of manufacture:
SELECT year, count(*) AS nbr_for_year
  FROM aircraft
 GROUP BY year ORDER BY year;

  • Now, add some records to the table:
INSERT INTO aircraft
 (tail_number, class, year, make, model, description)
VALUES
 ('dummy', 'unknown', 1964, 'acme', 'cool', null),
 ('dummy', 'unknown', 1969, 'acme', 'cool', null),
 ('dummy', 'unknown', 1975, 'acme', 'cool', null),
 ('dummy', 'unknown', 2021, 'acme', 'cool', null),
 ('dummy', 'unknown', 2022, 'acme', 'cool', null);

Step 6: Verify partitioning for new records

Because you added the new records after you updated the partition strategy, they should be persisted with the year partition strategy. You can verify this by querying the $partitions table.

  • Run the following query to show that the 5 new records were added into separate partitions based on their year values:
SELECT partition, record_count, file_count, data
  FROM "aircraft$partitions"
 ORDER BY record_count;

7. Use compaction to optimize table performance

Background

Iceberg includes file management features that help with performance. In a traditional data lake, data is often ingested continuously and written as many small files so that it is available for querying immediately. Reading many small files hurts performance in any system, especially when the files live in cloud storage. Iceberg addresses this with an optimize feature that combines small files into fewer, larger ones to improve query performance. This process is known as compaction.

Step 1: Review $files metadata table

The $files metadata table references all files used by the current snapshot. Let's take a look at how many files are in the current snapshot for the aircraft table.

  • Run the following query to view the files:
SELECT file_path, record_count, file_size_in_bytes
 FROM "aircraft$files";

The "merge on read" strategy leaves a sprawl of files as changes continue to be made. If the number of rows being updated or deleted is small for each operation, the size of the delete files and data files can be quite small as we see in the results above.

Even without running modification operations, the frequency and size of periodic, or streaming, ingestion can also create small files. This "small files problem" will ultimately affect performance and scalability. The solution to this is compaction, which is essentially reading multiple, smaller files and rebuilding and replacing them with fewer, larger ones.

Fortunately, a simple operation triggers this maintenance when it is needed. By default, it uses a 100MB threshold to decide which files are considered small; any file below that size is included in the compaction operation.

Additionally, the compaction process is aware of delete files. When rebuilding a data file, it excludes the rows marked for deletion, eliminates the delete files, and rolls the remaining records into fewer, larger files.
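
If the default threshold does not suit your workload, it can be overridden when you run the operation. The statement below is a hedged example that assumes the Trino syntax for the file_size_threshold parameter; the next step uses the default instead:
ALTER TABLE aircraft EXECUTE optimize(file_size_threshold => '128MB');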

Step 2: Perform compaction

Let's perform compaction to optimize the size of the underlying files.

  • Run the following SQL to perform compaction:
ALTER TABLE aircraft EXECUTE optimize;
  • Now, check the $files metadata table again:
SELECT file_path, record_count, file_size_in_bytes
  FROM "aircraft$files";

There are fewer files present now. These files are still very small because only a handful of records were inserted into them. Nonetheless, the small files problem has been greatly reduced.

Step 3: Review partitions again

You might have expected even fewer, larger files to be written. Compaction created 5 files because the optimize operation rewrites data according to the table's current partition strategy, which is now based on year.

  • Run this query again to verify this happened.
SELECT partition, record_count, file_count, data
  FROM "aircraft$partitions"
 ORDER BY record_count;

Did we just discover a relatively easy way to repartition pre-existing data after changing the partitioning strategy?

Hint: Yes we did!
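
One caveat: optimize only rewrites files that fall below the size threshold, so very large pre-existing files would keep their original partition layout. If you ever need to repartition those as well, one approach is to raise the threshold above the largest file size, as in this hedged sketch using the same file_size_threshold parameter shown earlier:
ALTER TABLE aircraft EXECUTE optimize(file_size_threshold => '1GB');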

8. Tutorial wrap-up

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You should now have a better understanding of schema and partition evolution with Apache Iceberg. You've also begun exploring data file compaction with Apache Iceberg.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
