Automated Table Maintenance for Apache Iceberg Tables

July 18, 2024

Tom Nats
Director of Customer Solutions
Starburst
Michiel De Smet
Software Engineer
Starburst

Tom Nats
Director of Customer Solutions
Starburst
Michiel De Smet
Software Engineer
Starburst

More deployment options

Request Enterprise trial license key →

Table maintenance is necessary for Apache Iceberg tables in order to keep your data optimized and performant. That extra effort is worth the reward in order to utilize Apache Iceberg and Trino to apply data warehouse-like functionality to your own cloud object storage. To keep your Iceberg tables healthy in Trino, certain routine maintenance jobs are required.

These include:

Optimize: rewrites the content of a table so that small files are merged into fewer, larger files
Expire snapshots: removes all snapshots and related metadata and data files
Remove orphan files: removes all files from a table’s data directory that are not linked from metadata files and that are older than the set retention period

Each of these tasks needs to be performed periodically on these tables. This ensures optimal performance, the removal of out of date snapshots, and a retention threshold to ensure efficiency.

In this blog, we’ll show you how easy it is to manually create your own table-driven maintenance process using Iceberg tables. To do this, you will create an Iceberg table capable of storing parameters for each table as well as a python script to optimize and clean up old snapshots. The script can be executed using any scheduling/orchestration tool of your choice.

Image 1: Architecture diagram for automating table maintenance via a Python script

This article is also the 6th part of an 8 part blog series about Apache Iceberg. Here are the other sections if you are interested in learning more about Apache Iceberg, open source Trino, and Icehouse architecture.

Introduction to Apache Iceberg in Trino
Iceberg Partitioning and Performance Optimizations in Trino
Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
Apache Iceberg Schema Evolution in Trino
Apache Iceberg Time Travel & Rollbacks in Trino
Automated maintenance for Apache Iceberg tables in Starburst Galaxy
Improving performance with Iceberg sorted tables
Hive vs. Iceberg: Choosing the best table format for your analytics workload

Risks of Not Performing Table Maintenance

Apache Iceberg’s architecture is powerful, allowing for previously unheard of updates to the data lake like insert, delete, merge, and more. However, without table maintenance, you run the risk of slow performance due to the buildup of out of date snapshot files or small files that should be optimized into larger batches.

If you choose to forego table maintenance on your Iceberg tables, you are going to have lots of unnecessary data and metadata files in your data lake that won’t be utilized. This leads to:

Increased costs
Decreased performance
Increased time to query execution.

Creating a Table Definition

You’ll begin by defining a new Iceberg table to hold information about your other tables, in order to record the desired maintenance parameters for each table. The information in your table definition should include flags that set the criteria for the maintenance customized for each specific table.

The code below defines both the table and the flags.

CREATE TABLE IF NOT EXISTS iceberg_maintenance_schedule (
    table_name VARCHAR NOT NULL,
    should_analyze INTEGER,
    last_analyzed_on TIMESTAMP(6),
    days_to_analyze INTEGER,
    columns_to_analyze ARRAY(VARCHAR),
    should_optimize INTEGER,
    last_optimized_on TIMESTAMP(6),
    days_to_optimize INTEGER,
    should_expire_snapshots INTEGER,
    retention_days_snapshots INTEGER,
    should_remove_orphan_files INTEGER,
    retention_days_orphan_files INTEGER
)
WITH (
   type = 'ICEBERG'
);

Next, we’ll populate the table with our initial list of tables that we want to mange:

*** Note: the code below will create this table if it does not already doesn’t exist ***

insert into iceberg_maintenance_schedule values
('customer_iceberg',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
insert into iceberg_maintenance_schedule values
('orders_iceberg',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
insert into iceberg_maintenance_schedule values
('lineitem_iceberg',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
insert into iceberg_maintenance_schedule values
('iceberg_maintenance_schedule',1,NULL,7,NULL,1,NULL,7,1,7,1,7);

Now we have our table populated with 3 Iceberg tables. Notice we have our own maintenance table included as well.

As new Iceberg tables are created, the values from each table are inserted into this definition table. At the same time, the different options for each flag are added as well, outlining your desired maintenance schedule. This makes it very easy to add and remove new Iceberg tables into the maintenance process.

Note: This blog shows a simplified example that you can build upon. In production environments, each table would typically have different timings for each of the operations, as well as different schedules for each table.

Writing a Python Script

To implement those set maintenance parameters in the iceberg_maintenance_schedule table, you’ll use the Trino Python client to write a script. The script must satisfy three separate performance requirements.

First, execute an optimize
Next, delete old snapshots
Last, analyze commands logging the maintenance completion

For simplicity, here is a python script we have created for you, so you can experiment with your own as well. This python script, diagrammed below, handles the following tasks.

Image 2: Architecture diagram for task execution via the Python script

Reads the iceberg_maintenance_schedule table and processes each row individually.
Executes the following tasks on each row based on the parameters from the table:
1. Optimize
2. Remove old snapshots
3. Analyze
Updates the metadata columns. (For example, last_optimized)

Here is an example running the python script:

export NUM_WORKERS=10
export TRINO_HOST=tnats-aws.trino.galaxy.starburst.io
export TRINO_PORT=443
export TRINO_USER=tnats@starburstdata.com/accountadmin
export TRINO_PASSWORD=xxxxxxxxxxxxxxxxx
export TRINO_CATALOG=s3lakehouse
export TRINO_SCHEMA=demo_tpch
/usr/bin/python3 -m trino_iceberg_maintenance

If you were to use Starburst Galaxy to run your script, you get a similar result to the below example.

Making Maintenance Table Changes

The desired table maintenance requirements are flexible. Since our maintenance table is also an Iceberg table, we can easily make modifications to this table based on our needs. For example, if we wanted to keep a 14 day history, we would write a simple update to our table using the code below.

update iceberg_mx set retention_days = 14 where id = 1;

Now, our automated process will keep 14 days of history for time travel and revert back to any of those days if needed.

Setting Up Scheduling for Table Maintenance

While most of the work is already done, you still need to automate it away and forget about the process until the next update to the trino-python-client. Scheduling the execution of the python script can be done in a variety of ways using a tool of your choice. Usually I use something like Airflow or a similar orchestration tool but I found this neat little utility called Cronitor. They provide a very easy way to monitor cron jobs. The installation is very easy and for each execution of a cronjob, it provides a nice looking dashboard per job as well as different alert targets, email and Slack, to notify you.

Here are the steps to get started with Cronitor if you choose that as your orchestration tool:

Visit https://cronitor.io and sign up for a free account.

Install the cronitor program on your Linux VM or Mac.

curl https://cronitor.io/install-linux?sudo=1 -H "API-KEY: 
<UniqueKeyTheyGive You>" | sh

Now, simply run “cronitor discover” and it will go through any existing crontabs you have and ask you to name each one.
From there, you will get a nice dashboard showing you the different runs for each cronjob. You can also set up additional alert targets for your preferred communication method.

The below is a screenshot that shows my cronjob dashboard for my Iceberg maintenance table:

The crontab commands are below.

# cd /home/tnats/scripts/trino-iceberg-maintenance;./run.sh

0 0 * * * cronitor exec bOHpJi cd /home/tnats/scripts/trino-iceberg-maintenance;./run.sh

Note: My script runs once a day but you can adjust to run weekly or on any other schedule you choose.

The above process initiates the table maintenance process. This will allow you to enjoy all of the benefits of a fully-featured table format and peak performance during the maintenance process.

Automated Table Maintenance with Icehouse

The manual process outlined above is powerful and versatile. However, it also requires additional engineering work. This may still not work for you or your organization due to resourcing or prioritization concerns. Or, you are simply looking for an easier method to incorporate table maintenance in your data lake.

In these cases, we have also created an automated table maintenance method using an Icehouse architecture and Starburst Galaxy’s automated table maintenance. This approach provides a data warehouse-like experience on the data lake without any additional overhead.

In fact, automated table maintenance is one of the four pillars of the Icehouse, a data lakehouse architecture built with Trino as the query engine and Iceberg as the table format, providing a fully managed end-to-end open lakehouse platform.

If you are interested in offloading the burden of table maintenance, Starburst Galaxy has introduced data lake optimization – an easy way to set up and run common table maintenance on your Iceberg tables without requiring any additional development work. For a limited time, get three free months of table maintenance with Starburst Galaxy.

Instead of creating an entire new Iceberg maintenance table to manage these statistics, Starburst Galaxy will do this behind the scenes for you. This greatly simplifies the complexity associated with optimizing Iceberg tables, removing the manual lift. This approach will optimize your data size, delete orphaned files, delete data snapshots, and improve performance by collecting statistics about the data. Automated jobs can be run according to the schedule of your choice, in the time zone of your choice, giving you maximum automation and control.

For a limited time, Starburst Galaxy is offering free table maintenance on your Iceberg tables. Sign up for Starburst Galaxy before September and submit the automated maintenance form to start keeping your Iceberg tables performant and healthy.