
Apache Iceberg DML (update/delete/merge) & Maintenance in Trino

Tom Nats
Director of Customer Solutions
Starburst
Michiel De Smet
Software Engineer
Starburst
Table maintenance is necessary to keep your Apache Iceberg tables optimized and performant. The extra effort is worth the reward: it lets you use Apache Iceberg and Trino to apply data warehouse-like functionality to your own cloud object storage. To keep your Iceberg tables healthy in Trino, certain routine maintenance jobs are required.
These include:
- optimize: compacts small data files into larger ones
- analyze: collects table statistics for the cost-based optimizer
- expire_snapshots: removes snapshots older than a retention threshold
- remove_orphan_files: deletes files no longer referenced by any snapshot
Each of these tasks needs to be performed periodically on these tables. This keeps performance optimal, removes out-of-date snapshots, and enforces a retention threshold so storage stays efficient.
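For reference, here is what each of these tasks looks like when run by hand in Trino. These are the standard commands exposed by Trino's Iceberg connector; the table name and retention thresholds below are just examples, matching the 7-day defaults used later in this post.
-- compact small files into larger ones
ALTER TABLE customer_iceberg EXECUTE optimize;
-- collect table statistics for the cost-based optimizer
ANALYZE customer_iceberg;
-- remove snapshots older than the retention threshold
ALTER TABLE customer_iceberg EXECUTE expire_snapshots(retention_threshold => '7d');
-- delete files no longer referenced by any snapshot
ALTER TABLE customer_iceberg EXECUTE remove_orphan_files(retention_threshold => '7d');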
In this blog, we’ll show you how easy it is to manually create your own table-driven maintenance process for Iceberg tables. To do this, you will create an Iceberg table that stores maintenance parameters for each table, plus a Python script to optimize tables and clean up old snapshots. The script can be executed using any scheduling or orchestration tool of your choice.
Image 1: Architecture diagram for automating table maintenance via a Python script
This article is also the sixth part of an eight-part blog series about Apache Iceberg. The other installments cover Apache Iceberg, open source Trino, and Icehouse architecture in more depth if you are interested in learning more.
Apache Iceberg’s architecture is powerful, enabling previously impractical operations on the data lake such as row-level INSERT, UPDATE, DELETE, and MERGE. However, without table maintenance you run the risk of slow performance due to the buildup of out-of-date snapshot files and of small files that should be compacted into larger ones.
If you choose to forego table maintenance on your Iceberg tables, your data lake will accumulate lots of unnecessary data and metadata files that are never utilized. This leads to:
- slower queries, since the engine has to open and plan over many small files
- growing storage costs for snapshots and orphaned files you no longer need
- ever-larger metadata to process each time a table is queried
You’ll begin by defining a new Iceberg table that records the desired maintenance parameters for each of your other tables. The table definition includes flags that set the maintenance criteria customized for each specific table.
The code below defines both the table and the flags.
CREATE TABLE IF NOT EXISTS iceberg_maintenance_schedule (
    table_name VARCHAR NOT NULL,
    should_analyze INTEGER,             -- 1 = collect statistics for this table
    last_analyzed_on TIMESTAMP(6),      -- updated by the script after each ANALYZE
    days_to_analyze INTEGER,            -- minimum days between ANALYZE runs
    columns_to_analyze ARRAY(VARCHAR),  -- NULL = analyze all columns
    should_optimize INTEGER,            -- 1 = compact small files
    last_optimized_on TIMESTAMP(6),     -- updated by the script after each optimize
    days_to_optimize INTEGER,           -- minimum days between optimize runs
    should_expire_snapshots INTEGER,    -- 1 = expire old snapshots
    retention_days_snapshots INTEGER,   -- snapshot retention threshold, in days
    should_remove_orphan_files INTEGER, -- 1 = delete unreferenced files
    retention_days_orphan_files INTEGER -- orphan file retention threshold, in days
)
WITH (
    type = 'ICEBERG'
);
Next, we’ll populate the table with our initial list of tables that we want to manage.
Note: the CREATE TABLE statement above only creates the table if it does not already exist, so it is safe to re-run.
INSERT INTO iceberg_maintenance_schedule VALUES
('customer_iceberg', 1, NULL, 7, NULL, 1, NULL, 7, 1, 7, 1, 7);
INSERT INTO iceberg_maintenance_schedule VALUES
('orders_iceberg', 1, NULL, 7, NULL, 1, NULL, 7, 1, 7, 1, 7);
INSERT INTO iceberg_maintenance_schedule VALUES
('lineitem_iceberg', 1, NULL, 7, NULL, 1, NULL, 7, 1, 7, 1, 7);
INSERT INTO iceberg_maintenance_schedule VALUES
('iceberg_maintenance_schedule', 1, NULL, 7, NULL, 1, NULL, 7, 1, 7, 1, 7);
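A quick query confirms what the maintenance process will see:
SELECT table_name, should_optimize, retention_days_snapshots
FROM iceberg_maintenance_schedule
ORDER BY table_name;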
Now we have our table populated with rows for three data tables. Notice that our own maintenance table is included as a fourth row, so it gets maintained along with everything else.
As new Iceberg tables are created, a row for each one is inserted into this schedule table, with the flag values describing its desired maintenance schedule. This makes it very easy to add and remove Iceberg tables from the maintenance process.
Note: This blog shows a simplified example that you can build upon. In production environments, each table would typically have its own timing for each operation rather than a single shared schedule.
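For instance, a hot, frequently updated table could be compacted daily while everything else keeps the weekly default. The table name here is just an illustration:
UPDATE iceberg_maintenance_schedule
SET days_to_optimize = 1
WHERE table_name = 'orders_iceberg';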
To act on the maintenance parameters stored in the iceberg_maintenance_schedule table, you’ll write a script using the Trino Python client (trino-python-client). The script needs to do three things: read the schedule, run only the operations that are currently due, and record when each operation last ran.
For simplicity, here is a Python script we have created for you, which you can experiment with and extend. This Python script, diagrammed below, handles each of the maintenance tasks recorded in the schedule table.
Image 2: Architecture diagram for task execution via the Python script
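The full script covers all four operations; the following is a minimal sketch of just the optimize pass, to show the shape of the logic. It assumes the trino-python-client package and the same environment variables used in the run example below; the other operations follow the same pattern.
import os

from trino.auth import BasicAuthentication
from trino.dbapi import connect

# Connection settings come from the same environment variables used to run the script.
conn = connect(
    host=os.environ["TRINO_HOST"],
    port=int(os.environ.get("TRINO_PORT", "443")),
    user=os.environ["TRINO_USER"],
    http_scheme="https",
    auth=BasicAuthentication(os.environ["TRINO_USER"], os.environ["TRINO_PASSWORD"]),
    catalog=os.environ["TRINO_CATALOG"],
    schema=os.environ["TRINO_SCHEMA"],
)
cur = conn.cursor()

# Find tables whose optimize run is due: never optimized, or last optimized
# longer ago than the per-table interval.
cur.execute("""
    SELECT table_name
    FROM iceberg_maintenance_schedule
    WHERE should_optimize = 1
      AND (last_optimized_on IS NULL
           OR last_optimized_on < localtimestamp - days_to_optimize * INTERVAL '1' DAY)
""")
due_tables = [row[0] for row in cur.fetchall()]

for table_name in due_tables:
    # Table names come from our own schedule table, so inlining them is safe here.
    cur.execute(f"ALTER TABLE {table_name} EXECUTE optimize")
    cur.fetchall()  # drain the result so the statement runs to completion

    # Record the run so this table is skipped until it is due again.
    cur.execute(
        "UPDATE iceberg_maintenance_schedule "
        "SET last_optimized_on = localtimestamp "
        "WHERE table_name = ?",
        (table_name,),
    )
    cur.fetchall()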
Here is an example of running the Python script:
export NUM_WORKERS=10
export TRINO_HOST=tnats-aws.trino.galaxy.starburst.io
export TRINO_PORT=443
export TRINO_USER=tnats@starburstdata.com/accountadmin
export TRINO_PASSWORD=xxxxxxxxxxxxxxxxx
export TRINO_CATALOG=s3lakehouse
export TRINO_SCHEMA=demo_tpch
/usr/bin/python3 -m trino_iceberg_maintenance
If you use Starburst Galaxy to run your script, you will see a result similar to the example below.
The desired table maintenance requirements are flexible. Since our maintenance table is itself an Iceberg table, we can easily modify it as our needs change. For example, if we wanted to keep a 14-day snapshot history for a table, we would write a simple update against our schedule:
UPDATE iceberg_maintenance_schedule
SET retention_days_snapshots = 14
WHERE table_name = 'customer_iceberg';
Now our automated process will keep 14 days of history for that table, letting us time travel back to any of those days if needed.
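That history is what powers Iceberg time travel in Trino. The timestamp below is only an example:
SELECT *
FROM customer_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2024-05-01 00:00:00 UTC';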
While most of the work is already done, you still need to automate the process so you can set it and forget it. Scheduling the execution of the Python script can be done in a variety of ways using a tool of your choice. Usually I use something like Airflow or a similar orchestration tool, but I found a neat little utility called Cronitor that provides a very easy way to monitor cron jobs. Installation is simple, and for each execution of a cron job it provides a nice-looking dashboard per job, as well as different alert targets, such as email and Slack, to notify you.
Here are the steps to get started with Cronitor if you choose it as your orchestration tool. First, install the agent using the API key from your Cronitor account:
curl https://cronitor.io/install-linux?sudo=1 -H "API-KEY: <UniqueKeyTheyGiveYou>" | sh
Below is a screenshot of my cron job dashboard for the Iceberg maintenance job. The crontab entries are below.
# cd /home/tnats/scripts/trino-iceberg-maintenance;./run.sh
0 0 * * * cronitor exec bOHpJi cd /home/tnats/scripts/trino-iceberg-maintenance;./run.sh
Note: My script runs once a day, but you can adjust it to run weekly or on any other schedule you choose.
The process above puts table maintenance on autopilot, allowing you to enjoy all of the benefits of a fully featured table format while keeping performance at its peak.
The manual process outlined above is powerful and versatile. However, it requires additional engineering work, and that may not fit your organization's resourcing or priorities. Or you may simply be looking for an easier way to incorporate table maintenance into your data lake.
In these cases, we have also created an automated table maintenance method using an Icehouse architecture and Starburst Galaxy’s automated table maintenance. This approach provides a data warehouse-like experience on the data lake without any additional overhead.
In fact, automated table maintenance is one of the four pillars of the Icehouse, a data lakehouse architecture built with Trino as the query engine and Iceberg as the table format, providing a fully managed end-to-end open lakehouse platform.
If you are interested in offloading the burden of table maintenance, Starburst Galaxy has introduced data lake optimization – an easy way to set up and run common table maintenance on your Iceberg tables without requiring any additional development work. For a limited time, get three free months of table maintenance with Starburst Galaxy.
Instead of requiring an entirely new Iceberg maintenance table to manage these settings, Starburst Galaxy does this behind the scenes for you. This greatly reduces the complexity of optimizing Iceberg tables by removing the manual lift. It optimizes data file sizes, deletes orphaned files, expires old snapshots, and improves performance by collecting statistics about the data. Automated jobs can be run on the schedule of your choice, in the time zone of your choice, giving you maximum automation and control.
For a limited time, Starburst Galaxy is offering free table maintenance on your Iceberg tables. Sign up for Starburst Galaxy before September and submit the automated maintenance form to start keeping your Iceberg tables performant and healthy.