
Building lakehouse with dbt and Trino


Join Starburst on April 17th for the next iteration of our Live Demo Series
Michiel De Smet
Software Engineer
Starburst
Damian Owsianny
Software Engineer
Starburst
Michiel De Smet
Software Engineer
Starburst
Damian Owsianny
Software Engineer
Starburst
Since Trino 431 release, support for CREATE [OR REPLACE] TABLE
statements was added. The relevant issue is available here.
This new capability allows you to redefine tables within object storage connectors like Iceberg and Delta Lake. It enables atomic table replacements, ensuring that concurrent queries always access a valid version of the table. This guarantees seamless transitions and uninterrupted data availability.
Let’s take a closer look below
The CREATE [OR REPLACE] TABLE
statement allows specific object storage connectors (currently Iceberg and Delta Lake) to set a new version of the table with totally different data and structure while retaining the history of the table.
This change leverages these specific object storage connectors’ support for transactions (Iceberg and Delta), permitting atomic table replacements. This means that any queries concurrent with the CREATE OR REPLACE TABLE
operation use either the old or new table version.
Let’s go through an example (we are using Iceberg here, because we believe Iceberg is the table format that is the most easy to use and evolve). When creating a new table or replacing an existing one, the operation is performed atomically. This means your data remains consistent, and users can query the old or new table version seamlessly.
Creating a new table is simple:
CREATE TABLE my_table AS SELECT 'test' a, DOUBLE '38.5' b;
Now let’s overwrite this table with new data and a totally different table definition:
CREATE OR REPLACE TABLE my_table AS SELECT BIGINT '-42' a, DOUBLE '38.5' c;
The last statement will atomically replace the table definition with the new table definition without impacting reads on the first version.
We can now check the versions (snapshots in Iceberg lingo) of this table by executing the following query:
SELECT * FROM "my_table$snapshots" ORDER BY committed_at DESC
Let’s use the snapshot id from the first version to retrieve the data as of that time (SELECT 'test' a, DOUBLE '38.5' b
).
SELECT * FROM my_table FOR VERSION AS OF <snapshot_id>
If you like this old version better than the current one, you can easily roll back to the earlier version by executing the following statement, which will atomically replace the current state with the old state:
CALL <catalog_name>.system.rollback_to_snapshot('<schema_name>', 'my_table', <snapshot_id>)
Voila, now the data in my_table is rolled back to the desired state.
One tool which especially benefits from implementing CREATE [OR REPLACE] TABLE
is the Trino dbt adapter – dbt-trino.
During the dbt run command, SQL queries representing dbt models are executed against Trino. If a table representing a certain dbt model already exists, particular actions are taken depending on materialization type and model configuration.
To take advantage of CREATE [OR REPLACE] TABLE
in Trino, we added a new mode to table materialization in dbt-trino 1.7.1 – replace.
Until now, there were two modes available in table materialization for handling existing tables:
In case of error during CREATE TABLE
operation we still have the original table. Any reads on the older table will fail when the table is dropped. There is a short period of time when the table is not available. Table history is also lost, as we are creating a new table.
Table is unavailable until CREATE TABLE
finishes. This is unsafe operation, as after dropping a table we could encounter an error during table re-creation, and end up with a missing data.
`replace` is superior to both of the old modes, as it safely replaces a table ‘in place’ without additional overhead. The table is always ready to read, without downtime, and we retain its history.
Documentation about `on_table_exists` modes is available here
To better visualize the time when a table is available vs. unavailable, let’s look at the chart below:
Fig. 1 Availability time by on_table_exist mode, for model defined as select * from tpch.sf10.orders. For `replace`, the table is available during the entire operation. For `drop`, the table is unavailable during the entire operation. For `rename`, there is a very short time period during table rename that the table is unavailable. When data is deleted from the dropped table, it will also impact any ongoing reads on that table.
Let’s see the difference between `rename` (default) and `replace` (recommended if connector supports it) mode.
First, let’s see how it works with the default mode, ‘rename’
{{
config(
materialized = 'table',
on_table_exists = 'rename'
)
}}
SELECT * FROM {{ ref('some_sample_seed') }}
Next, let’s see which queries are executed by running `dbt run` after the table has been created:
-- creates new table with __dbt_tmp suffix
create table "iceberg"."default"."cortas_example__dbt_tmp"
...
-- renames old table - adds __dbt_backup suffix
alter table "iceberg"."default"."cortas_example" rename to "iceberg"."default"."cortas_example__dbt_backup"
-- renames new table - removes suffix
alter table "iceberg"."default"."cortas_example__dbt_tmp" rename to "iceberg"."default"."cortas_example"
-- drops old table
drop table if exists "iceberg"."default"."cortas_example__dbt_backup"
Now change `on_table_exists` config to ‘replace’:
{{
config(
materialized = 'table',
on_table_exists = 'replace'
)
}}
SELECT * FROM {{ ref('some_sample_seed') }}
and execute `dbt run` once again. As we can see, only one CREATE [OR REPLACE] TABLE query was executed instead of 4 queries:
-- one CREATE [OR REPLACE] TABLE query
create or replace table "iceberg"."default"."cortas_example"
If you would like to learn more about dbt-trino, there is a blog titled ‘Building a Lakehouse with dbt and Trino‘ authored by Michiel De Smet and Przemek Denkiewicz. It is a great resource for those looking to familiarize themselves with dbt-trino, providing step-by-step examples that you can try on your own. We highly recommend it if you are new to dbt-trino or haven’t yet explored this blog!
Explore the future of data lakes through our comprehensive technical how-to whitepaper.
This site uses cookies for performance, analytics, personalization and advertising purposes. For more information about how we use cookies please see our Cookie Policy.
These cookies are essential in order to enable you to move around the website and use its features, such as accessing secure areas of the website.
These are analytics cookies that allow us to collect information about how visitors use a website, for instance which pages visitors go to most often, and if they get error messages from web pages.
These cookies allow our website to properly function and in particular will allow you to use its more personal features.
These cookies are used by third parties to build a profile of your interests and show you relevant adverts on other sites.