An introduction and integration: Delta Lake in Starburst Galaxy

  • Lester Martin

    Lester Martin

    Educational Engineer

    Starburst

Share

Delta Lake was initially developed by Databricks and by 2019 evolved to become an open source project. Since then, they’ve created a few key features.

Delta Lake allows classical Data Manipulation Language (DML) statements (INSERT, UPDATE, DELETE, and MERGE) and manages the commit of those operations as an ACID-compliant transaction. It does this by maintaining a transaction log of changes (also known as the DeltaLog) which itself is the enabler for performing time-travel querying since every change creates a new version of the table.

Delta Lake integrates nicely with Trino via the Delta Lake Connector. Of course, what’s the fastest way to get going with Trino? Since you asked… it is Starburst Galaxy

Not only is it already running in the cloud, but you can get started for free!!

You can then leverage its Great Lakes connectivity to use one of the popular public cloud object stores. For a little help with that, check out my querying aviation data in the cloud (leveraging starburst galaxy) post that sets up an Amazon S3 catalog.

Here I am ready to go now! How about you?

As you might be able to see above (if you have stronger glasses than me!), I ran a couple of statements to create a schema to play in and then to root myself to that schema.

-- my catalog is named 'mycloud', so update accordingly
CREATE SCHEMA mycloud.dlblog1;
USE mycloud.dlblog1;

Let’s create a new table. Notice the type property in the WITH clause — Great Lakes connectivity allow multiple table formats to be created and used. I’m just telling it to leverage Delta Lake.

CREATE TABLE dune_characters (
  id integer,
  name varchar(55),
  notes varchar(255)
)
WITH (type = 'delta');

Now, add a few records.

INSERT INTO dune_characters
  (id, name, notes)
VALUES
  (101, 'Leto', 'Ruler of House Atreides'),
  (102, 'Jessica', 'Concubine of the Duke');

Verify these records are present.

Ooof! I forgot the star of the book/movie!! Let’s get him inserted, too.

INSERT INTO dune_characters
  (id, name, notes)
VALUES
  (103, 'Paul', 'Son of Leto');

Instead of just querying the table again to verify it has 3 rows now, here is a peek into the S3 bucket where the table’s data is stored. You can see 2 datasets there (1 from the first INSERT and the 2nd from the next one).

And that _delta_log folder? Yep, it is the mythical DeltaLog we touched upon earlier. Let’s take a peek into it!

Basically, we have 3 different JSON files. One for each version. Yes, the CREATE TABLE created the 000 version. Here are a few relevant snippets from each of these files.

Table Version Relevant Snippets
...000.json "operation":"CREATE TABLE"
...001.json "operation":"WRITE"
\"numRecords\":2
...002.json "operation":"WRITE"
\"numRecords\":1

Sparked (pun definitely intended) your interest? Go back up and watch that hour-long video for more. I promise it can be fun if your the science fair kind of person that enjoys seeing the gory details.

For the rest of us, this is plenty for now!

Blog post originally published, here.