Last Updated: 2024-09-24
Delta Lake is an open-source table format that brings reliability, performance, and scalability to data lakes. It was originally developed by Databricks and is built on top of Apache Spark, integrating with data lakes on cloud storage systems like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage. Delta Lake is popular for implementing a lakehouse architecture, which combines the scalability and flexibility of data lakes with the reliability and performance of traditional data warehouses.
In this tutorial, you will explore the concept of table formats by learning about Delta Lake. You will create Delta Lake tables, modify them, and observe the changes recorded in the corresponding Delta Log, which tracks all additions, deletions, and updates made to a Delta Lake table. You will end by investigating how Delta Log files are stored and what kind of information you can gather by looking inside them.
Once you've completed this tutorial, you will be able to:
- Create and populate Delta Lake tables in Starburst Galaxy
- Explain how the Delta Log records additions, updates, and deletions
- Inspect Delta Log files to see how changes to a table are stored
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
You're going to begin by signing in to Starburst Galaxy and setting your role.
This is a quick step, but an important one.
Sign in to Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.
Your current role is listed in the top right-hand corner of the screen.
In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.
You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.
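For example, one way to do this is with a USE statement, shown below as a minimal sketch that assumes the catalog and schema names from the prerequisite tutorial. You can also set them directly in the Query editor's interface.

USE tmp_cat.tmp_first_last_postalcode;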
Make sure your session is set to the aws-us-east-1 cluster, the tmp_cat catalog, and the tmp_first_last_postalcode schema. Now you're going to use DDL, or data definition language, to create a new table called my_delta_tbl.
CREATE TABLE my_delta_tbl (
id integer,
name varchar(55),
description varchar(255)
) WITH (
TYPE = 'delta'
);
Notice the TYPE specified in the WITH clause. If you had set your default table format to Delta Lake during catalog creation, you could safely omit the TYPE property. Your SHOW CREATE TABLE output should look similar to the second image below. Now it's time to verify that you have successfully created the table.
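One quick way to verify from the Query editor is to display the table definition; the WITH clause in the output should include the Delta table type. A minimal check:

SHOW CREATE TABLE my_delta_tbl;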
You do not have access to the S3 object store, but if you did, you would see that a folder was created for your table. Inside this folder you would see a _delta_log subdirectory. The screenshots below provide a visual example of this structure.

The first screenshot shows the table's folder with its _delta_log subdirectory. Inside the _delta_log folder is a single JSON file.

This step describes some of the key elements of the metadata file associated with the creation of your Delta Lake table. The primary goals are to give you an awareness of the complexity of Delta Lake and to present you with a starting point for additional research beyond this exercise. You are not expected to fully understand the exact mechanisms used by Delta Lake from just this exercise.
The text highlighted in green indicates that a new table was created. The text highlighted in purple shows the schema for the table.
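The schema recorded in that metadata entry should match what the query engine reports for the table. A simple way to compare the two:

DESCRIBE my_delta_tbl;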
The Delta Log is Delta Lake's key mechanism for keeping track of changes to a table. The _delta_log folder contains a JSON file for each transactional commit (also known as a version). Each commit file records entries such as:
- commitInfo, describing the operation that produced the commit
- add, listing data files that become part of the table
- remove, listing data files that are no longer part of the current version
- metaData and protocol, describing the table's schema and capabilities
When a query is executed against a Delta Lake table, the system reads the Delta Log to determine the current state of the data. The transaction logs identify which files are needed for each version, which is useful for time-travel.
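For example, assuming your Starburst Galaxy cluster supports time travel syntax on Delta Lake tables, you could read an earlier version of the table with a query like the sketch below. Version numbers start at 0 with the table's creation.

SELECT * FROM my_delta_tbl FOR VERSION AS OF 0;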
Data files are located in the table's base folder (or in partition folders, if applicable). Not all of the files present are part of the current version of the table.
Let's take a look at this in practice.
You'll begin by adding some records to the Delta Lake table so that you can observe the corresponding additions to the Delta log.
INSERT INTO my_delta_tbl (id, name, description)
VALUES
  (1, 'one', 'added via 1st INSERT statement'),
  (2, 'two', 'added via 1st INSERT statement'),
  (3, 'three', 'added via 1st INSERT statement');
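To confirm that the rows were written, run a quick query:

SELECT * FROM my_delta_tbl ORDER BY id;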
The changes made to the table have updated the Delta Log. These changes are saved in a JSON file in your Delta Log directory.
In S3, the table's base folder now contains an actual data file alongside the _delta_log subdirectory. The file will look similar to the one pictured below and contains the three rows you just inserted.
Inside _delta_log, there is now a second JSON file:
The additional log file contains a record of the changes to the table, saved in JSON.
The information inside the Delta Log file can be useful. For instance, the portion of this log file highlighted in orange indicates that new data was written to the table. The portion highlighted in blue calls out the name of the file where the three new records are stored. The number of new records is highlighted in green.
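You can also surface some of this commit metadata with SQL. Trino's Delta Lake connector exposes a hidden $history metadata table for each Delta table; a sketch, assuming the feature is available in your Galaxy environment (note the quotes around the name):

SELECT * FROM "my_delta_tbl$history";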
Test your Delta Log by adding a few more records using the code below.
Note: these are all different values from the ones in the prior INSERT statement.
INSERT INTO my_delta_tbl (id, name, description)
VALUES
  (4, 'four', 'added via 2nd INSERT statement'),
  (5, 'five', 'added via 2nd INSERT statement'),
  (6, 'six', 'added via 2nd INSERT statement');
SELECT * FROM my_delta_tbl ORDER BY id;
The addition of the new records caused a new Delta Log file to be created in S3. A new data file with the new records was also created. The Delta Log is metadata used to track changes in the table itself. The following images show the new data file and new log file that were added to S3:
The information held inside Delta Log files is used to track all updates made to the table. Although some of the information inside these files is beyond the scope of this tutorial, the ways that this metadata is handled are important to bear in mind.
The Delta Log records updates in a very particular way. Updating an individual row does not modify that row in place on a record-for-record basis. Instead, Delta Lake records a series of events in the Delta Log which collectively result in an update. These include:
- flagging the data files that contain the affected rows for removal
- adding new data files that contain the updated rows, along with any unchanged rows from the removed files
- recording a commit entry that describes the operation
In this way, Delta Lake achieves update functionality without having to perform a traditional in-place update.
Let's review how this works in practice.
To see how Delta Lake enacts updates, you'll update the rows in your table that have an even id value. This should be 3 out of the 6 rows.
UPDATE my_delta_tbl
SET description = 'UPPER-CASED name col via 1st UPDATE statement',
    name = UPPER(name)
WHERE (id % 2) = 0;
SELECT * FROM my_delta_tbl ORDER BY id;
As a result of the update you just made, one new log file and three new data files were created in the S3 bucket. Review the following breakdown of the contents of the log file to understand what is going on under the hood.
Line A of this log file declares that a MERGE is being made (indicated by the blue highlight). This tells Delta Lake that it should begin the update process.

Lines B and C identify the change that will take place. Specifically, the two files created through the earlier INSERT statements are flagged for removal (indicated by the red highlight). You can see information about the files, as well as the timestamp associated with them.

Lastly, lines D, E, and F identify the replacement files that will be added for those being removed (indicated by the green highlight). These new files reflect the changes made to the table, and their addition completes the update process.
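Note that the files flagged for removal are not physically deleted; they remain in S3 so that earlier versions of the table can still be read. If you eventually want to reclaim that storage, Trino's Delta Lake connector provides a vacuum procedure. The sketch below assumes the tmp_cat catalog and schema from this tutorial and a seven-day retention period; check your environment's minimum retention before running it.

CALL tmp_cat.system.vacuum('tmp_first_last_postalcode', 'my_delta_tbl', '7d');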
As you've seen in the previous steps, Delta Lake maintains a log of changes, or deltas, on the underlying data lake files. This is the process that gives Delta Lake its name!
As one more illustration of how every change becomes a new commit, insert a copy of every existing row back into the table:

INSERT INTO my_delta_tbl
SELECT * FROM my_delta_tbl;
With Delta Lake, deletions are handled very much like updates. Suppose you want to delete all records with odd id values.
DELETE FROM my_delta_tbl
WHERE (id % 2) = 1;
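As before, you can confirm the result with a quick query; only rows with even id values should remain:

SELECT * FROM my_delta_tbl ORDER BY id;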
The DELETE operation caused one new data file to be created in the S3 bucket. As with the UPDATE operation, the first line of the new log file shows a MERGE to begin the update process. The middle lines flag the files affected by the DELETE for removal, while the final line identifies the replacement file that will be added to reflect the most recent version of the table.

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You should now have a better understanding of how to create and populate Delta Lake tables as well as the basic layout of the Delta Log.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will build on the concepts introduced here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!