Last Updated: 2024-09-24

Background

Delta Lake is an open-source table format that brings reliability, performance, and scalability to data lakes. It was originally developed by Databricks and is built on top of Apache Spark, integrating with data lakes on cloud storage systems like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage. Delta Lake is popular for implementing a lakehouse architecture, which combines the scalability and flexibility of data lakes with the reliability and performance of traditional data warehouses.

Scope of tutorial

In this tutorial, you will explore the concept of table formats by learning about Delta Lake. You will create Delta Lake tables, modify them, and observe the changes recorded in the corresponding Delta Log, which tracks all additions, deletions, and updates made to a Delta Lake table. You will end by investigating how Delta Log files are stored and what kind of information you can gather by looking inside them.

Learning objectives

Once you've completed this tutorial, you will be able to:

  • Explain how Delta Lake implements a table format on a data lake.
  • Create and modify Delta Lake tables in Starburst Galaxy.
  • Describe how the Delta Log records additions, deletions, and updates.
  • Identify the kind of information stored inside Delta Log files.

Prerequisites

  • A Starburst Galaxy account.
  • The catalog and schema created in the prerequisite tutorial.

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

Background

You're going to begin by signing in to Starburst Galaxy and setting your role.

This is a quick step, but an important one.

Step 1: Sign in to Starburst Galaxy

Sign in to Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.

Step 2: Set your role

Your current role is listed in the top right-hand corner of the screen. If it is not set to the role you used in the prerequisite tutorial, use the drop-down menu to change it.

Background

In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.

Step 1: Set catalog and schema

You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.
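For example, assuming the catalog and schema from the prerequisite tutorial are named my_catalog and my_schema (substitute your own names), you could run:

USE my_catalog.my_schema;

From this point on, an unqualified name like my_delta_tbl resolves to my_catalog.my_schema.my_delta_tbl.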

Step 2: Create Delta Lake table

Now you're going to use DDL, or data definition language, to create a new table called my_delta_tbl. The TYPE = 'delta' property in the WITH clause tells Starburst Galaxy to create the table in the Delta Lake format.

CREATE TABLE my_delta_tbl (
   id integer,
   name varchar(55),
   description varchar(255)
) WITH (
   TYPE = 'delta'
);

Step 3: Verify creation of Delta Log

Now it's time to verify that you have successfully created the table.

You do not have access to the S3 object store, but if you did you would see that a folder was created for your table. Inside this folder you would see a _delta_log subdirectory. The screenshots below provide a visual example of this structure.
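Even without access to the object store, you can confirm the table was created from the Query editor. One way is to list it and display its definition, which includes the Delta Lake type property:

SHOW TABLES LIKE 'my_delta_tbl';

SHOW CREATE TABLE my_delta_tbl;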

Step 4: Review the contents of the JSON file

This step describes some of the key elements of the metadata file associated with the creation of your Delta Lake table. The primary goals are to give you an awareness of the complexity of Delta Lake as well as present you with a starting point for additional research beyond this exercise. You are not expected to fully understand the exact mechanisms used by Delta Lake from just this exercise.

The text highlighted in green indicates that a new table was created. The text highlighted in purple shows the schema for the table.

Background

The Delta Log is Delta Lake's key mechanism for keeping track of changes to a table. A _delta_log folder contains files for each transactional commit, also called a version, including:

  • A JSON file, named after the commit version (for example, 00000000000000000000.json), that records the actions making up that commit.
  • Periodic checkpoint files that consolidate the table state so readers do not have to replay every JSON file.

When a query is executed against a Delta Lake table, the system reads the Delta Log to determine the current state of the data. The transaction logs identify which files belong to each version, which is what makes time travel possible.

Data files are located in the table's base folder (or partition folders, if the table is partitioned). Not all of the files present are used by the current version of the table; older files are retained so that earlier versions remain readable.
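For example, once the table has accumulated a few commits, you can query an earlier version of it. The exact syntax depends on your engine and connector version, but in Trino-based engines such as Starburst Galaxy a time-travel query looks something like this:

SELECT *
FROM my_delta_tbl FOR VERSION AS OF 0;

Version numbers correspond to the commit files in the Delta Log, so version 0 is the state of the table immediately after CREATE TABLE (that is, empty).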

Let's take a look at this in practice.

Step 1: Modify table

You'll begin by adding some records to the Delta Lake table so that you can observe the corresponding additions to the Delta log.

INSERT INTO
 my_delta_tbl (id, name, description)
VALUES
 (1, 'one', 'added via 1st INSERT statement'),
 (2, 'two', 'added via 1st INSERT statement'),
 (3, 'three', 'added via 1st INSERT statement');

Step 2: Observe Delta Log changes

The changes made to the table have updated the Delta Log. These changes are saved in a JSON file in your Delta Log directory.

In S3, the table's base folder now contains an actual data file alongside the _delta_log subdirectory. The file will look similar to the one pictured below. This file contains the three rows just inserted.

Inside _delta_log, there is now a second JSON file:

Step 3: Inspect Delta Log file

The additional log file contains a record of the changes to the table, saved in JSON.

The information inside the Delta Log file can be useful. For instance, the portion of this log file highlighted in orange indicates that new data was written to the table. The portion highlighted in blue calls out the name of the file where the three new records are stored. The number of new records is highlighted in green.
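You can also see this commit history without browsing S3. Trino-based engines expose a $history metadata table for Delta Lake tables; assuming it is available in your environment, a query like this summarizes each commit:

SELECT version, timestamp, operation
FROM "my_delta_tbl$history"
ORDER BY version;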

Step 4: Add three additional records

Test your Delta Log by adding a few more records using the code below.

Note: These are all different values from the prior INSERT statement.

INSERT INTO
 my_delta_tbl (id, name, description)
VALUES
 (4, 'four', 'added via 2nd INSERT statement'),
 (5, 'five', 'added via 2nd INSERT statement'),
 (6, 'six', 'added via 2nd INSERT statement');

Step 5: Verify all rows are present

Run the following query. It should return six rows, with id values 1 through 6.

SELECT * FROM my_delta_tbl ORDER BY id;

The addition of the new records caused a new Delta Log file to be created in S3. A new data file with the new records was also created. The Delta Log is metadata used to track changes in the table itself. The following images show the new data file and new log file that were added to S3:

Background

The information held inside Delta Log files is used to track all updates made to the table. Although some of the information inside these files is beyond the scope of this tutorial, the ways that this metadata is handled are important to bear in mind.

The Delta Log records updates in a very particular way. Updating an individual row does not modify that row in place on a record-for-record basis. Instead, the Delta Log records a series of events which collectively result in an update. These include:

  • A remove action, which flags the data files containing the affected rows as no longer part of the current table version.
  • An add action, which registers new data files containing the updated rows along with any unchanged rows from the removed files.

In this way, the Delta Log achieves update functionality without performing a traditional in-place update.
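As a simplified sketch (real log entries contain more fields, and the file names will differ in your environment), the JSON actions recorded for an update look something like this:

{"remove": {"path": "part-00000-aaaa.parquet", "deletionTimestamp": 1695570000000, "dataChange": true}}
{"add": {"path": "part-00000-bbbb.parquet", "size": 1024, "modificationTime": 1695570500000, "dataChange": true}}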

Let's review how this works in practice.

Step 1: Update rows with even id values

To see how Delta Lake enacts updates, you'll update the rows in your table that have an even id value. This should be 3 of the 6 rows.

UPDATE
 my_delta_tbl
SET
 description = 'UPPER-CASED name col via 1st UPDATE statement',
 name = UPPER(name)
WHERE
 (id % 2) = 0;
SELECT * FROM my_delta_tbl ORDER BY id;

Step 2: Review the associated log file

As a result of the update you just made, one new log file and three new data files were created in the S3 bucket. Review the following breakdown of the contents of the log file to understand what is going on under the hood.

In line A of this log file, there is a declaration of a MERGE being made (indicated by the blue highlight). This tells Delta Lake that it should begin the update process.

Lines B and C identify the change that will take place. Specifically, the two files created through INSERT commands earlier are flagged for removal (indicated by the red highlight). You can see information about the files, as well as the timestamp associated with them.

Lastly, lines D, E, and F identify the replacement files that will be added in place of those being removed (indicated by the green highlight). These new files reflect the changes made to the table, and their addition completes the update process.
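Because the removed files are still present in S3, the pre-update state remains queryable. Assuming the commits so far are version 0 (CREATE TABLE), versions 1 and 2 (the two INSERT statements), and version 3 (the UPDATE), you can compare the two states:

SELECT * FROM my_delta_tbl FOR VERSION AS OF 2 ORDER BY id;

SELECT * FROM my_delta_tbl ORDER BY id;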

Step 3: Check your understanding of deltas

As you've seen in the previous steps, Delta Lake maintains a log of changes, or deltas, on the underlying data lake files. This is the process that gives Delta Lake its name! To check your understanding, predict what the statement below will do to the table and to the Delta Log before running it. It inserts a copy of every existing row back into the table.

INSERT INTO my_delta_tbl
 SELECT * FROM my_delta_tbl;
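One way to confirm what happened is a simple count. The table held 6 rows before this statement, so it should now hold 12:

SELECT count(*) FROM my_delta_tbl;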

Step 4: Understand how Delta Log tracks deletions

With Delta Lake, deletions are handled in much the same way as updates. Suppose you want to delete all records with odd id values. Running the statement below produces a new log entry that flags the affected data files for removal and adds replacement files containing only the surviving rows.

DELETE FROM
 my_delta_tbl
WHERE
 (id % 2) = 1;
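A quick query confirms that only the rows with even id values remain:

SELECT * FROM my_delta_tbl ORDER BY id;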

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You should now have a better understanding of how to create and populate Delta Lake tables as well as the basic layout of the Delta Log.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
