Last Updated: 2024-09-24
Delta Lake is an open-source table format that brings reliability, performance, and scalability to data lakes. It was originally developed by Databricks and is built on top of Apache Spark, integrating with data lakes on cloud storage systems like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage. Delta Lake is popular for implementing a lakehouse architecture, which combines the scalability and flexibility of data lakes with the reliability and performance of traditional data warehouses.
In this tutorial, you will explore the concept of table formats by learning about Delta Lake. You will create Delta Lake tables, modify them, and observe the changes recorded in the corresponding Delta Log, which tracks all additions, deletions, and updates made to a Delta Lake table. You will end by investigating how Delta Log files are stored and what kind of information you can gather by looking inside them.
Once you've completed this tutorial, you will be able to:
- Create and populate Delta Lake tables in Starburst Galaxy
- Explain how the Delta Log records additions, updates, and deletions
- Inspect Delta Log files to see how changes to a table are stored
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
You're going to begin by signing in to Starburst Galaxy and setting your role.
This is a quick step, but an important one.
Sign in to Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.
Your current role is listed in the top right-hand corner of the screen.
In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.
You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.
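For example, one way to do this is with a USE statement, shown below as a minimal sketch that assumes the catalog and schema names from the prerequisite tutorial. You can also set them directly in the Query editor's interface.

USE tmp_cat.tmp_first_last_postalcode;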
Make sure your session is set to the aws-us-east-1 cluster, the tmp_cat catalog, and the tmp_first_last_postalcode schema. Now you're going to use DDL, or data definition language, to create a new table called my_delta_tbl.
CREATE TABLE my_delta_tbl (
id integer,
name varchar(55),
description varchar(255)
) WITH (
TYPE = 'delta'
);
Notice the TYPE specified in the WITH clause. If you had set your default table format to Delta Lake during catalog creation, you could safely omit the TYPE property. Your SHOW CREATE TABLE output should look similar to the second image below. Now it's time to verify that you have successfully created the table.
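One quick way to verify from the Query editor is to display the table definition; the WITH clause in the output should include the Delta table type. A minimal check:

SHOW CREATE TABLE my_delta_tbl;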
You do not have access to the S3 object store, but if you did, you would see that a folder was created for your table. Inside this folder you would see a _delta_log subdirectory. The screenshots below provide a visual example of this structure.

The first screenshot shows the table's folder with its _delta_log subdirectory. Inside the _delta_log folder is a single JSON file.

This step describes some of the key elements of the metadata file associated with the creation of your Delta Lake table. The primary goals are to give you an awareness of the complexity of Delta Lake and to present you with a starting point for additional research beyond this exercise. You are not expected to fully understand the exact mechanisms used by Delta Lake from just this exercise.
The text highlighted in green indicates that a new table was created. The text highlighted in purple shows the schema for the table.
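The schema recorded in that metadata entry should match what the query engine reports for the table. A simple way to compare the two:

DESCRIBE my_delta_tbl;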
The Delta Log is Delta Lake's key mechanism for keeping track of changes to a table. The _delta_log folder contains a JSON file for each transactional commit (also known as a version). Each commit file records entries such as:
- commitInfo, describing the operation that produced the commit
- add, listing data files that become part of the table
- remove, listing data files that are no longer part of the current version
- metaData and protocol, describing the table's schema and capabilities
When a query is executed against a Delta Lake table, the system reads the Delta Log to determine the current state of the data. The transaction logs identify which files are needed for each version, which is useful for time-travel.
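For example, assuming your Starburst Galaxy cluster supports time travel syntax on Delta Lake tables, you could read an earlier version of the table with a query like the sketch below. Version numbers start at 0 with the table's creation.

SELECT * FROM my_delta_tbl FOR VERSION AS OF 0;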
Data files are located in the table's base folder (or in partition folders, if applicable). Not all of the files present are part of the current version of the table.
Let's take a look at this in practice.
You'll begin by adding some records to the Delta Lake table so that you can observe the corresponding additions to the Delta log.
INSERT INTO my_delta_tbl (id, name, description)
VALUES
  (1, 'one', 'added via 1st INSERT statement'),
  (2, 'two', 'added via 1st INSERT statement'),
  (3, 'three', 'added via 1st INSERT statement');
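To confirm that the rows were written, run a quick query:

SELECT * FROM my_delta_tbl ORDER BY id;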
The changes made to the table have updated the Delta Log. These changes are saved in a JSON file in your Delta Log directory.
In S3, the table's base folder now contains an actual data file alongside the _delta_log subdirectory. The file will look similar to the one pictured below and contains the three rows you just inserted.
Inside _delta_log, there is now a second JSON file:
The additional log file contains a record of the changes to the table, saved in JSON.
The information inside the Delta Log file can be useful. For instance, the portion of this log file highlighted in orange indicates that new data was written to the table. The portion highlighted in blue calls out the name of the file where the three new records are stored. The number of new records is highlighted in green.
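You can also surface some of this commit metadata with SQL. Trino's Delta Lake connector exposes a hidden $history metadata table for each Delta table; a sketch, assuming the feature is available in your Galaxy environment (note the quotes around the name):

SELECT * FROM "my_delta_tbl$history";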
Test your Delta Log by adding a few more records using the code below.
Note: these are all different values from the ones in the prior INSERT statement.
INSERT INTO my_delta_tbl (id, name, description)
VALUES
  (4, 'four', 'added via 2nd INSERT statement'),
  (5, 'five', 'added via 2nd INSERT statement'),
  (6, 'six', 'added via 2nd INSERT statement');
SELECT * FROM my_delta_tbl ORDER BY id;
The addition of the new records caused a new Delta Log file to be created in S3. A new data file with the new records was also created. The Delta Log is metadata used to track changes in the table itself. The following images show the new data file and new log file that were added to S3:
The information held inside Delta Log files is used to track all updates made to the table. Although some of the information inside these files is beyond the scope of this tutorial, the ways that this metadata is handled are important to bear in mind.
The Delta Log records updates in a very particular way. Updating an individual row does not modify that row in place on a record-for-record basis. Instead, Delta Lake records a series of events in the Delta Log which collectively result in an update. These include:
- flagging the data files that contain the affected rows for removal
- adding new data files that contain the updated rows, along with any unchanged rows from the removed files
- recording a commit entry that describes the operation
In this way, Delta Lake achieves update functionality without having to perform a traditional in-place update.
Let's review how this works in practice.
To see how Delta Lake enacts updates, you'll update the rows in your table that have an even id value. This should be 3 out of the 6 rows.
UPDATE my_delta_tbl
SET description = 'UPPER-CASED name col via 1st UPDATE statement',
    name = UPPER(name)
WHERE (id % 2) = 0;
SELECT * FROM my_delta_tbl ORDER BY id;
As a result of the update you just made, one new log file and three new data files were created in the S3 bucket. Review the following breakdown of the contents of the log file to understand what is going on under the hood.
Line A of this log file declares that a MERGE is being made (indicated by the blue highlight). This tells Delta Lake that it should begin the update process.

Lines B and C identify the change that will take place. Specifically, the two files created through the earlier INSERT statements are flagged for removal (indicated by the red highlight). You can see information about the files, as well as the timestamp associated with them.

Lastly, lines D, E, and F identify the replacement files that will be added for those being removed (indicated by the green highlight). These new files reflect the changes made to the table, and their addition completes the update process.
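Note that the files flagged for removal are not physically deleted; they remain in S3 so that earlier versions of the table can still be read. If you eventually want to reclaim that storage, Trino's Delta Lake connector provides a vacuum procedure. The sketch below assumes the tmp_cat catalog and schema from this tutorial and a seven-day retention period; check your environment's minimum retention before running it.

CALL tmp_cat.system.vacuum('tmp_first_last_postalcode', 'my_delta_tbl', '7d');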
As you've seen in the previous steps, Delta Lake maintains a log of changes, or deltas, on the underlying data lake files. This is the process that gives Delta Lake its name!
As one more illustration of how every change becomes a new commit, insert a copy of every existing row back into the table:

INSERT INTO my_delta_tbl
SELECT * FROM my_delta_tbl;
With Delta Lake, deletions are handled very much like updates. Suppose you want to delete all records with odd id values.
DELETE FROM my_delta_tbl
WHERE (id % 2) = 1;
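As before, you can confirm the result with a quick query; only rows with even id values should remain:

SELECT * FROM my_delta_tbl ORDER BY id;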
The DELETE operation caused one new data file to be created in the S3 bucket. As with the UPDATE operation, the first line of the new log file shows a MERGE to begin the update process. The middle lines flag the files affected by the DELETE for removal, while the final line identifies the replacement file that will be added to reflect the most recent version of the table.

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You should now have a better understanding of how to create and populate Delta Lake tables as well as the basic layout of the Delta Log.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will build on the concepts introduced here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!