Last Updated: 2024-09-04
As the first data lake table format, Apache Hive has long been a popular choice among data engineers. For this reason, it is a good option for those wanting to learn the basics of data lake tables.
Starburst Galaxy supports the Hive table format and can be used to easily create and manage Hive tables as you'll see in this tutorial.
This tutorial will show you how to perform basic database operations in Starburst Galaxy using the Hive table format. This includes creating tables using a number of different methods. The tutorial also introduces inserts, record updates, and external tables, and discusses the limitations of Hive.
Once you've completed this tutorial, you will be able to:
- Create Hive tables using CTAS and CREATE TABLE statements
- Insert records and observe how they are stored as files in the data lake
- Describe the limitations of Hive tables for updates and deletes
- Create and drop external tables
You'll begin your data lake table exploration by using CTAS to create a table. CTAS, or CREATE TABLE AS, creates a table that is populated with the results of a SELECT query. When you use this syntax, you are creating a managed table, with the option to define the location of the table.
In this section, you will create the table within the catalog and schema you created in the prerequisite tutorial.
When you use the query editor in Starburst Galaxy, you have the option to set the cluster, catalog, and schema for your session. This lets you avoid typing out the fully-qualified table name every time you write a query. For this tutorial, use the following settings:

Cluster: aws-us-east-1-free
Catalog: tmp_cat
Schema: tmp_firstname_lastname_postalcode
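If you prefer SQL over the drop-down menus, Trino's USE statement sets the catalog and schema for your session. A minimal sketch, assuming the catalog and schema names above:

```sql
-- Set the default catalog and schema for this session
USE tmp_cat.tmp_firstname_lastname_postalcode;

-- Unqualified table names now resolve against that schema
SHOW TABLES;
```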
Next, it's time to create a new table following the format and contents of the tpch.sf1.nation table.

To do this, start by running a query against the tpch.sf1.nation table to verify there are 25 rows present. This helps you ensure that you understand what information is held in the table before you recreate the same information.

Run the following query against the tpch.sf1.nation table:

SELECT * FROM tpch.sf1.nation;

Next, run the following CTAS statement to create a copy of the tpch.sf1.nation table:

CREATE TABLE nation AS
SELECT *
FROM tpch.sf1.nation;
There are two ways to verify that your new table exists and has the correct structure and content. Let's take a look at both options.
First, run a DESCRIBE statement to verify the table's structure:

DESCRIBE nation;

Second, use the EXCEPT command. Rather than visually comparing the results of SELECT queries on the original table and your nation table, run the EXCEPT query listed below. If it returns no rows, every row in your nation table also exists in the original table.

SELECT * FROM nation
EXCEPT
SELECT * FROM tpch.sf1.nation;
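Note that EXCEPT only checks one direction; a row present in the original table but missing from your copy would not show up. A quick sketch of the reverse check:

```sql
-- Rows in the original table that are missing from the copy;
-- this should also return no rows.
SELECT * FROM tpch.sf1.nation
EXCEPT
SELECT * FROM nation;
```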
When you create a table, you need to define the data structure that will be employed. This is known as the Data Definition Language (DDL). Starburst helps you automate this process.
Find the nation table in the list on the left of the query editor window. Hover over the nation table and click SHOW CREATE TABLE. This will copy the SQL to the query editor window.

After running the query, you should receive an output similar to the image below. There are two focus areas, indicated by the blue and red highlights.
The first focus area, noted in blue, defines the logical representation of the table. It provides information about the schema that will be applied to the table and is similar to the approach used to construct a table in other database systems.
The second focus area, noted in red, defines two other crucial aspects of schema-on-read solutions: the file format used for the underlying data and the location where those files are stored.
Notice that you did not define a specific folder location in the CREATE TABLE statement in the last step. Although a specific location can be defined if needed, it can also be left undefined. When left to the default, it is based on where the schema itself is rooted in the data lake hierarchy.
Recall that when you created the tmp_cat catalog in the prerequisite tutorial, you provided the following details for the Metastore configuration:

Default S3 bucket name: starburst-tutorials
Default directory name: projects

By default, each schema is stored in a folder named after the schema, below the default directory name. In this case, the default directory name is projects, and the bucket name is starburst-tutorials. The table's data is then stored in another folder directly below the schema's folder.
This means your table's base directory is projects/tmp_firstname_lastname_postalcode/nation. If you had access to AWS S3, you would see a directory tree similar to the following:
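Since the bucket cannot be browsed here, a rough sketch of that directory tree follows. The data file name is a placeholder; actual file names are generated by the engine.

```
starburst-tutorials/
└── projects/
    └── tmp_firstname_lastname_postalcode/
        └── nation/
            └── <generated data file>
```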
The nation folder contains data files. In this case, you only have a single file because the table only has 25 rows.
Data Definition Language (DDL) and Data Manipulation Language (DML) are foundational to relational databases. However, they can also be used to create and modify objects in data lakes. Because the structure of a data lake differs from that of a database, there are some key differences in how DDL and DML are employed. In this section, you'll observe some of those differences, beginning with the creation of a new table.
Let's begin by creating a new, managed table named friends to hold information about your friends.
CREATE TABLE friends (
name varchar(100),
year_met smallint,
fav_color varchar(50),
nation_born bigint
);
Notice that this statement does not include a WITH section. The location on the file system will default to something similar to //filesystem/catalog/schema/table. The folder will not be created until the table is populated with data. If you do not specify a file format in your CREATE TABLE statement, the default format will be used. In this case, the default format is ORC. You can use a SHOW CREATE TABLE statement to verify this.
SHOW CREATE TABLE friends;
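The output should include a WITH clause showing the format. A sketch of what to expect; the fully-qualified name reflects the catalog and schema from this tutorial, and your own output may include additional properties:

```sql
CREATE TABLE tmp_cat.tmp_firstname_lastname_postalcode.friends (
   name varchar(100),
   year_met smallint,
   fav_color varchar(50),
   nation_born bigint
)
WITH (
   format = 'ORC'
)
```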
Let's add one record to the friends table for now. Later we'll add more.
INSERT INTO friends (name, year_met, fav_color, nation_born)
VALUES
('Fred', 2002, 'brown', 13);
SELECT * FROM friends;
As before, you do not have access to the S3 bucket, but we are providing screenshots of what you would see if you did have access.
Now it's time to insert three records into the friends table at once. Afterward, we'll once again provide a screenshot of the S3 bucket for your reference.
Run the following statement to insert three records into the friends table:

INSERT INTO friends (name, year_met, fav_color, nation_born)
VALUES
('Susie', 2012, 'blue', 9),
('Bertha', 1999, 'pink', 24),
('Franklin', 2019, 'green', 24);
The bucket now contains two files: one from the first INSERT statement and the other from the second INSERT. The three recently inserted records are all located in the second file.

Let's see what happens if we try to run the first INSERT statement again. Will it create a duplicate record?
INSERT INTO friends (name, year_met, fav_color, nation_born)
VALUES
('Fred', 2002, 'brown', 13);
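After re-running the INSERT, you can check for duplicates with a quick aggregation. A sketch using the columns of the friends table:

```sql
-- Any row appearing more than once is a duplicate
SELECT name, year_met, fav_color, nation_born, count(*) AS copies
FROM friends
GROUP BY name, year_met, fav_color, nation_born
HAVING count(*) > 1;
```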
Query the friends table again. What do you see? The image below shows that one of your records has been duplicated. This happened because you did not declare a primary key (PK). It highlights a key point: when working with big data schema-on-read tables such as this, you will not have the chance to create (or validate) a primary key. Instead, data integrity activities like this are managed by the process that populates the table. This will be covered in greater detail in a future data pipelines tutorial.

Oftentimes, it is necessary to update records in a table. Let's try to do this with the friends table.
UPDATE friends
SET fav_color = 'purple'
WHERE name = 'Susie';
What about running a DELETE on this same row? Try it and see.

DELETE FROM friends
WHERE name = 'Susie';
We will not address partitions in this tutorial, but Hive tables are generally insert-only in nature. There are, however, alternate table formats that allow full INSERT, UPDATE, and DELETE functionality, as well as usage of the MERGE command.
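To illustrate what those other formats allow, here is a sketch of a MERGE statement. It assumes a hypothetical friends_v2 table created in a format that supports MERGE (for example, Apache Iceberg); it would fail against the Hive friends table.

```sql
-- Hypothetical: friends_v2 must use a table format that supports MERGE
MERGE INTO friends_v2 AS t
USING (
    VALUES ('Fred', 2002, 'brown', 13)
) AS s(name, year_met, fav_color, nation_born)
ON t.name = s.name
WHEN MATCHED THEN
    UPDATE SET fav_color = s.fav_color
WHEN NOT MATCHED THEN
    INSERT (name, year_met, fav_color, nation_born)
    VALUES (s.name, s.year_met, s.fav_color, s.nation_born);
```

Because the ON condition matches on name, re-running this statement updates the existing row instead of creating a duplicate.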
Additional efforts must be employed with Hive tables to recreate the underlying files from which the table gets queried. Some of these approaches will be covered in a future data pipelines tutorial.
We've completed our work with the friends table, so you can safely drop it.
DROP TABLE friends;
Up to this point, you have been creating Hive tables that are sometimes called managed tables. Generally speaking, you let the system identify where the underlying location will be, as the earlier example explained. While each table is configured to a specific directory and has a known file type, you do not typically place files on the filesystem directly. The Trino engine does the heavy lifting for you when you run a command such as INSERT.
Additionally, when a DROP TABLE command is issued on a managed table, the underlying data is deleted, along with the table's metadata.

External tables are the other primary type of Hive table. External tables are often located in a different data lake location than where Trino would typically store them. You can identify these tables by the presence of the external_location property in the DDL.
In preparation for this tutorial, we've placed several CSV files in our S3 bucket. You are going to create an external table over those files.
Run the following statement, which includes the external_location property telling Starburst where to look for the files:

CREATE TABLE external_logs (
    event_time TIMESTAMP,
    ip_address VARCHAR(15),
    app_name VARCHAR(25),
    process_id SMALLINT,
    log_type VARCHAR(15),
    log_level VARCHAR(15),
    message_id VARCHAR(15),
    message_details VARCHAR(555)
)
WITH (
    external_location = 's3://starburst-tutorials/serverlogs/logs_5min_ingest_csv/',
    format = 'TEXTFILE',
    textfile_field_separator = ','
);
Verify that the results of a SELECT statement against this table look appropriate:

SELECT * FROM external_logs;
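Because the table sits directly over the CSV files, you can run any SQL against it. A sketch that summarizes the logs by level; the actual log_level values in the files are not shown here, so no expected output is given:

```sql
-- Count log events by level, most frequent first
SELECT log_level, count(*) AS events
FROM external_logs
GROUP BY log_level
ORDER BY events DESC;
```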
External tables are often utilized when another system is creating the files. For example, a frequent scenario is an external table being used for recently ingested files that are being loaded into the data lake. In addition to allowing other systems to populate data into the table, these data lake files are easily consumable from other frameworks and systems.
In the ingestion model, you often see similar-sized files placed in the table's underlying directory periodically. Starburst does not care what these files are named, but depending on the system saving the files, you may get some heuristic information from the file names. You might be able to decipher the pattern from these example files being used in the table you just created.
Remember, you do not have access to S3 directly and these screenshots are presented to aid in your understanding.
The following screenshot shows the underlying files of the external_logs table:

Recall that when you drop an external table, only the table metadata is deleted; the underlying files remain. Let's test that now.
Drop the external_logs table:

DROP TABLE external_logs;
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You should now have a better understanding of the basics of Hive tables, including how to create managed and external tables. You should also be aware of some of the limitations when working with Hive tables.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!