Last Updated: 2024-09-04

Background

The Hive table format was the first data lake table format, and it has long been a popular choice among data engineers. For this reason, it is a good option for anyone wanting to learn the basics of data lake tables.

Starburst Galaxy supports the Hive table format and makes it easy to create and manage Hive tables, as you'll see in this tutorial.

Scope of tutorial

This tutorial shows you how to perform basic database operations in Starburst Galaxy using Hive. It covers creating tables using several different methods, and it introduces inserts, record updates, and external tables. The limitations of Hive are also discussed.

Learning objectives

Once you've completed this tutorial, you will be able to:

  • Create a managed Hive table using CTAS and standard DDL
  • Insert records into a Hive table and describe its update and delete limitations
  • Explain where a table's underlying data files live in the data lake
  • Create and drop an external Hive table over existing files

Prerequisites

Before starting, you should have completed the prerequisite tutorial, in which you created the tmp_cat catalog and a schema to work in, and you should have access to a running cluster in Starburst Galaxy.

Background

You'll begin your data lake table exploration by using CTAS to create a table. CTAS, or CREATE TABLE AS, creates a table that is populated with the results of a SELECT query. When you use this syntax, you are creating a managed table, with the option to define the location of the table.

In this section, you will create the table within the catalog and schema you created in the prerequisite tutorial.

Step 1: Verify cluster, catalog, and schema

When you use the query editor in Starburst Galaxy, you have the option to set the cluster, catalog, and schema for your session. This lets you avoid typing out the fully-qualified table name every time you write a query.
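
If you prefer SQL to the dropdown menus, you can set the same session context with a USE statement (shown here with the example schema name used later in this tutorial):

USE tmp_cat.tmp_firstname_lastname_postalcode;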

Step 2: Create table

Next, it's time to create a new table following the format and contents of the tpch.sf1.nation table.

To do this, start by running a query against the tpch.sf1.nation table to verify that there are 25 rows present. This helps ensure you understand what information the table holds before you recreate it.

SELECT * FROM tpch.sf1.nation;

CREATE TABLE
 nation AS
SELECT
 *
FROM
 tpch.sf1.nation;

Step 3: Verify table structure

There are two ways to verify that your new table exists and has the correct structure and content. Let's take a look at both options.

DESCRIBE nation;
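
DESCRIBE returns the column names and data types. The second option (the original screenshots are not reproduced here, so this is a reasonable stand-in) is simply to query the table and inspect its contents:

SELECT * FROM nation;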

Step 4: Verify the data matches using the EXCEPT command

SELECT * FROM nation
EXCEPT
SELECT * FROM tpch.sf1.nation;
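
EXCEPT returns the rows from the first query that do not appear in the second, so an empty result confirms that the data in your new table matches the source. Note that this checks only one direction; as an extra check beyond the original steps, you can also run the comparison in reverse:

SELECT * FROM tpch.sf1.nation
EXCEPT
SELECT * FROM nation;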

Step 5: Generate Data Definition Language (DDL)

When you create a table, you need to define the data structure that will be employed. This is known as the Data Definition Language (DDL). Starburst helps you automate this process.
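
Run the following statement to generate the DDL for the nation table; this mirrors the SHOW CREATE TABLE statement used later in this tutorial:

SHOW CREATE TABLE nation;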

Step 6: Review DDL output

After running the query in the previous step, you should have received an output similar to the image below. There are two focus areas, indicated by the blue and red highlights.

The first focus area, noted in blue, defines the logical representation of the table. It provides information about the schema that will be applied to the table and is similar to the approach used to construct a table in other database systems.

The second focus area, noted in red, defines two other crucial aspects of schema-on-read solutions, specifically:

  • The file format of the table's underlying data files (ORC, in this case)
  • The location in the data lake where those files are stored

Step 7: Locate data

Notice that you did not define a specific folder location when you created this table. Although a specific location can be defined if needed, it can also be left undefined. When left to the default, the location is based on where the schema itself is rooted in the data lake hierarchy.

Recall that when you created the tmp_cat catalog in the prerequisite tutorial, you provided the following details for the Metastore configuration:

  • Default directory name: projects
  • AWS S3 bucket name: starburst-tutorials

By default, each schema is stored in a folder below the Default directory name. In this case, the default directory name is projects and the bucket name is starburst-tutorials. Each table then gets its own folder directly below the schema's folder.

This means your table's base directory is projects/tmp_firstname_lastname_postalcode/nation. If you had access to AWS S3, you would see a directory tree similar to the following:
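
starburst-tutorials
└── projects
    └── tmp_firstname_lastname_postalcode
        └── nation
            └── <data file>

(The tree above is an illustration; the actual data file name will differ.)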

The nation folder contains data files. In this case, you only have a single file because the table only has 25 rows.

Background

Data Definition Language (DDL) and Data Manipulation Language (DML) are foundational to relational databases. However, they can also be used to create and modify objects in data lakes. Because the structure of a data lake differs from that of a database, there are some key differences in how DDL and DML are employed. In this section, you'll observe some of those differences, beginning with the creation of a new table.

Step 1: Create friends table

Let's begin by creating a new, managed table to hold information about your friends.

CREATE TABLE friends (
   name varchar(100),
   year_met smallint,
   fav_color varchar(50),
   nation_born bigint
);

Step 2: Verify file format

If you do not specify a file format in your CREATE TABLE statement, the default format will be used. In this case, the default format is ORC. You can use a SHOW CREATE TABLE statement to verify this.

SHOW CREATE TABLE friends;
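
The WITH clause at the end of the output should include format = 'ORC'. If you wanted a different file format, you could declare it explicitly when creating the table; the following variation (illustrative, not part of this tutorial's steps, with a hypothetical table name) shows how:

CREATE TABLE friends_parquet (
   name varchar(100)
)
WITH (format = 'PARQUET');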

Step 3: Add record to friends table

Let's add one record to the table for now. Later we'll add more.

INSERT INTO
 friends (name, year_met, fav_color, nation_born)
VALUES
 ('Fred', 2002, 'brown', 13);

SELECT * FROM friends;

Step 4: View data files

As before, you do not have access to the S3 bucket, but we are providing screenshots of what you would see if you did have access.

Step 5: Add multiple records to table

Now it's time to insert three records into the friends table at once. Afterward, we'll once again provide a screenshot of the S3 bucket for your reference.

INSERT INTO
 friends (name, year_met, fav_color, nation_born)
VALUES
 ('Susie', 2012, 'blue', 9),
 ('Bertha', 1999, 'pink', 24),
 ('Franklin', 2019, 'green', 24);

Step 6: Create duplicate record

Let's see what happens if we try to run the first INSERT statement again. Will it create a duplicate record?

INSERT INTO
 friends (name, year_met, fav_color, nation_born)
VALUES
 ('Fred', 2002, 'brown', 13);
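
Yes, it does. Hive tables do not enforce primary keys or uniqueness constraints, so the statement succeeds and the table now contains two identical rows for Fred. You can confirm this with a quick query:

SELECT * FROM friends WHERE name = 'Fred';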

Step 7: Update a record

Oftentimes, it is necessary to update records in a table. Let's try to do this with the friends table.

UPDATE friends
   SET fav_color = 'purple' 
 WHERE name = 'Susie';

DELETE FROM friends
 WHERE name = 'Susie';
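
Both statements fail with an error. As explained below, this Hive table is not transactional, so individual rows cannot be modified or deleted in place.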

We will not address partitions in this tutorial, but Hive tables are generally insert-only in nature. There are, however, alternate table formats, such as Apache Iceberg and Delta Lake, that allow full INSERT, UPDATE, and DELETE functionality, as well as use of the MERGE command.

Modifying existing data in a Hive table requires additional effort to rewrite the underlying files that the table is queried from. Some of these approaches will be covered in a future data pipelines tutorial.

Step 8: Drop friends table

We've completed our work with the friends table, so you can safely drop it.

DROP TABLE friends;

Background

Up to this point, you have been creating Hive tables that are sometimes called managed tables. Generally speaking, you let the system determine the underlying storage location, as the earlier example explained. While each table is tied to a specific directory and a known file type, you do not typically place files on the filesystem directly; the Trino engine does the heavy lifting for you when you run a command such as INSERT.

Additionally, when a DROP TABLE command is issued on a managed table, the underlying data is deleted, along with the table's metadata.

External tables are the other primary type of Hive table. They are often located somewhere in the data lake other than where Trino would store them by default. You can identify these tables by the presence of the external_location property in their DDL.

Step 1: Create external table

In preparation for this tutorial, we've placed several CSV files in our S3 bucket. You are going to create an external table over those files.

CREATE TABLE
 external_logs (
   event_time TIMESTAMP,
   ip_address VARCHAR(15),
   app_name VARCHAR(25),
   process_id SMALLINT,
   log_type VARCHAR(15),
   log_level VARCHAR(15),
   message_id VARCHAR(15),
   message_details VARCHAR(555)
 )
WITH
 (
   external_location = 's3://starburst-tutorials/projects/serverlogs/logs_5min_ingest_csv/',
   format = 'TEXTFILE',
   textfile_field_separator = ','
 );

SELECT * FROM external_logs;

Step 2: View data files

External tables are often utilized when another system is creating the files. For example, a frequent scenario is defining an external table over recently ingested files as they land in the data lake. Because the table is simply a set of files in the lake, other frameworks and systems can both populate the table and consume its data easily.

In the ingestion model, you often see similar-sized files placed in the table's underlying directory periodically. Starburst does not care what these files are named, but depending on the system writing them, the file names themselves may carry useful information. You may be able to decipher such a pattern in the example files backing the table you just created.

Remember, you do not have access to S3 directly and these screenshots are presented to aid in your understanding.

Step 3: Drop external table

Recall that when you drop an external table, only the table metadata is deleted; the underlying files remain. Let's test that now.

DROP TABLE external_logs;
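
To confirm the files survived (an extra check beyond the original steps), you could re-run the CREATE TABLE statement from Step 1 and then query the table again:

SELECT count(*) FROM external_logs;

The same rows come back, because dropping the external table removed only its metadata, not the CSV files in S3.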

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You should now have a better understanding of the basics of Hive tables, including how to create managed and external tables. You should also be aware of some of the limitations when working with Hive tables.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
