Last Updated: 2024-09-04
As the first data lake table format, Apache Hive has long been a popular choice among data engineers. For this reason, it is a good option for those wanting to learn the basics of data lake tables.
Starburst Galaxy supports the Hive table format and can be used to easily create and manage Hive tables as you'll see in this tutorial.
This tutorial will show you how to perform basic database operations in Starburst Galaxy using the Hive table format. This includes creating tables using a number of different methods. The tutorial also introduces inserts, record updates, and external tables, and discusses the limitations of Hive.
Once you've completed this tutorial, you will be able to:
- Create Hive tables using CTAS and CREATE TABLE statements
- Insert records and observe how they are stored as files in the data lake
- Describe the limitations of Hive tables for updates and deletes
- Create and drop external tables
You'll begin your data lake table exploration by using CTAS to create a table. CTAS, or CREATE TABLE AS, creates a table that is populated with the results of a SELECT query. When you use this syntax, you are creating a managed table, with the option to define the location of the table.
In this section, you will create the table within the catalog and schema you created in the prerequisite tutorial.
When you use the query editor in Starburst Galaxy, you have the option to set the cluster, catalog, and schema for your session. This lets you avoid typing out the fully-qualified table name every time you write a query. For this tutorial, use the following settings:

Cluster: aws-us-east-1-free
Catalog: tmp_cat
Schema: tmp_firstname_lastname_postalcode
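If you prefer SQL over the drop-down menus, Trino's USE statement sets the catalog and schema for your session. A minimal sketch, assuming the catalog and schema names above:

```sql
-- Set the default catalog and schema for this session
USE tmp_cat.tmp_firstname_lastname_postalcode;

-- Unqualified table names now resolve against that schema
SHOW TABLES;
```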
Next, it's time to create a new table following the format and contents of the tpch.sf1.nation table.

To do this, start by running a query against the tpch.sf1.nation table to verify there are 25 rows present. This helps you ensure that you understand what information is held in the table before you recreate the same information.

Run the following query against the tpch.sf1.nation table:

SELECT * FROM tpch.sf1.nation;

Next, run the following CTAS statement to create a copy of the tpch.sf1.nation table:

CREATE TABLE nation AS
SELECT *
FROM tpch.sf1.nation;
There are two ways to verify that your new table exists and has the correct structure and content. Let's take a look at both options.
First, run a DESCRIBE statement to verify the table's structure:

DESCRIBE nation;

Second, use the EXCEPT command. Rather than visually comparing the results of SELECT queries on the original table and your nation table, run the EXCEPT query listed below. If it returns no rows, every row in your nation table also exists in the original table.

SELECT * FROM nation
EXCEPT
SELECT * FROM tpch.sf1.nation;
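Note that EXCEPT only checks one direction; a row present in the original table but missing from your copy would not show up. A quick sketch of the reverse check:

```sql
-- Rows in the original table that are missing from the copy;
-- this should also return no rows.
SELECT * FROM tpch.sf1.nation
EXCEPT
SELECT * FROM nation;
```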
When you create a table, you need to define the data structure that will be employed. This is known as the Data Definition Language (DDL). Starburst helps you automate this process.
Find the nation table in the list on the left of the query editor window. Hover over the nation table and click SHOW CREATE TABLE. This will copy the SQL to the query editor window.

After running the query, you should receive an output similar to the image below. There are two focus areas, indicated by the blue and red highlights.
The first focus area, noted in blue, defines the logical representation of the table. It provides information about the schema that will be applied to the table and is similar to the approach used to construct a table in other database systems.
The second focus area, noted in red, defines two other crucial aspects of schema-on-read solutions: the file format used for the underlying data and the location where those files are stored.
Notice that you did not define a specific folder location in the CREATE TABLE statement in the last step. Although a specific location can be defined if needed, it can also be left undefined. When left to the default, it is based on where the schema itself is rooted in the data lake hierarchy.
Recall that when you created the tmp_cat catalog in the prerequisite tutorial, you provided the following details for the Metastore configuration:

Default S3 bucket name: starburst-tutorials
Default directory name: projects

By default, each schema is stored in a folder named after the schema, below the default directory name. In this case, the default directory name is projects, and the bucket name is starburst-tutorials. The table's data is then stored in another folder directly below the schema's folder.
This means your table's base directory is projects/tmp_firstname_lastname_postalcode/nation. If you had access to AWS S3, you would see a directory tree similar to the following:
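Since the bucket cannot be browsed here, a rough sketch of that directory tree follows. The data file name is a placeholder; actual file names are generated by the engine.

```
starburst-tutorials/
└── projects/
    └── tmp_firstname_lastname_postalcode/
        └── nation/
            └── <generated data file>
```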
The nation folder contains data files. In this case, you only have a single file because the table only has 25 rows.
Data Definition Language (DDL) and Data Manipulation Language (DML) are foundational to relational databases. However, they can also be used to create and modify objects in data lakes. Because the structure of a data lake differs from that of a database, there are some key differences in how DDL and DML are employed. In this section, you'll observe some of those differences, beginning with the creation of a new table.
Let's begin by creating a new, managed table named friends to hold information about your friends.
CREATE TABLE friends (
name varchar(100),
year_met smallint,
fav_color varchar(50),
nation_born bigint
);
Notice that this statement does not include a WITH section. The location on the file system will default to something similar to //filesystem/catalog/schema/table. The folder will not be created until the table is populated with data. If you do not specify a file format in your CREATE TABLE statement, the default format will be used. In this case, the default format is ORC. You can use a SHOW CREATE TABLE statement to verify this.
SHOW CREATE TABLE friends;
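The output should include a WITH clause showing the format. A sketch of what to expect; the fully-qualified name reflects the catalog and schema from this tutorial, and your own output may include additional properties:

```sql
CREATE TABLE tmp_cat.tmp_firstname_lastname_postalcode.friends (
   name varchar(100),
   year_met smallint,
   fav_color varchar(50),
   nation_born bigint
)
WITH (
   format = 'ORC'
)
```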
Let's add one record to the friends table for now. Later we'll add more.
INSERT INTO friends (name, year_met, fav_color, nation_born)
VALUES
('Fred', 2002, 'brown', 13);
SELECT * FROM friends;
As before, you do not have access to the S3 bucket, but we are providing screenshots of what you would see if you did have access.
Now it's time to insert three records into the friends table at once. Afterward, we'll once again provide a screenshot of the S3 bucket for your reference.
Run the following statement to insert three records into the friends table:

INSERT INTO friends (name, year_met, fav_color, nation_born)
VALUES
('Susie', 2012, 'blue', 9),
('Bertha', 1999, 'pink', 24),
('Franklin', 2019, 'green', 24);
The bucket now contains two files: one from the first INSERT statement and the other from the second INSERT. The three recently inserted records are all located in the second file.

Let's see what happens if we try to run the first INSERT statement again. Will it create a duplicate record?
INSERT INTO friends (name, year_met, fav_color, nation_born)
VALUES
('Fred', 2002, 'brown', 13);
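After re-running the INSERT, you can check for duplicates with a quick aggregation. A sketch using the columns of the friends table:

```sql
-- Any row appearing more than once is a duplicate
SELECT name, year_met, fav_color, nation_born, count(*) AS copies
FROM friends
GROUP BY name, year_met, fav_color, nation_born
HAVING count(*) > 1;
```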
Query the friends table again. What do you see? The image below shows that one of your records has been duplicated. This happened because you did not declare a primary key (PK). It highlights a key point: when working with big data schema-on-read tables such as this, you will not have the chance to create (or validate) a primary key. Instead, data integrity activities like this are managed by the process that populates the table. This will be covered in greater detail in a future data pipelines tutorial.

Oftentimes, it is necessary to update records in a table. Let's try to do this with the friends table.
UPDATE friends
SET fav_color = 'purple'
WHERE name = 'Susie';
What about running a DELETE on this same row? Try it and see.

DELETE FROM friends
WHERE name = 'Susie';
We will not address partitions in this tutorial, but Hive tables are generally insert-only in nature. There are, however, alternate table formats that allow full INSERT, UPDATE, and DELETE functionality, as well as usage of the MERGE command.
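To illustrate what those other formats allow, here is a sketch of a MERGE statement. It assumes a hypothetical friends_v2 table created in a format that supports MERGE (for example, Apache Iceberg); it would fail against the Hive friends table.

```sql
-- Hypothetical: friends_v2 must use a table format that supports MERGE
MERGE INTO friends_v2 AS t
USING (
    VALUES ('Fred', 2002, 'brown', 13)
) AS s(name, year_met, fav_color, nation_born)
ON t.name = s.name
WHEN MATCHED THEN
    UPDATE SET fav_color = s.fav_color
WHEN NOT MATCHED THEN
    INSERT (name, year_met, fav_color, nation_born)
    VALUES (s.name, s.year_met, s.fav_color, s.nation_born);
```

Because the ON condition matches on name, re-running this statement updates the existing row instead of creating a duplicate.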
Additional efforts must be employed with Hive tables to recreate the underlying files from which the table gets queried. Some of these approaches will be covered in a future data pipelines tutorial.
We've completed our work with the friends table, so you can safely drop it.
DROP TABLE friends;
Up to this point, you have been creating Hive tables that are sometimes called managed tables. Generally speaking, you let the system identify where the underlying location will be, as the earlier example explained. While each table is configured to a specific directory and has a known file type, you do not typically place files on the filesystem directly. The Trino engine does the heavy lifting for you when you run a command such as INSERT.
Additionally, when a DROP TABLE command is issued on a managed table, the underlying data is deleted, along with the table's metadata.

External tables are the other primary type of Hive table. External tables are often located in a different data lake location than where Trino would typically store them. You can identify these tables by the presence of the external_location property in the DDL.
In preparation for this tutorial, we've placed several CSV files in our S3 bucket. You are going to create an external table over those files.
Run the following statement, which includes the external_location property telling Starburst where to look for the files:

CREATE TABLE external_logs (
    event_time TIMESTAMP,
    ip_address VARCHAR(15),
    app_name VARCHAR(25),
    process_id SMALLINT,
    log_type VARCHAR(15),
    log_level VARCHAR(15),
    message_id VARCHAR(15),
    message_details VARCHAR(555)
)
WITH (
    external_location = 's3://starburst-tutorials/serverlogs/logs_5min_ingest_csv/',
    format = 'TEXTFILE',
    textfile_field_separator = ','
);
Verify that the results of a SELECT statement against this table look appropriate:

SELECT * FROM external_logs;
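Because the table sits directly over the CSV files, you can run any SQL against it. A sketch that summarizes the logs by level; the actual log_level values in the files are not shown here, so no expected output is given:

```sql
-- Count log events by level, most frequent first
SELECT log_level, count(*) AS events
FROM external_logs
GROUP BY log_level
ORDER BY events DESC;
```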
External tables are often utilized when another system is creating the files. For example, a frequent scenario is an external table being used for recently ingested files that are being loaded into the data lake. In addition to allowing other systems to populate data into the table, these data lake files are easily consumable from other frameworks and systems.
In the ingestion model, you often see similar-sized files placed in the table's underlying directory periodically. Starburst does not care what these files are named, but depending on the system saving the files, you may get some heuristic information from the file names. You might be able to decipher the pattern from these example files being used in the table you just created.
Remember, you do not have access to S3 directly and these screenshots are presented to aid in your understanding.
The following screenshot shows the underlying files of the external_logs table:

Recall that when you drop an external table, only the table metadata is deleted; the underlying files remain. Let's test that now.
Drop the external_logs table:

DROP TABLE external_logs;
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You should now have a better understanding of the basics of Hive tables, including how to create managed and external tables. You should also be aware of some of the limitations when working with Hive tables.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!