Last Updated: 2024-10-24
Several factors can affect the performance of a data lake. Ensuring optimal performance requires attention to how data is ingested, stored, and processed. This tutorial focuses on how table partitioning and bucketing, used independently or together, can improve data lake performance. Through hands-on exercises, you will see that when a query's filter aligns with the partitioning or bucketing column, far less data is read than a full table scan would require.
Once you've completed this tutorial, you will be able to explain how partitioning and bucketing work, apply both techniques to tables in Starburst Galaxy, and use the query details page to verify how much data each query reads.
Table partitioning is an optimization technique that data lakes employ as an alternative to indexing. A partition divides a table's data into subdirectories designed around logical groupings: the field(s) you partition a table on become subfolders of the table's data lake folder. When a WHERE clause is aligned to the partitioned field, only the folder(s) that meet the particular criteria are read – the rest are ignored.
For example, suppose you have five years of historical data. Rather than putting all of that data into a single folder, you may choose to create a folder for each year, ensuring that the data for each year is in the appropriate folder. This is considered partitioning by year. Efficiency is gained when a query includes the partition key in the predicate or join, because the query engine can avoid reading partitions that don't match.
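As a sketch of what year-based partitioning might look like in Trino's Hive connector (the sales table and its columns here are hypothetical, not part of this tutorial):

```sql
-- Hypothetical example: partition five years of sales data by year.
-- Note that the partition column must be the last column in the SELECT list.
CREATE TABLE sales_by_year
WITH (
    type = 'hive',
    partitioned_by = ARRAY['year']
)
AS
SELECT order_id, amount, year
FROM sales;
```

Each distinct year value becomes its own subfolder (for example, year=2023/) under the table's folder.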
Your first exercise will be to create two tables with identical data. The only difference will be that one is partitioned and one is not. This will allow you to compare the amount of data being read when each table is queried. You'll begin with the unpartitioned table.
When you use the query editor in Starburst Galaxy, you have the option to set the cluster, catalog, and schema for your session. This lets you avoid typing out the fully-qualified table name every time you write a query. As a reminder, you should have created the catalog and schema you'll be using during the prerequisite tutorial.
Cluster: aws-us-east-1-free
Catalog: tmp_cat
Schema: tmp_firstname_lastname_postalcode
Next, it's time to create a new, unpartitioned table following the format and contents of the tpch.sf1.orders table:

CREATE TABLE orders_no_pttn
WITH (
type = 'hive'
)
AS
SELECT *
FROM tpch.sf1.orders;
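If you'd like to confirm what was created, Trino's SHOW CREATE TABLE statement displays the table's DDL, including any table properties:

```sql
SHOW CREATE TABLE orders_no_pttn;
```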
Now that you've created a new table, it's time to query it. After you run the query, you'll observe how much data was read from your unpartitioned table.
SELECT * FROM orders_no_pttn
WHERE totalprice >= 430000.00;
When you run a query in Starburst Galaxy, you can learn a lot of information about the query by going to the query details page.
Notice the amount of data read from the orders_no_pttn table, which accounted for 33.7MB of data. That equates to a full table scan of all records from the data lake.

Let's enhance the filtering of the previous query by adding a filter on the orderpriority field.
SELECT * FROM orders_no_pttn
WHERE totalprice >= 430000.00
AND orderpriority = '4-NOT SPECIFIED';
In the last section, you created and queried an unpartitioned table. Through that exercise, you learned that when a table is unpartitioned, the entire table is scanned during query execution, even when filters are applied.
In this section, you'll create another orders table, but this one will be partitioned on orderpriority. This will allow you to verify that less data is read on partitioned tables when the WHERE
clause is aligned to the partitioned field, thus enhancing the performance of the data lake.
To create a partitioned version of the orders
table, you'll need to modify the CTAS statement you used to create the unpartitioned table.
Add the partitioned_by clause to the CTAS statement:

CREATE TABLE orders_pttn_priority
WITH (
partitioned_by = ARRAY['orderpriority'],
type = 'hive'
)
AS
SELECT *
FROM tpch.sf1.orders;
However, this statement fails: Hive partition columns must be the last columns in the table, and SELECT * places orderpriority in the middle of the column list. Rewrite the statement with an explicit column list so that orderpriority is last:

CREATE TABLE orders_pttn_priority
WITH (
partitioned_by = ARRAY['orderpriority'],
type = 'hive'
)
AS
SELECT orderkey, custkey, orderstatus, totalprice,
orderdate, clerk, shippriority, comment,
orderpriority
FROM tpch.sf1.orders;
It's time to query the partitioned table, using the same query from Step 3 in the previous section. This query does not include the partition column in the WHERE
clause, which means that the entire table will still be scanned. After you've queried the table, you'll see how to use a query plan to confirm a full table scan.
SELECT * FROM orders_pttn_priority
WHERE totalprice >= 430000.00;
You can also review the Query plan to see if a full table scan occurred. You should already be on the Query details page under the Advanced tab. The Query plan is located at the bottom of the page.
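If you prefer to stay in SQL, Trino's EXPLAIN statement produces a query plan without running the query. This is a sketch of the idea; the exact plan text varies by Trino version:

```sql
EXPLAIN
SELECT * FROM orders_pttn_priority
WHERE totalprice >= 430000.00;
```

With no predicate on orderpriority, the plan's table scan covers every partition.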
Notice that the plan shows every value of the orderpriority column was read. This is further evidence that a full table scan occurred.

In this step, you'll take advantage of the partitioning on your table by adding an orderpriority filter.
SELECT * FROM orders_pttn_priority
WHERE totalprice >= 430000.00
AND orderpriority = '4-NOT SPECIFIED';
In our example, 300K rows and 6.92 MB were read. This represents about 20% of the full table scan's metrics. The sample data from TPCH was relatively evenly distributed across the five order priority values, which provides anecdotal evidence that only one of the partition folders was read. In the next step, you'll confirm that by looking at the query plan.
Let's take a look at the query plan to further confirm that only one partition folder was read during the execution of the last query. You should already be in the Advanced tab of the Query details page.
Locate the orderpriority row in the query plan.

Note that only one folder was read. This solidifies the anecdotal evidence with actual results. When the query engine can read a subset of the partition folders, this is referred to as partition pruning.
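You can also inspect the partitions directly from SQL. The Hive connector exposes a hidden $partitions table for each partitioned table (note the quoting, since $ is not a standard identifier character):

```sql
SELECT * FROM "orders_pttn_priority$partitions";
```

This should return one row per orderpriority value, one for each partition folder.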
Table bucketing is a data lake technique where tables are divided into fixed-sized, smaller segments called buckets based on the values of a specific column (or columns). Bucketing optimizes data storage and speeds up query performance, especially for queries involving joins or aggregations on the bucketed columns.
In table bucketing, a hash function is applied to the value of the bucketing column(s), and the result determines which of a fixed number of buckets each row is written to.
For example, if a table is bucketed by user_id
into 4 buckets, each unique user_id
is hashed to determine its bucket. Rows with the same hash value (for user_id
) end up in the same bucket.
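As a rough illustration of that assignment: for integer keys under Hive's default bucketing scheme, the hash is derived from the value itself, so the bucket number is approximately the value modulo the bucket count. This is a conceptual sketch only; the exact hash function is connector- and type-specific:

```sql
-- Conceptual approximation of bucket assignment for an integer key
-- with 4 buckets. The real hash function differs in detail.
SELECT custkey,
       mod(custkey, 4) AS bucket
FROM tpch.sf1.orders
LIMIT 5;
```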
Let's take a look at this in practice.
In this section, you are going to recreate the orders
table, this time bucketing it on the customer identifier column.
Run the following CTAS statement to create a bucketed version of the orders table, bucketed on custkey:

CREATE TABLE orders_bckt_custkey
WITH (
bucketed_by = ARRAY['custkey'],
bucket_count = 4,
type = 'hive'
)
AS
SELECT *
FROM tpch.sf1.orders;
You do not have access to review the data in S3, but here is an example of the contents of this table. The data was evenly spread by custkey
into four separate files of similar size. The sum of these file sizes adds up to about the size of the full table.
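Even without S3 access, you can see the files from SQL. The Hive connector exposes a hidden $path column holding the file each row was read from:

```sql
SELECT "$path" AS file, count(*) AS row_count
FROM orders_bckt_custkey
GROUP BY "$path";
```

You should see four files, one per bucket, with roughly equal row counts.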
In this step, you're going to run the same query on two different tables. One will be the unpartitioned, unbucketed orders
table, and the other will be the bucketed table you just created. This exercise will compare the amount of data processed for each query.
First, run the following query against the unpartitioned orders_no_pttn table:

SELECT *
FROM orders_no_pttn
WHERE custkey = 130000;
Next, run the same query against the orders_bckt_custkey table:

SELECT *
FROM orders_bckt_custkey
WHERE custkey = 130000;
Notice how much less data was read from the bucketed table. The query engine knows that rows with the filtered custkey could only be in one of the four buckets you created, so only that bucket's file was scanned.

Remember, partitioning and bucketing can be used independently or in combination.
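For instance, a combined version of the orders table might be partitioned on orderpriority and bucketed on custkey in a single CTAS. This is a sketch (the table name is hypothetical); as before, the partition column must come last:

```sql
CREATE TABLE orders_pttn_bckt
WITH (
    type = 'hive',
    partitioned_by = ARRAY['orderpriority'],
    bucketed_by = ARRAY['custkey'],
    bucket_count = 4
)
AS
SELECT orderkey, custkey, orderstatus, totalprice,
       orderdate, clerk, shippriority, comment,
       orderpriority
FROM tpch.sf1.orders;
```

Queries filtering on both columns can then benefit from partition pruning and bucket pruning at the same time.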
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You should now have a better understanding of how partitioning and bucketing can improve data lake performance. The data used in this tutorial was relatively small in size, which means that the performance improvements weren't very dramatic. With larger volumes of data you will see a much more obvious improvement.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!