Last Updated: 2024-10-24

Background

Several factors can affect the performance of a data lake, and ensuring optimal performance requires attention to how data is ingested, stored, and processed. This tutorial focuses on how table partitioning and bucketing, used independently or together, can improve data lake performance. Through hands-on exercises, you will see how both strategies let a query read far less data than a full table scan when a filter aligns with the partitioning or bucketing column.

Learning objectives

Once you've completed this tutorial, you will be able to:

Prerequisites

Background

Table partitioning is an optimization technique that data lakes employ as an alternative to indexing. A partition divides a table's data into subdirectories designed around logical groupings: the field(s) you partition a table on become subfolders of the table's data lake folder. When a WHERE clause aligns with a partitioned field, only the folder(s) that meet the criteria are read; the rest are ignored.

For example, suppose you have five years of historical data. Rather than putting all of that data into a single folder, you may choose to create a folder for each year, ensuring that the data for each year is in the appropriate folder. This is considered partitioning by year. Efficiency is gained when a query includes the partition key in the predicate or join, because the query engine can avoid reading partitions that don't match.
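As a minimal sketch of the idea, the year-partitioned table described above might be created like this in the Hive connector (the table and column names here are hypothetical, not part of the exercises):

```sql
-- Hypothetical sketch: partition five years of sales data by year.
-- In a Hive CTAS, the partition column must be listed last.
CREATE TABLE sales_by_year
   WITH (
      partitioned_by = ARRAY['order_year'],
      type = 'hive'
   )
AS
   SELECT order_id, total_price, order_year
     FROM raw_sales;

-- A predicate aligned with the partition key reads only the
-- matching year's folder, not all five.
SELECT sum(total_price)
  FROM sales_by_year
 WHERE order_year = 2023;
```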

Background

Your first exercise will be to create two tables with identical data. The only difference will be that one is partitioned and one is not. This will allow you to compare the amount of data being read when each table is queried. You'll begin with the unpartitioned table.

Step 1: Verify cluster, catalog, and schema

When you use the query editor in Starburst Galaxy, you have the option to set the cluster, catalog, and schema for your session. This lets you avoid typing out the fully-qualified table name every time you write a query. As a reminder, you should have created the catalog and schema you'll be using during the prerequisite tutorial.
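If you prefer SQL, you can also set the session context with a USE statement; the catalog and schema names below are placeholders for the ones you created in the prerequisite tutorial:

```sql
-- Placeholder names: substitute your own catalog and schema.
USE my_catalog.my_schema;
```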

Step 2: Create table

Next, it's time to create a new, unpartitioned table following the format and contents of the tpch.sf1.orders table.

CREATE TABLE orders_no_pttn
   WITH (
      type = 'hive'
   )
AS
   SELECT *
     FROM tpch.sf1.orders;

Step 3: Query new table

Now that you've created a new table, it's time to query it. After you run the query, you'll observe how much data was read from your unpartitioned table.

SELECT *
  FROM orders_no_pttn
 WHERE totalprice >= 430000.00;

Step 4: Locate metrics for data read

When you run a query in Starburst Galaxy, you can learn a great deal about it by going to the Query details page.

Step 5: Add additional filter to query

Let's enhance the filtering of the previous query by adding a filter on the orderpriority field.

SELECT *
  FROM orders_no_pttn
 WHERE totalprice >= 430000.00
   AND orderpriority = '4-NOT SPECIFIED';

Background

In the last section, you created and queried an unpartitioned table. Through that exercise, you learned that when a table is unpartitioned, the entire table is scanned during query execution, even when filters are applied.

In this section, you'll create another orders table, but this one will be partitioned on orderpriority. This will allow you to verify that less data is read on partitioned tables when the WHERE clause is aligned to the partitioned field, thus enhancing the performance of the data lake.

Step 1: Create partitioned table

To create a partitioned version of the orders table, you'll need to modify the CTAS statement you used to create the unpartitioned table.

CREATE TABLE orders_pttn_priority
   WITH (
      partitioned_by = ARRAY['orderpriority'],
      type = 'hive'
   )
AS
   SELECT *
     FROM tpch.sf1.orders;

If you run this statement as written, it fails: the Hive connector requires partition columns to be the last columns in the table, and orderpriority is not the last column of tpch.sf1.orders. Modify the SELECT to list the columns explicitly, with orderpriority last:

CREATE TABLE orders_pttn_priority
   WITH (
      partitioned_by = ARRAY['orderpriority'],
      type = 'hive'
   )
AS
   SELECT orderkey, custkey, orderstatus, totalprice,
          orderdate, clerk, shippriority, comment,
          orderpriority
     FROM tpch.sf1.orders;

Step 2: Query partitioned table

It's time to query the partitioned table, using the same query from Step 3 in the previous section. This query does not include the partition column in the WHERE clause, which means that the entire table will still be scanned. After you've queried the table, you'll see how to use a query plan to confirm a full table scan.

SELECT *
  FROM orders_pttn_priority
 WHERE totalprice >= 430000.00;

Step 3: Review query plan

You can also review the Query plan to see if a full table scan occurred. You should already be on the Query details page under the Advanced tab. The Query plan is located at the bottom of the page.
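As an alternative to the UI, you can print the plan directly in the query editor with Trino's EXPLAIN statement; look at the table scan stage to see whether any partitions were eliminated:

```sql
-- Show the query plan without executing the query.
EXPLAIN
SELECT *
  FROM orders_pttn_priority
 WHERE totalprice >= 430000.00;
```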

Step 4: Leverage partition pruning

In this step, you'll take advantage of the partitioning on your table by adding an orderpriority filter.

SELECT *
  FROM orders_pttn_priority
 WHERE totalprice >= 430000.00
   AND orderpriority = '4-NOT SPECIFIED';

In our example, 300K rows and 6.92 MB were read. This represents about 20% of the full table scan's metrics. The sample data from TPCH was relatively evenly distributed across the five order priority values, which provides anecdotal evidence that only one of the partition folders was read. In the next step, you'll confirm that by looking at the query plan.

Step 5: Review query plan

Let's take a look at the query plan to further confirm that only one partition folder was read during the execution of the last query. You should already be in the Advanced tab of the Query details page.

Note that only one folder was read. This solidifies the anecdotal evidence with actual results. When the query engine can read a subset of the partition folders, this is referred to as partition pruning.

Background

Table bucketing is a data lake technique in which a table's data is divided into a fixed number of smaller segments, called buckets, based on a hash of a specific column (or columns). Bucketing optimizes data storage and speeds up query performance, especially for queries involving joins or aggregations on the bucketed columns.

How it works

In table bucketing:

  • You choose one or more bucketing columns and a fixed bucket count when the table is created.
  • Each row is assigned to a bucket by hashing its bucketing column value(s) modulo the bucket count.
  • Rows with the same bucketing value always land in the same bucket file.

For example, if a table is bucketed by user_id into 4 buckets, each unique user_id is hashed to determine its bucket. Rows with the same hash value (for user_id) end up in the same bucket.

Let's take a look at this in practice.

Background

In this section, you are going to recreate the orders table, this time bucketing it on the customer identifier column.

Step 1: Create table

CREATE TABLE orders_bckt_custkey
   WITH (
      bucketed_by = ARRAY['custkey'],
      bucket_count = 4,
      type = 'hive'
   )
AS
   SELECT *
     FROM tpch.sf1.orders;

You do not have access to review the data in S3, but here is an example of the contents of this table. The data was spread evenly by custkey into four separate files of similar size, and the sum of their sizes adds up to roughly the size of the full table.
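Although you can't browse S3 directly, the Hive connector exposes a hidden "$path" column, so you can approximate this view yourself with a query like the following sketch:

```sql
-- Count the rows written to each bucket file.
-- "$path" is a hidden column exposed by the Hive connector.
SELECT "$path" AS file, count(*) AS num_rows
  FROM orders_bckt_custkey
 GROUP BY 1
 ORDER BY 1;
```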

Step 2: Query two tables

In this step, you're going to run the same query on two different tables. One will be the unpartitioned, unbucketed orders table, and the other will be the bucketed table you just created. This exercise will compare the amount of data processed for each query.

SELECT *
  FROM orders_no_pttn
 WHERE custkey = 130000;

SELECT *
  FROM orders_bckt_custkey
 WHERE custkey = 130000;

Remember, partitioning and bucketing can be used independently or in combination.
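As a sketch of what combining them might look like, using the same Hive table properties seen earlier (the partition column must still be listed last; the table name here is hypothetical):

```sql
-- Hypothetical table that is both partitioned and bucketed.
CREATE TABLE orders_pttn_bckt
   WITH (
      partitioned_by = ARRAY['orderpriority'],
      bucketed_by = ARRAY['custkey'],
      bucket_count = 4,
      type = 'hive'
   )
AS
   SELECT orderkey, custkey, orderstatus, totalprice,
          orderdate, clerk, shippriority, comment,
          orderpriority
     FROM tpch.sf1.orders;
```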

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You should now have a better understanding of how partitioning and bucketing can improve data lake performance. The data used in this tutorial was relatively small in size, which means that the performance improvements weren't very dramatic. With larger volumes of data you will see a much more obvious improvement.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
