Validating predicate pushdown on delta lake tables

This post is here to validate that predicate pushdown (specifically on a day-level and using string values) works in Trino using Delta Lake tables. The following bits were run on Starburst Galaxy, but should work on any Trino instance using the Delta Lake connector.

I created this table in Galaxy within a object storage connector (aka the Great Lakes connector).

CREATE TABLE orders_noparts (
    order_key bigint, order_status varchar, total_price double,
    order_timestamp timestamp
) WITH (
    type='delta'
);

I populated this table with 30K records from the TPC-H connector and upgraded orderdate from DATE to TIMESTAMP.

-- added 15K records with fake 9am time added
INSERT INTO orders_noparts
SELECT orderkey, orderstatus, totalprice,
       date_add('hour', 9, cast(orderdate as timestamp))
  FROM tpch.tiny.orders;

-- added 15K records with fake 9pm time added
INSERT INTO orders_noparts
SELECT orderkey, orderstatus, totalprice,
       date_add('hour', 21, cast(orderdate as timestamp))
  FROM tpch.tiny.orders;

Ran the following for some quick validation of what I’ve done so far.

-- verify 30K records
SELECT count() FROM orders_noparts;
-- verify NO partitions
SELECT * FROM "orders_noparts$partitions";

Created a new table, but set up with day-level partitions with the assumption the TIMESTAMP version of the order date is still present PLUS the addition of a VARCHAR version at the day-level of granularity to be used as the partitioned column.

-- new table with day-level partitioning
CREATE TABLE orders_day_part (
    order_key bigint, order_status varchar, total_price double,
    order_timestamp timestamp,
    order_day_part varchar
) WITH (
    type='delta',
    partitioned_by = ARRAY['order_day_part']
);

To keep the number of partitions manageable, I only added about 1500 records across 4 specific months (yields about 120 partitions (4 months x 30 days/month)).

INSERT INTO orders_day_part 
SELECT *, substring(to_iso8601(order_timestamp), 1, 10)
  FROM orders_noparts
 WHERE extract(YEAR from order_timestamp) = 1998
   AND extract(MONTH from order_timestamp) IN (1, 2);

INSERT INTO orders_day_part 
SELECT *, substring(to_iso8601(order_timestamp), 1, 10)
  FROM orders_noparts
 WHERE extract(YEAR from order_timestamp) = 1997
   AND extract(MONTH from order_timestamp) IN (3, 4);

Validated those assumptions.

-- verify ~ 1500 records
SELECT count() FROM orders_day_part;
-- verify ~ 120 partitions
SELECT * FROM "orders_day_part$partitions";

Run a query that will look at all ~120 partitions.

SELECT avg(total_price) 
  FROM orders_day_part;

Verify that there is no partition pruning occurring in the query plan output (i.e. see that the next example’s output is NOT included here).

Run a query that only looks in those folders between 4/10 - 4/20 (i.e. only 11 of the partitions).

SELECT avg(total_price) 
  FROM orders_day_part
 WHERE order_day_part >= '1997-04-10'
   AND order_day_part <= '1997-04-20';

Yep, there’s the partition pruning output showing only the folders for 4/10 - 4/20 are being read.

order_day_part:varchar:PARTITION_KEY
    :: [[1997-04-10, 1997-04-20]]