Trino predicate pushdown for String type uses Min-Max Strategy

nikhil · September 5, 2022, 4:29am

Hi All,

I’m facing a problem with the way trino is doing predicate pushdown for string datatype. I could see in the parquet footer it is having [Min-Max strategy] even for the string datatype, due to this even though query is fetching me only 3 records trino is reading almost 95% of the total parquet files due to this my query execution time is too high.

Is there a way to handle this better for String datatype ?

Please do share docs or suggest me ways how to approach this problem.

Thankyou

lester · July 14, 2023, 4:10pm

With trino trying to see if the Parquet (or ORC) file MIGHT have the string value you are looking for, it is VERY LIKELY that the scenario you suggested below is going to happen often. Basically, the footer says I have min of ‘aaaaa’ and max of ‘zzzzz’ then trino says “yes, it is POSSIBLE that ‘lester’ is present” (hence your reading all the data to find a few records).

Ok, what else could you do? A classic Parquet/ORC trick is to rewrite your data by doing a total order sort on it thus creating something LIKE a secondary index. If you sort that string column you are regularly filtering on then it is very likely you can create a model where only a small percentage of the files might actually have a POSSIBILITY of having your records in them.

This works pretty awesome on your number-based columns, too!! The only drawback is that you have to do this somewhat regularly until the data itself settles down. For something like a time-based partition, you can usually wait until that partition is no longer being written to and THEN read/sort/write(/andDeleteTheOld) data again.

Although the blog post at Improving performance with Iceberg sorted tables is centered on Apache Iceberg (which itself allow Parquet/ORC file formats), it visualizes what happens at run time for any data lake table when you use Parquet (or ORC) files. For Iceberg, the including compaction process in the trino connector will ALSO take care of the sorting for you directly on the rewritten files (and manage it all in a transaction).

Oh… since I’m plugging Iceberg… all of those footer stats from the files are actually sorted in metadata files on the lake then all of this parsing of the footers for stats is even faster because trino doesn’t have to go touch all of the underlying files for this mix/max metadata.