While performing some tests on file sizes and formats with the Hive table format, I noticed that the number of splits determined by Trino (the tests were run on Starburst Galaxy with a Great Lakes connectivity catalog) was not exactly what I was expecting, so I dug into the docs to figure it out and thought I’d share my findings here.
This information is part of a larger blog post I previously published at https://lestermartin.wordpress.com/2023/05/18/determining-nbr-splits-w-trino-starburst-galaxy-hive-table-format/ in case you want to take a peek at that.
Here are some details of the two tables I was testing with.

| table | number of files | approximate file size |
| --- | --- | --- |
| `5min_ingest` | 10,080 | 500KB each |
| `daily_rollup` | 105 | 50MB each |
Of course, I got a massive performance improvement with the fewer, more reasonably sized files than with the 10,080 small files. That said, the behind-the-scenes breakdown of the number of splits was not exactly what I had imagined, and I wanted to find out more.
The `5min_ingest` table’s behavior was easy enough to decipher; it had 10,080 splits, which aligned to the 10,080 files.
The mystery came when I queried `daily_rollup`, which is a compacted version of the same data. Ideally, the compaction could have created larger (and thus fewer) files each day. I imagined these 105 somewhat reasonably sized files would have become 105 splits (i.e. one per file, as I saw with the small files). I ended up getting 205, which puzzled me. I (correctly) figured that the engine was breaking the files into 2 sections, thus giving me 2 splits per file, but the math didn’t really add up. That should be 210, not 205. My head began to hurt wondering what happened.
Yep, RTFM time! These 3 properties are what I needed to look at a bit more: `hive.max-initial-split-size`, `hive.max-initial-splits`, and `hive.max-split-size`.
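To convince myself I understood them, I sketched the split-assignment logic as a tiny Python model. This is just my reading of the documented behavior, not Trino’s actual implementation (the real engine also factors in things like file format, block boundaries, and split weighting); the 32MB, 64MB, and 200 values below match the documented defaults for these three properties.

```python
KB = 1024
MB = 1024 * 1024

def estimate_splits(file_sizes,
                    max_initial_split_size=32 * MB,  # hive.max-initial-split-size default
                    max_split_size=64 * MB,          # hive.max-split-size default
                    max_initial_splits=200):         # hive.max-initial-splits default
    """Rough model of how the Hive connector carves files into splits."""
    splits = 0
    for size in file_sizes:
        remaining = size
        while remaining > 0:
            # The smaller "initial" split size applies until the engine has
            # handed out max_initial_splits splits; after that, the larger
            # max_split_size takes over.
            target = (max_initial_split_size
                      if splits < max_initial_splits
                      else max_split_size)
            remaining -= min(remaining, target)
            splits += 1
    return splits
```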
I ran through my two table setups from above and YAY!… IT ALL MADE SENSE! Let’s do the math…
`5min_ingest` table (10,080 files; 500KB each)

- Table scan yielded 10,080 files, all smaller than `hive.max-initial-split-size`, so the first 200 splits were 200 of these files
- Once the `hive.max-initial-splits` threshold was reached, each of the remaining 9,880 files was still smaller than the `hive.max-split-size` limit, so each was assigned its own split
- 200 + 9,880 = 10,080 splits
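Plugging the `5min_ingest` numbers into the sketch above confirms the count:

```python
print(estimate_splits([500 * KB] * 10_080))  # 10080 -- one split per file
```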
`daily_rollup` table (105 files; 50MB each)

- Table scan yielded 105 files, all larger than `hive.max-initial-split-size`
- For each of the first 100 of these 50MB files, the initial 32MB (ref: `hive.max-initial-split-size`) was used to create a split, and the remaining 18MB (which was not larger than `hive.max-split-size`) became a second split; those 100 files account for the 200 splits allowed by `hive.max-initial-splits`
- The remaining 5 files were each under the `hive.max-split-size` limit, so each was assigned to a single split
- 200 + 5 = 205 splits
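And the `daily_rollup` numbers reproduce the once-puzzling 205:

```python
print(estimate_splits([50 * MB] * 105))  # 205 -- 2 splits for each of the
                                         # first 100 files, then 1 for each
                                         # of the last 5
```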