Number of splits using the Hive connector

While performing some tests on file sizes and formats with the Hive table format, I noticed the number of splits determined by Trino (the tests were run on Starburst Galaxy with a Great Lakes connectivity catalog) was not exactly what I was expecting, so I dug into the docs to figure it out and thought I’d share my findings here.

This information is part of a larger blog post I previously published at https://lestermartin.wordpress.com/2023/05/18/determining-nbr-splits-w-trino-starburst-galaxy-hive-table-format/ in case you want to take a peek at that.

Here are some details of the two tables I was testing with: 5min_ingest was made up of 10,080 small files of about 500KB each, while daily_rollup held the same data compacted into 105 files of about 50MB each.

Of course, I got a massive performance improvement with the fewer, more reasonably sized files than with the 10,080 small files. That said, the behind-the-scenes breakdown of the number of splits was not exactly what I had imagined, and I wanted to find out more.

The 5min_ingest table’s behavior was easy enough to decipher; it had 10,080 splits, which aligned with the 10,080 files.

The mystery came when I queried daily_rollup, which is a compacted version of the same data; the compaction created larger (thus fewer) files, one for each day. I imagined these 105 somewhat reasonably sized files would have become 105 splits (i.e., one per file, as I saw with the small files). I ended up getting 205, which puzzled me. I (correctly) figured that the engine was breaking the files into 2 sections, thus giving me 2 splits per file, but the math didn’t really add up: that should be 210, not 205. My head began to hurt wondering what happened.

Yep, RTFM time! These 3 Hive connector properties are what I needed to look at a bit more: hive.max-initial-splits, hive.max-initial-split-size, and hive.max-split-size.
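On a self-managed Trino cluster these live in the Hive catalog’s properties file; here’s a minimal sketch showing the defaults as documented at the time of writing (note that hive.max-initial-split-size defaults to half of hive.max-split-size):

```
# etc/catalog/<hive-catalog>.properties -- default values shown
hive.max-initial-splits=200
hive.max-initial-split-size=32MB
hive.max-split-size=64MB
```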

I ran through my two table setups from above and YAY!… IT ALL MADE SENSE! Let’s do the math (there’s also a small Python sketch of this accounting right after the list)…

  • 5min_ingest table (10,080 files; 500KB each)
    • Table scan yielded 10,080 files, all smaller than hive.max-initial-split-size, so the first 200 splits were simply the first 200 of these files
    • Once the hive.max-initial-splits threshold of 200 was reached, each of the remaining 9,880 files was still smaller than the hive.max-split-size limit, so each was assigned its own split
    • 200 + 9,880 = 10,080 splits
  • daily_rollup table (105 files; 50MB each)
    • Table scan yielded 105 files, all larger than hive.max-initial-split-size
    • For the first 100 of these 50MB files, the initial 32MB (ref: hive.max-initial-split-size) was carved off as one split, and the remaining 18MB (which was not larger than hive.max-split-size) became a second split; those 100 files yielded 200 splits, which hit the hive.max-initial-splits value of 200
    • The remaining 5 files were each under the hive.max-split-size limit, so each was assigned its own split
    • 200 + 5 = 205 splits
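
To double-check that this accounting generalizes, here’s a minimal Python sketch of the split math as I understand it. This is my model of the documented behavior, not Trino’s actual code, and it ignores real-world wrinkles such as storage block boundaries and whether a given file format is splittable at all:

```python
def count_splits(file_sizes_bytes,
                 max_initial_splits=200,                    # hive.max-initial-splits
                 max_initial_split_size=32 * 1024 * 1024,   # hive.max-initial-split-size (32MB)
                 max_split_size=64 * 1024 * 1024):          # hive.max-split-size (64MB)
    """Model of how many splits the Hive connector creates for a set of files."""
    splits = 0
    for size in file_sizes_bytes:
        remaining = size
        while remaining > 0:
            # Until max_initial_splits splits exist, carve splits at the smaller
            # initial size; after that, use the regular (larger) split size.
            target = max_initial_split_size if splits < max_initial_splits else max_split_size
            splits += 1
            remaining -= target  # a split covers min(remaining, target) bytes
    return splits

# 5min_ingest: 10,080 files of ~500KB -> one split per file
print(count_splits([500 * 1024] * 10_080))      # 10080

# daily_rollup: 105 files of ~50MB -> 2 splits/file for the first 100, then 1 each
print(count_splits([50 * 1024 * 1024] * 105))   # 205
```

Running it reproduces both results above: 10,080 splits for 5min_ingest and 205 for daily_rollup.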