How to determine the right number of buckets

I want to create a Hive table with buckets. I found this formula to determine the number of buckets to use:

table_size / block_size = x
find the smallest n such that 2^n > x, then use 2^n buckets
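
For example (made-up numbers): a 10 GB table with a 128 MB block size gives x = 10240 / 128 = 80; the smallest n with 2^n > 80 is 7, so the formula would suggest 2^7 = 128 buckets.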

To do this, I would like to know the block size used by the Hive connector. As far as I know, the default block size is 128 MB or 256 MB. How can I find out which one Starburst uses as the default?

I was also wondering how to determine the most efficient number of buckets when using Iceberg.

Thanks in advance!

I’m glad to see that you realize that TOO MANY buckets will cause too many small files. To me, the “right” number of buckets has two perspectives. First, what are you hoping to get from bucketing? Do you want to narrow down which file(s) get read when looking for a specific value of the bucket_by column, OR are you trying to achieve something like an SMB join (Trino calls it a colocated join) to help with performance in that space? Second, remember that a new set of bucketed files is created each time you add more data, so if you insert 10x a day and have 10 buckets defined, you’ll have 100 bucketed files at the end of the day.
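
To make that 10-inserts-a-day scenario concrete, here is a minimal sketch of a bucketed table using the Trino/Starburst Hive connector. The catalog, schema, and column names are made up; only the bucketed_by and bucket_count properties matter for the point above.

```sql
-- Hypothetical catalog/schema/columns; bucketed_by and bucket_count define the buckets.
CREATE TABLE hive.sales.orders (
    order_id    bigint,
    customer_id bigint,
    order_date  date
)
WITH (
    format       = 'ORC',
    bucketed_by  = ARRAY['customer_id'],
    bucket_count = 10
);

-- Each INSERT INTO hive.sales.orders ... writes up to 10 new bucket files,
-- so inserting 10 times a day leaves ~100 files for that table by day's end.
```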

I bring all that up because MOST tables do not really need to be bucketed, and are often hurt by it. We end up with an even worse small-files problem, since each of those small files is processed as its own split (similar to, but not exactly the same as, HDFS blocks).

Ok, so what is my answer to your question? As a good consultant always says, “it depends”. First, do some technical validation to confirm that your problem (coupled with your data design and data volume) is actually being addressed by bucketing, and second, make sure you are NOT making things worse with tiny files.

As far as the “# of blocks” goes – realize that Trino is not coupled to HDFS the way Hive is. HDFS is just one of many possible data lake repositories. Trino thinks in terms of splits. These two blog posts offer some insight into how the Hive and Iceberg connectors determine the number of splits.

Ultimately, this comes down to file sizes, and Starburst’s docs say 64-512 MB is the optimal range. MY OWN recommendation is to target files that are about 100 MB in size, or multiples of that as they get to 500 MB or bigger (not all files will be, or need to be, the same size, but MOST should be in this range; hitting it post-compaction is fine if compaction runs periodically enough).
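
If you are on Iceberg, one way to sanity-check where you land against those targets is the $files metadata table, and compaction can be run with the optimize table procedure. This is just a sketch; the table names are hypothetical, and exact behavior depends on your Trino/Starburst version.

```sql
-- Inspect current data file sizes for an Iceberg table (hypothetical table name).
SELECT file_path, file_size_in_bytes
FROM iceberg.sales."orders$files"
ORDER BY file_size_in_bytes;

-- Compact data files smaller than the threshold into larger ones.
ALTER TABLE iceberg.sales.orders EXECUTE optimize(file_size_threshold => '100MB');
```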

I know I’m rambling, but I could ramble even more on this. Sounds like I need to host another Live with Lester session for some open discussion on stuff like this. :wink:

And to end my rambling, I can’t leave you thinking buckets work the same way when you are talking Hive and Iceberg. Iceberg implements bucketing differently (basically as a flavor of partitioning), as you can see somewhere around minute 40 in the video referenced in the middle of this next blog post.
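
For reference, here is roughly what that looks like in the Trino Iceberg connector: bucketing is expressed as a bucket() partition transform rather than the Hive-style bucketed_by/bucket_count properties. Again, the names below are made up.

```sql
-- Iceberg expresses buckets as a partition transform (hypothetical table).
CREATE TABLE iceberg.sales.orders (
    order_id    bigint,
    customer_id bigint,
    order_date  date
)
WITH (
    partitioning = ARRAY['bucket(customer_id, 16)']
);
```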
