3 Iceberg partitioning best practices to improve performance

Share

Imagine that your desk resembled the above image. Now you need to find all the invoices for a particular month to calculate your average billing amount. Without any organizational scheme, you would have to review every single paper and understand its relevance. Now imagine you had the foresight to file invoices in a folder scheme based on the invoice date. Your problem would be much easier to solve as you could just pull a monthly folder to calculate the average amount. No need to search through all the information you don’t need.

Data lake tables are conceptually similar. They allow a location (i.e. a folder) full of files, with the same type of information and file format, to be queried like a classic database table. Metadata that identifies the columns of the table, the folder where the data is stored, and the file format being utilized, are the additional ingredients that create the facade of a database table.

This “schema on read” strategy offers flexibility and scalability — the more files in the folder, the more records in the table. The problem is that as it scales, performance can be negatively affected. This is because more and more data will have to be reviewed, which takes additional resources before the final result of a query can be returned.

What is partitioning and how does partitioning help?

There are multiple approaches to improving performance & scalability, but partitioning is the most mature and widely used. Partitioning creates subdirectories underneath a table’s main data lake location and saves files into this hierarchy. The primary benefit of partitioning is that a query will only read a small percentage of subfolders, instead of all your files.

For example, if you knew the majority of queries run against the server_logs table were only targeting a single app_name value, then you could create a hierarchy for your data lake, as seen below.

It is easy to see which subdirectory to look in if you are only interested in ERP server logs. It is also straightforward to notice you would only have to review 3285 files instead of the full 10,000 that the above diagram shows. This is only a third of the files and might not appear to be a big help (it actually is), but what if there were 100 app_name values in the table and they were evenly distributed? In that situation, you would look at about 1% of the files when researching a specific application’s server log records, instead of all the files available to you. 

How to implement Iceberg partitioning?

It’s one thing to visually navigate the partitions, and another to write a SQL command to select the proper location. How do you indicate to the query engine to only read a single folder instead of all the files? The answer is NOT in the queries themselves. It is in the setup of the table in the first place.  You can enhance the CREATE TABLE statement to include an additional property in the WITH clause.

partitioning = ARRAY['app_name']

When a query is submitted with a filter aligned with the previously determined “partitioning strategy”, the SQL engine (Trino is what is recommended) performs an operation called “partition pruning” which only reads the appropriate folder as discussed above. 

WHERE app_name ='ERP' 

As the server_logs table grows, hopefully at an exponential rate, you can further enhance your partitioning strategy. If you do not consider scale, you will eventually have the same problem again; searching more and more files than you need to. Let’s say you find out that most queries being run will rarely be looking for data beyond a 45-day period. Fortunately, the  partitioning property can include more than one field, and you can use the date range information to maintain high performance as your data size increases.

Now the table’s base location has an even deeper hierarchy.

The following logical filters could be used to query a 45-day span of ERP server log records.

WHERE app_name = 'ERP'
   AND event_time > 2024-03-16 
   AND event_time < 2024-05-01 

The partition pruning would determine that only the following directories would be read.

Partitioning best practices

Partitioning improves performance & scalability when you know the query filtering that provides the best benefit overall. The biggest drawback is that you can only have a single partitioning strategy. Any edge queries falling outside of the implemented partitioning strategy may take significant time to complete. Here are some best practices to help you define partitioning in your tables and increase performance.

1. Make sure your table’s data is big enough to need partitioning

Visualizing a typical star schema will provide you with a great example.

  • The fact table at the center usually has a significant volume of data with new records being added at a high velocity. The suggestion isn’t that tables smaller than a TB would not benefit from partitioning. However, tables that size and larger will significantly improve performance when an appropriate partitioning strategy is defined.
  • The dimensional (lookup) tables are almost always significantly smaller and grow at a dramatically smaller rate. They are often measured in MBs or even into the double-digit GB ranges. These tables won’t necessarily benefit from partitioning in most cases.

2. Pick an efficient anchor—partitioning works best on low cardinality columns that have fairly uniform distributions

Ideally, a new partition is created periodically, data additions/modifications occur for a predictable period, and finally, the partition itself becomes static. This process repeats itself over and over. Here are a couple of examples.

  • For time-series immutable data usually queried by a specific date range (such as the server logs example in this blog post), partitioning on some derived element of the event’s timestamp is perfect.
  • Imagine a table designed to represent a very large number of records linked to a given test case. Assume a single-digit number of new test cases start each day and each runs for a finite duration ranging from a few days to a few months. Querying is primarily done against a specific test case and includes all records within it. In this scenario, partitioning on the test case identifier would be a great solution.

3. Don’t be too fine-grained in the partitioning strategy which could yield having too little data within each partition

Since the query engine will very often look at the files within a complete partition folder, you should imagine this level of granularity as similar to a non-partitioned table. Query engines like Trino thrive on quickly attacking big data query requests. Here are two very common mistakes.

  • When you have scenarios with more than a single column in the partitioning strategy, you should investigate thoroughly. In the server_logs example, it is likely that only partitioning on some derivative of the event timestamp would be a better solution even though many queries might have to look past other applications’ log records than it is searching for.
  • For time-series immutable data, the partition strategy often works best on a derivative of the timestamp. Teams often select a day-level partition to logically align with their filter granularity. The problem is that their day-level data is often not big enough to be efficient. As with the example presented earlier, defining at a month-level means in many cases more data is looked at, but the economies of scale of these big data query engines will often yield better performance.

Hidden partitioning and partition evolution: Not all table formats are the same

All table formats support partitioning, but not all include the same features. When selecting a table format and a query engine, be sure to factor these topics into your decision.

Partitioning is best when it is something users do not have to think about. Apache Hive query writers need to be aware of which column is being used for partitioning and include it in their SQL statement. This is an especially confusing situation when partitioning on a derivative of a timestamp (such as a month) because you end up with two different columns. Apache Iceberg features a concept called hidden partitioning which further includes transform functions to eliminate these well-known issues with Hive.  

Having a high percentage of “small files” that are < 100MB in size will cause all query engines performance problems at scale. Due to batch ingestion pipelines being scheduled quicker and quicker, as well as more and more streaming ingestion pipelines, these small files will be created. The modern table formats (which include Iceberg) all offer compaction tools to resolve this problem that was never formally addressed in Hive. These compaction functions rewrite & replace a large number of small files with a smaller number of larger files.

As indicated earlier, the biggest drawback is that you can only have a single partitioning strategy. If you needed to change this, you would need to define a new table with an alternative partitioning strategy and then rewrite the files by inserting the original table’s records into the new table. This is usually very difficult once your tables reach a certain size and scale. It is expensive to perform and you must maintain two copies of the data until you drop the original table. Fortunately, Iceberg supports partition evolution, allowing you to change the partitioning strategy on a table without requiring a rewrite of your data.

Conclusion: Apache Iceberg is the best table format when considering its partitioning approach and features

For your large tables that continue to grow, identifying an appropriate partitioning scheme turns that desk full of papers into a finely organized filing system. More importantly, it helps you get the answers you need much faster and using fewer resources thereby allowing your query engine to run more concurrent queries.

Apache Iceberg is the best table format when considering its partitioning approach and features. Iceberg is a specification, not necessarily a product. This means it is up to query engines to implement the features. Data lake query engines & cloud data warehouse vendors realize the importance of Iceberg and are rushing forward to support, at least in some small way, Iceberg with their products and offerings. 

Just be sure to select a vendor who is implementing Iceberg correctly and comprehensively as some have chosen to initially launch their Iceberg integrations without any support for partitioning, much less needed maintenance features like compaction described above.

To get the most of Iceberg, consider implementing an Icehouse which is an open lakehouse with two key ingredients – Trino as the open query engine and Apache Iceberg as the open table format.