Top 5 reasons to not adopt Apache Iceberg

If it ain't broke, don't fix it


Apache Iceberg is becoming increasingly popular and is turning into the de facto standard for table formats. The first-ever Iceberg Summit was held this month, and all the major data platform vendors have committed to supporting the standard. While there is no shortage of reasons to adopt Apache Iceberg, it is also important to consider why it might not be the right fit for your data strategy right now.

The top 5 reasons to NOT adopt Iceberg:

1. External ingestion

Hive includes a concept known as external tables. Tables defined this way allow external systems to add and remove immutable files in a defined location on the data lake. Hive does not have to be aware of the additions and deletions as they happen; it simply reads whatever is present when the table is queried. Iceberg does not allow this EXTERNAL configuration, instead requiring first-class SQL operations when loading, modifying, or deleting records.

Fortunately, many systems that allow other systems to populate these external tables do so as the start of a data pipeline. In those models, the external tables are generally used as short-term staging zones. The pipeline frequently runs an INSERT statement that moves the staged data into a more structured table (often after performing quality and enrichment transformations) that houses all the data for the given table’s focus. This structured table is defined using more sophisticated, performance-optimizing characteristics, such as a columnar file format and an appropriate partition strategy.

In pipelines like these, it is acceptable to leave the staging table as it is while focusing efforts on ensuring the structured table is as optimized as possible. The structured table is the one to consider moving from Hive to Iceberg.
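The staging-then-load pattern described above can be sketched in a few lines. This is only an illustration: Python's built-in sqlite3 module stands in for the query engine, and the table names, columns, and quality rule are hypothetical. In a real data lake, the staging table would be the Hive external table and the target would be the structured table that is a candidate for Iceberg.

```python
import sqlite3


def load_structured_from_staging(conn: sqlite3.Connection) -> None:
    """Copy cleaned, enriched rows from the staging table into the structured table."""
    conn.execute(
        """
        INSERT INTO orders_structured (order_id, amount_usd, order_date)
        SELECT order_id,
               ROUND(amount_cents / 100.0, 2),  -- enrichment: cents to dollars
               order_date
        FROM orders_staging
        WHERE order_id IS NOT NULL              -- quality rule: key must exist
          AND amount_cents >= 0                 -- quality rule: no negative amounts
        """
    )
    conn.commit()


# Hypothetical staging data, as if dropped there by an external system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_staging (order_id INTEGER, amount_cents INTEGER, order_date TEXT)"
)
conn.execute(
    "CREATE TABLE orders_structured (order_id INTEGER, amount_usd REAL, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders_staging VALUES (?, ?, ?)",
    [
        (1, 1999, "2024-05-01"),
        (None, 500, "2024-05-01"),  # rejected: missing key
        (2, -10, "2024-05-02"),     # rejected: negative amount
        (3, 250, "2024-05-02"),
    ],
)

load_structured_from_staging(conn)
rows = conn.execute(
    "SELECT order_id, amount_usd, order_date FROM orders_structured ORDER BY order_id"
).fetchall()
print(rows)
```

The staging table is left untouched by the load, which mirrors the point that only the structured target needs to change format.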

If the external table’s entire pipeline is made up of other systems loading files into the data lake, then the table cannot be moved to Iceberg. In a scenario like this, there are likely performance and scalability issues that need to be addressed. When that work is tackled, adopting the Iceberg table format as part of it is recommended.

Lastly, an existing external table whose ingestion happens through SQL, and not through external systems loading files, can still be migrated to Iceberg. It will lose the ability to accept externally loaded files once migrated, so review each external table carefully to fully understand its ingestion process.

It should be noted that a successful effort to adopt Iceberg as the table format of choice does not mean there cannot be other table formats in use. This staging scenario is a good example of this. Additionally, the tables that matter the most are the ones of significant size & scale that are the target of the most resource-intensive querying in your enterprise. It is recommended that any migration effort focuses on these tables first.

2. Compliance & compatibility issues

Every enterprise is different and operates under varying constraints. Some may find themselves in a situation where a new technology requires internal and/or external approval before it can be used. This could be as simple as Apache Hive having already been approved as an option while Apache Iceberg still needs such approval. A more complex example would be a larger adoption effort that includes moving from an on-prem HDFS implementation to a cloud provider’s object storage technology. These are just indicative examples of roadblocks that may hinder, or stop, a migration effort.

There may also be compatibility issues with other tools, frameworks, processes, or systems that might only surface once an effort to adopt Iceberg is underway. This is not likely to occur when using Trino-based solutions like Starburst Galaxy and Starburst Enterprise due to the single point of access model that decouples clients from the underlying data access layer. 

These roadblocks would have to be remediated and removed before a full adoption of Iceberg could occur. As these scenarios are likely rare for most, a migration to Iceberg for the majority of the tables is still encouraged. This is especially true for the largest and most heavily queried tables in the enterprise.

3. Still on HDFS

While the Hive connector does not require a Hadoop infrastructure to query tables in this format, many enterprises still have the underlying data files for these tables stored on the Hadoop Distributed File System (HDFS). The Hive table format maintains metadata about files by storing the directories they are stored in. At runtime, the processing engine has to list the contents of these folders to get the names of all the files needed.

This list operation is easily tackled by HDFS, but is expensive against object storage. For this reason, Iceberg tracks the needed files for each version of the table in the metadata files themselves. At Iceberg Summit 2024, CrowdStrike presented the massive reductions in their AWS S3 GET/LIST operations shown below.

Running Iceberg on HDFS does not incur a penalty comparable to the one Hive pays on object storage. Therefore, it is still recommended to move to Iceberg even if the underlying data lake storage remains in a legacy Hadoop infrastructure. However, if there is a Hadoop migration in your future, it would be more appropriate to adopt Iceberg along with, or shortly after, such a migration effort.
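The difference between discovering files by listing directories and reading them from table metadata, as described in this section, can be made concrete with a toy model. Everything here is hypothetical: `ToyObjectStore` is an in-memory stand-in rather than a real storage client, and the sketch ignores the small number of reads needed to fetch the Iceberg metadata files themselves.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObjectStore:
    """In-memory stand-in for object storage that counts LIST calls."""

    objects: set = field(default_factory=set)
    list_calls: int = 0

    def list_prefix(self, prefix: str) -> list:
        self.list_calls += 1
        return sorted(key for key in self.objects if key.startswith(prefix))


def plan_hive_style(store: ToyObjectStore, partition_dirs: list) -> list:
    """Hive-style planning: list every partition directory to discover its files."""
    files = []
    for directory in partition_dirs:
        files.extend(store.list_prefix(directory))
    return files


def plan_iceberg_style(tracked_files: list) -> list:
    """Iceberg-style planning: file paths come from table metadata, so no LIST is issued."""
    return sorted(tracked_files)


# Three hypothetical partitions with two data files each.
dirs = [f"sales/day=2024-05-0{d}/" for d in (1, 2, 3)]
store = ToyObjectStore(objects={f"{d}part-{i}.parquet" for d in dirs for i in (0, 1)})

hive_files = plan_hive_style(store, dirs)
calls_after_hive = store.list_calls

# In this toy model the metadata simply records the same six paths at write time.
iceberg_files = plan_iceberg_style(list(store.objects))
calls_after_iceberg = store.list_calls

print(calls_after_hive, calls_after_iceberg - calls_after_hive)
```

In this model the number of LIST calls for the Hive-style plan grows with the number of partitions touched, while the metadata-driven plan issues none.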

4. Wrong use cases

Some enterprises might not view the following features as relevant enough to justify adopting a new table format.

  • ACID transactions for
    • INSERT, UPDATE, DELETE, and MERGE statements
    • File compaction operations
  • Versioning benefits of
    • Time-travel querying
    • Table rollbacks
  • Full schema evolution
  • Hidden partitioning
  • Partition evolution

Even without considering the performance & scalability benefits, these inherent features should be reviewed again as they are a giant step forward for adaptability & usability from Hive and should not be casually dismissed.
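Of the features listed, hidden partitioning is perhaps the least self-explanatory. The idea is that the partition value is derived from a regular column by a transform, so writers and readers only ever refer to the column itself. The following is a conceptual sketch, not Iceberg's API: the function names are hypothetical, timestamps are assumed to be naive UTC values, and the transform mirrors the Iceberg specification's day transform (days since 1970-01-01).

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def day_partition(event_time: datetime) -> int:
    """Derive the partition value from the column: whole days since 1970-01-01."""
    return (event_time.date() - EPOCH).days


def partitions_to_scan(start: datetime, end: datetime) -> range:
    """Partition values that a filter start <= event_time <= end can touch."""
    return range(day_partition(start), day_partition(end) + 1)


# Writers supply only event_time; the partition value is computed for them.
events = [
    datetime(2024, 4, 30, 23, 59),
    datetime(2024, 5, 1, 8, 15),
    datetime(2024, 5, 1, 21, 40),
    datetime(2024, 5, 2, 0, 5),
]
written = {}
for ts in events:
    written.setdefault(day_partition(ts), []).append(ts)

# Readers filter on event_time; the engine maps the predicate onto partitions.
wanted = partitions_to_scan(datetime(2024, 5, 1, 0, 0), datetime(2024, 5, 1, 23, 59, 59))
scanned = [p for p in sorted(written) if p in wanted]
print(scanned)
```

With a Hive table, the query author would instead have to know about, and filter on, a separate partition column to get the same pruning.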

5. No performance/scalability concerns

With multiple performance & scalability-oriented features built into Iceberg, it is almost guaranteed that tables will perform better using the Iceberg table format than they do with Hive. It is also understood that many of the tables implemented with Apache Hive are performing and scaling well today. This often drives the underlying desire to declare that if it ain’t broke, don’t fix it.

If that’s the case, the real decision is when, not if, to adopt Iceberg for your large-scale & heavily queried tables. For some scenarios, that point may be so far in the future that a migration effort for this reason alone is not warranted. Again, consider each of your large-scale tables individually when deciding whether to hold off on Iceberg. Revisiting the features previously mentioned should also factor into this decision.

While we are not suggesting that Hive’s time has already passed, it is also clear that all net-new efforts should use Iceberg from the start. This immediately marks existing Hive tables as technical debt that will eventually need to be addressed. Especially in the data lake analytics field, technology marches on.

Hive to Iceberg migration decision tree: Putting it all together

If moving from Hive to Iceberg makes sense from a benefits perspective, the next step is to plan the effort and resources that will be required. The following flowchart pulls together the criteria previously presented to aid in your decision to adopt, or not adopt, Iceberg.
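For readers who prefer code to diagrams, the five criteria can also be written as a small per-table function. This is one possible reading of the reasons discussed in this article, not a transcription of the flowchart: the field names, the order of the checks, and the wording of the recommendations are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TableProfile:
    """Hypothetical per-table answers to the five questions raised in this article."""

    approval_or_compat_blocker: bool = False      # reason 2
    loaded_by_external_file_drops: bool = False   # reason 1
    is_short_term_staging: bool = False           # reason 1
    storage_migration_planned: bool = False       # reason 3
    needs_iceberg_features: bool = False          # reason 4
    large_and_heavily_queried: bool = False       # reason 5
    has_perf_or_scale_issues: bool = False        # reason 5


def recommend(table: TableProfile) -> str:
    """Return a rough recommendation for a single Hive table."""
    if table.approval_or_compat_blocker:
        return "resolve blockers first"
    if table.loaded_by_external_file_drops:
        if table.is_short_term_staging:
            return "keep on Hive as staging"
        return "rework ingestion, then migrate"
    if table.storage_migration_planned:
        return "migrate with or shortly after the storage migration"
    if (
        table.has_perf_or_scale_issues
        or table.needs_iceberg_features
        or table.large_and_heavily_queried
    ):
        return "migrate to Iceberg"
    return "defer; revisit later"


staging = TableProfile(loaded_by_external_file_drops=True, is_short_term_staging=True)
fact_table = TableProfile(large_and_heavily_queried=True, has_perf_or_scale_issues=True)
print(recommend(staging))
print(recommend(fact_table))
```

Evaluating each table separately matches the advice above to start with the largest, most heavily queried tables rather than treating the migration as all-or-nothing.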
