Big data has been with us for a couple of decades. In that time, it has already moved through several distinct eras, each one facilitating a tectonic shift in the wider data landscape. I’ve seen it firsthand in my work with Trino and other projects. Lately, things have felt like they’re reaching a tipping point, with a new approach taking hold.
How did we get here? We can divide the history of big data into the following eras.
| Era | Technology | Characteristics |
|---|---|---|
| 1 | Hadoop Distributed File System (HDFS) and MapReduce | The very beginning of big data as we know it. Datasets can reach significant size for the first time, but querying requires writing MapReduce jobs, which makes lookups slow and inflexible. |
| 2 | Hive | Built on top of Hadoop, Hive introduces a SQL interface, vastly improving usability and expanding the value of big data. Hive is not designed for frequently updated or transactional data. Even with the addition of ORC ACID, Hive tables are only partially ACID compliant and handle updates and deletes poorly. |
| 3 | Iceberg and Delta Lake | The modern data lakehouse formats of the present day. They add an extra layer of metadata that provides better performance and fully reliable ACID guarantees. |
Displacing Hive’s data gravity
If Hive taught us nothing else, it’s that data has gravity, and getting enough momentum to switch between table formats is difficult.
While I don’t have exact numbers to reference here, it should go without saying that the amount of data currently stored in the Hive table format across all organizations is staggering. Hive has been the dominant table format used in big data for some time, and most traditional data lakes running today use Hive. Even as the Hive query engine has largely been phased out in favor of modern tools, the Hive table format has remained.
However, we’ve been seeing a trend recently. Modern table formats like Iceberg and Delta Lake are starting to make a significant dent in Hive’s dominant position. As data engineering organizations finally make this migration from Hive, they need to make the right choice for what’s next. This signals a transition not only in the technologies being used, but in the data gravity surrounding those technologies.
Similarities between Iceberg and Delta Lake
Before addressing why I believe Iceberg will be the dominant table format of the next phase of big data, it’s worth pointing out the ways that the two are similar.
Fundamentally, Iceberg and Delta Lake are both organizational layers for managing Parquet files. Both add similar levels of ACID compliance when updating data in those files, and both let your schema evolve without having to rewrite those files. Development on both projects is happening at breakneck speed.
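To make the schema-evolution point concrete, here’s a minimal sketch in Java using the Iceberg API. The catalog address and the analytics.events table are hypothetical placeholders; the point is that adding a column is a metadata-only commit, so the existing Parquet files are left untouched.

```java
// Minimal sketch: schema evolution on an Iceberg table via the Iceberg Java API.
// The catalog URI and the analytics.events table are placeholder assumptions.
import java.util.Map;

import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.rest.RESTCatalog;
import org.apache.iceberg.types.Types;

public class SchemaEvolutionSketch {
  public static void main(String[] args) {
    // Connect to a (hypothetical) Iceberg REST catalog.
    RESTCatalog catalog = new RESTCatalog();
    catalog.initialize("demo", Map.of(CatalogProperties.URI, "http://localhost:8181"));

    Table table = catalog.loadTable(TableIdentifier.of("analytics", "events"));

    // Add a new column. Only table metadata changes; no Parquet data files are rewritten.
    table.updateSchema()
        .addColumn("referrer", Types.StringType.get())
        .commit();
  }
}
```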
So if Iceberg and Delta Lake are so similar, how does Iceberg come out on top? The answer is interesting.
Here are the 8 reasons why all roads lead to Apache Iceberg tables:
1. Delta Lake can’t be the default, even with Databricks’ acquisition of Tabular
The key difference between Apache Iceberg and Databricks’ Delta Lake comes down to ecosystem. Delta Lake is, and always will be, designed as the storage layer for a Databricks environment.
This means that it’s optimized for one company’s technology above all else, which makes it hard for Delta Lake to shift the data gravity away from Hive: it doesn’t have broad, cross-platform adoption. Databricks is a powerful player, but it’s just one participant in a competitive ecosystem. While Starburst Galaxy and Starburst Enterprise both support Delta Lake at scale, very few other platforms do.
Moreover, even with Databricks’ acquisition of Tabular, it’s important to keep in mind that Iceberg is a community-driven project. Its top committers come from a range of companies, including Apple, AWS, Alibaba, and Netflix. This is in stark contrast to Delta Lake, which is effectively an open-sourced Databricks project. Iceberg is not Tabular, and Tabular is not Iceberg.
Apache Iceberg’s open ecosystem
Unlike Delta Lake, Apache Iceberg was designed to create an open lakehouse platform from the very beginning. The keyword here is “open”, a theme that runs through Iceberg’s current popularity.
2. Apache Iceberg’s open development strategy
Iceberg’s development strategy is a clear example of that openness. Any system working with Iceberg tables, whether Spark, Trino, Snowflake, or a standalone application, runs a shared set of core libraries called Iceberg Core. This library goes far beyond a simple client; it includes the ability to run complex operations on Iceberg tables.
Importantly, this allows the features that go into Iceberg to be written with a core-first approach. Because of that approach, new code in core becomes available to developers working on Spark, Trino, Flink, Snowflake, Dremio, and other tools all at the same time. This makes it very easy for tools to incorporate new features from the Iceberg specification into their systems.
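As a rough illustration of that core-first model, here’s a minimal sketch of a standalone Java program that uses only the Iceberg core libraries, with no Spark, Trino, or Flink involved, to inspect a table. The catalog address and the analytics.events table name are assumptions for the example.

```java
// Minimal sketch of the core-first idea: a standalone program built on the Iceberg
// core libraries reads the same table metadata any engine would see.
// The catalog URI and table name are placeholder assumptions.
import java.util.Map;

import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.rest.RESTCatalog;

public class StandaloneInspector {
  public static void main(String[] args) {
    RESTCatalog catalog = new RESTCatalog();
    catalog.initialize("demo", Map.of(CatalogProperties.URI, "http://localhost:8181"));

    Table table = catalog.loadTable(TableIdentifier.of("analytics", "events"));

    // Print the current schema and the snapshot history straight from table metadata.
    System.out.println("Schema: " + table.schema());
    for (Snapshot snapshot : table.snapshots()) {
      System.out.printf("snapshot %d at %d (%s)%n",
          snapshot.snapshotId(), snapshot.timestampMillis(), snapshot.operation());
    }
  }
}
```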
3. Delta Lake inhibits openness
Delta Lake, on the other hand, treats shared code as an afterthought. The specification itself is public, but code is very difficult to share between Delta projects as it is always built with Spark in mind.
Databricks has tried to address this over the years with the Delta Standalone project and, more recently, Delta Kernel. However, these projects have only ever provided a fraction of Delta Lake’s features because Databricks lacks the core-first approach Iceberg has. Databricks is simply not incentivized to slow down development on its own product to make Delta Lake easier to use for other tools.
4. Apache Iceberg’s design philosophy
So, what’s the point? In my experience, it comes down to a philosophy of openness versus a tendency towards insularity. It’s the key differentiator that lets open source projects thrive, and it’s set to propel Iceberg beyond Delta Lake because it fosters a community of collaborators.
This positions it perfectly to be the leader in the current era of data, which is characterized by heterogeneity, flexibility, and plurality. This question of openness is the main reason that Iceberg has contributions from a wide range of tech companies while Delta Lake is maintained by only one.
Apache Iceberg was built for Trino
With all this talk of openness, it’s easy to forget one thing that is unique to Iceberg: a direct connection to Trino. This is something I almost take for granted as a Trino contributor, but it’s one of the key differences.
5. Why Iceberg belongs with Trino
When Netflix first introduced Iceberg, it was designed for open source Trino. This merged two of the newest and most innovative technologies in the data space: Trino + Iceberg.
This combination brought together a highly scalable query engine designed for petabytes of data with a modern, open table format that offers data warehouse-like functionality on data lake object storage. It is the perfect combination, one handling compute and the other storage.
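As a sketch of what that pairing looks like in practice, here’s a small Java program that submits SQL to a Trino cluster over JDBC, with Trino doing the compute and an Iceberg catalog providing the storage layer. The cluster address, the catalog name (iceberg), the analytics schema, and the events table are assumptions, and the Trino JDBC driver is assumed to be on the classpath.

```java
// Minimal sketch: querying an Iceberg table through Trino over JDBC.
// The cluster address, catalog/schema/table names, and user are placeholder assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class TrinoIcebergQuery {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("user", "analyst");

    // Trino provides the compute; the Iceberg tables in object storage provide the data.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:trino://localhost:8080/iceberg/analytics", props);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT event_type, count(*) AS events FROM events GROUP BY event_type")) {
      while (rs.next()) {
        System.out.println(rs.getString("event_type") + ": " + rs.getLong("events"));
      }
    }
  }
}
```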
Why openness saves you money
I’ll be the first to admit that talking about collaboration and sharing as core reasons Iceberg is better than Delta Lake can sound naive. However, what those core tenets of Iceberg really provide is flexibility, and flexibility will save you money.
6. Decoupling saves money
Separation of storage and compute is a perfect example of this. Today it’s the dominant approach to cloud computing, but the approach itself is built on openness. The flexibility to scale storage up and down independently of your compute resources, based on demand, was an innovation that brought huge cost savings. You are no longer locked into having too much compute or too much storage at a given moment; instead, you can have just what you need.
7. Apache Iceberg builds an open data stack
Iceberg takes this invention a step further, giving you the ability to pick the cheapest type of compute for each task.
Want to use one tool for ingest, another for running SQL queries, and a third one for data management? You can do that.
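For instance, the data-management piece can be a small standalone job rather than a feature of whichever engine runs your queries. Here’s a minimal sketch of snapshot expiration using the Iceberg Java API; the catalog address, table name, and seven-day retention window are assumptions for illustration.

```java
// Minimal sketch: table maintenance (snapshot expiration) as its own small job,
// independent of the tools used for ingest and SQL.
// Catalog URI, table name, and retention settings are placeholder assumptions.
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.rest.RESTCatalog;

public class MaintenanceJob {
  public static void main(String[] args) {
    RESTCatalog catalog = new RESTCatalog();
    catalog.initialize("demo", Map.of(CatalogProperties.URI, "http://localhost:8181"));

    Table table = catalog.loadTable(TableIdentifier.of("analytics", "events"));

    // Expire snapshots older than a week, keeping at least the last five, so stale
    // metadata and unreferenced data files can be cleaned up.
    long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
    table.expireSnapshots()
        .expireOlderThan(cutoff)
        .retainLast(5)
        .commit();
  }
}
```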
Contrast that with all-in-one platforms like Databricks and Snowflake, where the only way to do anything is to use the tools they provide. When those tools get more expensive, well, you just have to pay more.
8. Openness gets results
But you don’t have to take my word for it. Here’s a quote from Yello Senior Data Engineer Peter Lim, who shares his perspective on how their migration from a legacy data warehouse to Iceberg went.
“Starburst and Iceberg have enabled us to be tool-agnostic, avoiding vendor lock-in. The federated approach and open format have allowed us to create microservices on the data side, providing flexibility and preventing dependency on a single vendor for visualizing our data.
As a result, we reduced our compute costs by 12X. This efficiency allows us to focus our attention on using analytics for revenue-generating opportunities.”
The Icehouse era
Today, this combination is finally being recognized with growing talk about the data Icehouse. It’s one of the main reasons that both Starburst Galaxy and Starburst Enterprise offer an amazing experience on Iceberg. Both make use of the Trino engine at their core. This continues the Trino + Iceberg story but adds more usability and a SaaS platform into the mix.
It’s a compelling idea, and brings together all of the elements that I’ve been talking about here.
Optimize your data lake with Iceberg and Trino
Explore the future of data lakes through our comprehensive technical how-to whitepaper.