Why Apache Iceberg will accelerate competition for compute engines

Apache Iceberg emerged last week triumphant, having won the race to become king of the data lakehouse. 

In many ways, this was a long time coming.

No longer the heir presumptive of the table format world, no longer the interesting new technology worth exploring, Iceberg is now the reigning monarch of the open data lakehouse following simultaneous adoption from both Snowflake and Databricks.

It’s an idea whose time has come at last, and a shift that we’ve written about and anticipated before, especially given Iceberg’s close connection to Starburst and Trino.

But no matter what data platform you use, the move to Iceberg will have big implications, and its impact will push beyond table formats alone. In fact, the openness that Iceberg facilitates promises to kickstart an arms race for compute engines capable of operating at scale on open data stacks.

May the best compute engine win? 

Open compute engines for the open data stack

In this sense, the more things change, the more they stay the same. 

Just as the industry converges on an open table format, a new battle begins around open compute. The king is dead, long live the king.

In an open architecture where everyone is using Iceberg, the next logical question to ask is: which compute engine performs best with Iceberg? 

Here, Iceberg’s openness promises to set up a more direct comparison between compute engines set on an equal footing for the first time. This raises one fundamental question. 

Have open table formats given rise to open compute? 

Read on. 

How Apache Iceberg won the table format race

Let’s answer the question by looking back at how Iceberg won the table format race in the first place. 

Last week was a watershed moment, but Apache Iceberg isn’t new. In fact, it had already been talked about across the industry for some time, including by Starburst.

What Iceberg really did last week was win over the holdouts, namely Snowflake and Databricks. 

Let’s look at how each embraced openness one by one. 

Snowflake Polaris

The first win for openness was Snowflake’s Polaris announcement. It outlined the creation of their own REST catalog centered on Iceberg. This opens the Snowflake platform, famous for its proprietary, closed approach, to direct compute competition for the first time, including from Trino. This is a seismic event, and one that promises to have many Snowflake users asking where they can get the best compute costs.

Databricks and Tabular

The second win for openness was Databricks’ acquisition of Tabular, a move largely designed to secure a foothold in the Iceberg ecosystem and integrate a REST implementation of Iceberg into their Unity Catalog. The move promises to create a single catalog access level across Delta Lake and Iceberg. This will certainly change things for Databricks, which until now has always championed Delta Lake.

Apache Iceberg > Delta Lake

One way of interpreting this news is that Delta Lake has lost, or at least been relegated to second place. 

In many ways, this was Delta Lake’s fight to lose. 

Not long ago, it seemed like Delta Lake would win this battle and be the default table format for the data lakehouse era, the one that would unseat the data warehouse from its traditional place of dominance. But even though Delta Lake had many of the same features as Iceberg, its technological DNA was too bound up with a single ecosystem, Databricks. 

It lacked a dynamic, open-source community backing it, and it didn’t fit well inside an open data architecture. Starburst Galaxy accessed it alongside Iceberg, but not all the engines could.

The Apache Iceberg era

In this sense, Iceberg won because it provided a combination of attributes that other table formats couldn’t–functionality approximating a data warehouse provided on an open data stack that championed the separation of storage and compute.

Both of these approaches will define the Apache Iceberg era. 

Let’s look at them one by one. 

Iceberg lets the data lakehouse beat the data warehouse

Iceberg won because it has fewer drawbacks compared to traditional data warehouses. In fact, its enhanced metadata collection allows for features that are usually only associated with data warehouses, including ACID compliance, support for transactional data, time travel, and schema evolution. 

This removes what had been a tradeoff between the high cost and superior features of a data warehouse and the low cost but less feature-rich data lake. With Iceberg you don’t have to choose, and that’s a big deal. It allows businesses to embrace cloud object storage, which is much cheaper than other storage methods, and swap in whatever compute engine works for them.
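To make the metadata-driven features above a little more concrete, here is a toy sketch in Python of how an append-only log of snapshots can give you time travel and schema evolution. It is only an illustration under simplifying assumptions: the ToyTable and Snapshot classes, their fields, and the file names are all hypothetical, and none of this is the Iceberg API or its actual metadata layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version (hypothetical, for illustration only)."""
    snapshot_id: int
    schema: Tuple[str, ...]      # column names valid for this version
    data_files: Tuple[str, ...]  # illustrative paths; nothing is read from disk


@dataclass
class ToyTable:
    """Toy stand-in for a table's metadata log. Not the Iceberg API."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Optional[Snapshot]:
        return self.snapshots[-1] if self.snapshots else None

    def commit(self, new_files=(), add_columns=()) -> Snapshot:
        # A commit never edits an earlier snapshot; it appends a new one.
        base = self.current()
        schema = (base.schema if base else ()) + tuple(add_columns)
        files = (base.data_files if base else ()) + tuple(new_files)
        snap = Snapshot(len(self.snapshots) + 1, schema, files)
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id: Optional[int] = None) -> Snapshot:
        # Time travel: pass an older snapshot id to see the table as it was.
        if snapshot_id is None:
            snap = self.current()
            if snap is None:
                raise ValueError("table has no snapshots yet")
            return snap
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap
        raise KeyError(f"unknown snapshot id: {snapshot_id}")


table = ToyTable()
table.commit(new_files=["orders-0001.parquet"], add_columns=["order_id", "amount"])
table.commit(new_files=["orders-0002.parquet"])
table.commit(add_columns=["currency"])  # schema change touches metadata only

print(table.scan(1).data_files)  # the table as of the first commit
print(table.scan().schema)       # the latest schema
```

Because every commit appends a new version rather than rewriting an old one, readers of an earlier snapshot are never disturbed by later writes, which is the intuition behind getting warehouse-style guarantees on top of plain files.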

Iceberg is open

Although every table format makes claims to being open in one way or another, Apache Iceberg embraced an open data architecture at its core. 

This is an important distinction. Iceberg isn’t just open-sourced. It represents a paradigm shift away from the model of a monolithic data platform providing an end-to-end experience, towards a model where data pipelines are composed of interoperable components, swapped out as needed. You can see this in the wide range of contributors to the Apache project, from AWS to Starburst to Apple.

Now, even the two most monolithic data platforms in the industry–Snowflake and Databricks–have also embraced an open approach. This move opens up the data stack in a way that few other moves could. It repositions the whole industry for a new, reinvigorated competition between compute engines, each vying to process the workloads of Iceberg tables. 

The new battleground over open compute engines

Iceberg levels the playing field and will encourage many organizations to adopt an open data stack. Its openness can’t be worked around and can’t be avoided. The big players have had to embrace it. 

Why consensus on table formats means more competition on compute

This opens up a new front going forward. If everyone is using Iceberg, then everyone is using some version of an open data stack, and that means the ability to swap compute engines in or out.

The question then becomes: which compute engine will come to dominate? This is the question that many businesses will ask themselves in the coming months, and it signals the next big fault line in the data industry.

Which compute engine works best on Iceberg?

To answer the question, you have to look at Iceberg’s architecture and ask which technologies work best in conjunction with it. Trying to get a leg up in this space is almost certainly why Databricks acquired Tabular, but it’s likely to be a far more open competition than the battle between a few platforms. 
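If every engine reads the same Iceberg tables, comparing them becomes a matter of running the same query against each one and timing it. Here is a minimal sketch of such a harness in Python. The compare_engines function and the stub engines are hypothetical; in practice each callable would wrap a real client for the engine you want to test, and a serious benchmark would also control for caching, concurrency, and cluster size.

```python
import statistics
import time
from typing import Callable, Dict


def compare_engines(
    engines: Dict[str, Callable[[str], object]],
    query: str,
    runs: int = 5,
) -> Dict[str, float]:
    """Return the median wall-clock seconds each engine takes to run `query`."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    results: Dict[str, float] = {}
    for name, run_query in engines.items():
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            run_query(query)  # same query text, same table, for every engine
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)
    return results


# Stub engines standing in for real clients (assumption: replace with your own).
stub_engines = {
    "engine_a": lambda sql: sum(range(10_000)),
    "engine_b": lambda sql: sum(range(100_000)),
}

for name, seconds in compare_engines(stub_engines, "SELECT count(*) FROM orders").items():
    print(f"{name}: {seconds:.6f}s")
```

The median is used rather than the mean so that a single slow run, such as a cold start, doesn't skew the comparison.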

Why Starburst Galaxy performs best on Iceberg

All of this is good news for Starburst Galaxy. It was designed to compete in exactly these conditions: an open data architecture built on Apache Iceberg that also supports Delta Lake, Hudi, and Hive.

Trino

One big reason for this is Trino, the engine that powers Starburst Galaxy. When Apache Iceberg was developed at Netflix, it was originally built to run on Trino, and the two technologies have been closely linked ever since.

As more of the industry embraces Iceberg, Trino is well-positioned to move into this evolving space. Compared to other compute engines, Trino works particularly well with Iceberg, and the connection between the two technologies runs deep.

Icehouse Architecture

Starburst has been so excited about the combination of Trino and Iceberg that we’ve even given it a special name: the Icehouse.

An Icehouse architecture consists of Trino and Iceberg at its core but also includes four key components using Starburst Galaxy: 

  • Data ingestion
  • Data governance
  • Data management 
  • Automatic capacity management 

This perfectly fits the needs of the moment, matching an industry-wide shift towards an open architecture with technology designed to work best on Iceberg.

Open data architecture is the modern data stack

With the whole industry shifting towards a more open data stack using Iceberg, a more natural competition for compute technologies will be the next battleground in the industry. This is a big shift, with big implications that won’t be visible right away. 

But as data engineers survey the landscape and review which technologies offer the best performance on Apache Iceberg, the Starburst Icehouse architecture is perfectly positioned to fill the need created for a compute engine that can scale, handle multiple data sources using data federation, and perform well on Iceberg. 

In the end, the Icehouse may win the compute wars in the same way that Iceberg won the table format wars.