What is the best way to query Apache Iceberg tables?

Strategy

  • Monica Miller, Developer Advocate, Starburst

  • Kyle Payne, Starburst Data


The summer of 2024 has ignited a revolution in the data lake space. As the dust settles, one thing is clear: Apache Iceberg has emerged as the table format of choice across the industry. With this comes a new focus: finding the best way to query those Iceberg tables.

This choice comes with many options. Iceberg is compatible with multiple analytics engines and has been adopted across numerous ecosystems, which can make that freedom of choice a daunting task. The goal is to optimize quantifiable variables like price and performance, along with intangible variables like ease of use and optionality, within one executable plan for your data architecture.

This leaves many wondering, “What is the best way to query my Apache Iceberg tables?”

The power of Apache Iceberg 

Apache Iceberg is a high-performance, open table format that has set a new standard for cloud object storage use, enabling the data lake to achieve efficient data management previously confined to the data warehouse. Built at Netflix in 2017, Apache Iceberg was initially developed for Trino, the underlying query engine powering Starburst.

Importantly, Iceberg allows multiple engines to read and write from the same data, creating a true openness for compute workloads. In this sense, Iceberg becomes a uniform table format, ensuring all engines start on the same footing. As a result, the choice of compute engine becomes pivotal, allowing an opportunity to maximize performance and cost efficiency.
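
To make this concrete, here is a minimal sketch of two different engines reading the same Iceberg table: one path through a Trino cluster (using the trino-python-client library) and one directly through the table’s catalog (using PyIceberg). The libraries, catalog, table, endpoints, and credentials here are our own illustrative assumptions, not details from the benchmark discussed later.

```python
# Illustrative sketch: two different engines reading the same Iceberg table.
# All hosts, catalog names, and table names below are hypothetical placeholders.

import trino                                  # trino-python-client
from pyiceberg.catalog import load_catalog    # pyiceberg

# 1) Query the table through Trino, the SQL engine discussed in this post.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=443,
    user="analyst",
    http_scheme="https",
    catalog="iceberg",
    schema="demo",
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM events")
print("Row count via Trino:", cur.fetchone()[0])

# 2) Read the very same table directly with PyIceberg -- no Trino involved.
#    A hypothetical REST catalog endpoint is assumed here.
catalog = load_catalog(
    "rest_catalog",
    **{"uri": "https://catalog.example.com", "warehouse": "s3://demo-bucket/warehouse"},
)
table = catalog.load_table("demo.events")
df = table.scan().to_pandas()                 # reads the current snapshot as a DataFrame
print("Row count via PyIceberg:", len(df))
```

Because both paths resolve the same Iceberg metadata, each engine sees the same snapshot of the data, which is what makes the choice of compute engine a performance and cost decision rather than a data-migration decision.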

With this new popularity, many cloud data warehouses and compute engines have incorporated varying levels of support for Apache Iceberg to meet community demand. This includes major players like Amazon Athena, Amazon Redshift, Snowflake, BigQuery, and Trino. Databricks recently purchased Tabular, a managed platform developed by the creators of Apache Iceberg; however, the Databricks platform has not yet added native support for Iceberg tables, eliminating this engine from consideration at the time of this analysis.

Not all query engines are created equal. With so many options available, let’s break down the key considerations when evaluating this choice.

Multi-engine support avoids vendor lock-in

Cloud data warehouses like Snowflake, BigQuery, and Amazon Redshift store data in their own native table formats, making it extremely difficult, expensive, and time-consuming to move the data from one storage solution to another. This creates vendor lock-in, which can cost businesses significant time, money, and energy. In many cases, the work required to remove the data from the data warehouse eliminates the option altogether, causing the data to become “stuck” in the existing architecture. Growing data volumes lead to growing data costs, and data teams often grossly underestimate how costly it is to rely solely on a cloud data warehouse.

For this reason, many have explored building a cloud data lakehouse. You can use this architecture on its own or as a complementary solution to your existing cloud data warehouse. In either case, the data lakehouse can provide scale and flexibility that were previously out of reach.

Evaluating Trino for your data lakehouse

If you’re open to evaluating a data lakehouse, the main consideration becomes picking a SQL compute engine. Among the top contenders, Trino-based query engines are leading the charge to become the managed engine of choice for Apache Iceberg, which comes as no surprise given Trino and Iceberg’s rich shared history.

Trino has been closely linked to Iceberg since the inception of the Apache project, as evidenced by Trino adding Iceberg support in 2020. Since then, the communities have worked together to accelerate the adoption of Apache Iceberg as the table format of choice, and companies such as Apple, Netflix, Pinterest, Stripe, and more have publicly discussed utilizing Trino and Iceberg as the foundation for their data lakehouse architecture.

Curious to see how the data lakehouse architecture stacks up against the cloud data warehouse? We ran a controlled experiment comparing the two on price and performance, utilizing Starburst as the managed Trino service.

Benchmark analysis: Evaluating price and performance 

Methodology

As you evaluate cloud data lakehouses and cloud data warehouses, we encourage you to run your own benchmarks. This allows you to use your own SQL and data, and to test for price, performance, and concurrency. If you are exploring Trino or Starburst and looking for help performing your own benchmarks, please feel free to contact us, and we will gladly assist you.
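
If it helps to see what a do-it-yourself benchmark might look like, here is a minimal sketch that times your own SQL against a Trino or Starburst cluster using the trino-python-client library (our choice of driver, not a requirement). The host, credentials, and queries are placeholders to replace with your own.

```python
# Minimal timing sketch for a do-it-yourself benchmark.
# Connection details and queries are placeholders -- substitute your own.

import time
import statistics
import trino  # trino-python-client

QUERIES = {
    "q1": "SELECT count(*) FROM iceberg.demo.events",
    "q2": "SELECT status, count(*) FROM iceberg.demo.events GROUP BY status",
}

conn = trino.dbapi.connect(
    host="trino.example.com", port=443, user="analyst",
    http_scheme="https", catalog="iceberg", schema="demo",
)

def time_query(sql: str) -> float:
    """Run one query to completion and return wall-clock seconds."""
    cur = conn.cursor()
    start = time.perf_counter()
    cur.execute(sql)
    cur.fetchall()          # drain all results so the full query is measured
    return time.perf_counter() - start

for name, sql in QUERIES.items():
    runs = [time_query(sql) for _ in range(3)]   # a few repetitions per query
    print(f"{name}: median {statistics.median(runs):.2f}s over {len(runs)} runs")
```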

Comparing top cloud data warehouses and cloud data lakehouses

A recent benchmark test highlights the price and performance differences between Starburst, Amazon Redshift, Redshift Serverless, BigQuery, Athena, and Snowflake when querying tables with a concurrency of five users. Each query engine used a comparable cluster size so that a similar amount of compute resources was available per cluster. The queries for Redshift Serverless, Redshift, Athena, and Snowflake executed in AWS, whereas BigQuery queries executed in Google Cloud. Starburst Galaxy ran in both AWS and Google Cloud to accurately compare each respective cloud.

Running the experiment

For this experiment, each engine executed a standardized series of queries against the TPC-DS sf1000 dataset, and the full query set constituted one run. Each run was performed 25 times to factor in the benefits achieved with indexing and caching. To evaluate performance on an even footing, all queries ran on Iceberg tables; to truly reflect Iceberg’s multi-engine open standard, the experiment excluded native tables stored in proprietary formats. Cost calculations used publicly available list prices for all vendors.
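
As a rough illustration of this run protocol (not the actual benchmark harness), the sketch below drives a hypothetical Trino or Starburst endpoint with five concurrent users, repeats the full run 25 times, and derives a compute cost from a made-up hourly list price. All names, queries, and prices are placeholders.

```python
# Rough sketch of the run protocol: five concurrent users each execute the
# full query set, and the whole run is repeated to capture indexing and
# caching effects. Connection details, queries, and the rate are placeholders.

import time
from concurrent.futures import ThreadPoolExecutor
import trino  # trino-python-client

TPCDS_QUERIES = ["SELECT 1"]        # replace with your TPC-DS sf1000 query texts
CONCURRENT_USERS = 5
RUNS = 25
CLUSTER_RATE_PER_HOUR = 16.0        # hypothetical list price, USD per cluster-hour

def run_stream(queries: list[str]) -> None:
    """One simulated user running the full query set sequentially."""
    conn = trino.dbapi.connect(
        host="trino.example.com", port=443, user="bench",
        http_scheme="https", catalog="iceberg", schema="tpcds_sf1000",
    )
    cur = conn.cursor()
    for sql in queries:
        cur.execute(sql)
        cur.fetchall()              # drain results so the full query is timed

run_times = []
for _ in range(RUNS):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
        futures = [pool.submit(run_stream, TPCDS_QUERIES) for _ in range(CONCURRENT_USERS)]
        for f in futures:
            f.result()              # surface any query errors
    run_times.append(time.perf_counter() - start)

avg_seconds = sum(run_times) / RUNS
total_hours = sum(run_times) / 3600
print(f"Average run time: {avg_seconds:.1f}s over {RUNS} runs")
print(f"Estimated compute cost at list price: ${total_hours * CLUSTER_RATE_PER_HOUR:.2f}")
```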

Results for price and performance

The results are illuminating, examining price and performance both as independent factors and as an interdependent relationship.

Starburst offers the best combination of price and performance

When comparing engines, it’s best to consider performance and price together for your workloads. If measured independently, you may end up with an engine that costs significantly more for marginal single-digit performance gains, or a cost-effective choice that lacks scale and performance. Instead, assess the longer-term impact of both performance and cost, setting your teams up to deliver insights without breaking the bank.
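
One illustrative way to weigh the two together is a simple cost-per-run figure: runtime multiplied by the hourly list price. The engine names and numbers below are invented placeholders purely to show the arithmetic; they are not results from the benchmark.

```python
# Illustrative arithmetic only: engine names and figures are made-up
# placeholders, not results from the benchmark discussed in this post.

engines = {
    #            average seconds per run, list price in USD per cluster-hour
    "engine_a": {"runtime_s": 1200, "rate_per_hour": 16.0},
    "engine_b": {"runtime_s": 900,  "rate_per_hour": 32.0},
    "engine_c": {"runtime_s": 1500, "rate_per_hour": 8.0},
}

def cost_per_run(e: dict) -> float:
    """Fold runtime and price into a single comparable number."""
    return (e["runtime_s"] / 3600) * e["rate_per_hour"]

# Ranking by cost-per-run rewards engines that are both fast and cheap,
# rather than optimizing either dimension in isolation.
for name, e in sorted(engines.items(), key=lambda kv: cost_per_run(kv[1])):
    print(f"{name}: {e['runtime_s']}s per run -> ${cost_per_run(e):.2f} per run")
```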

Starburst offers the best combination of performance and cost for Iceberg compared to other SQL engines, whether they reside in a cloud data warehouse or power a cloud data lakehouse. The significant disparity can be attributed to Starburst’s smart indexing and caching layer, Warp Speed, which autonomously optimizes query execution and decreases the need for extensive compute resources. Starburst’s Warp Speed technology reduces cloud compute costs by up to 40%, which is particularly beneficial for businesses looking to maximize their budget without compromising on performance.

Choosing Starburst to run Apache Iceberg queries

The choice of a SQL query engine and data architecture is individual to each organization’s specific needs. However, there is no denying the rise in popularity of cloud data lakehouses, driven by low-cost object storage, modern table formats, and efficient SQL query engines like Trino and Starburst. For those looking to incorporate Apache Iceberg into their data ecosystems and take advantage of the opportunity to update their existing architecture, it may make sense to evaluate a cloud data lakehouse over a cloud data warehouse for its price and performance benefits. By adding flexibility, ease of use, and optionality, and by preventing vendor lock-in, a data lakehouse can at the very least supplement existing data environments as data volumes grow and data complexity increases.

If you are considering or already using Iceberg tables, Starburst represents a powerful and economical option, ensuring that data-driven decisions are not only fast but also cost-effective at enterprise scale. Building on the success of Trino and Apache Iceberg, Starburst has defined its own open data lakehouse blueprint, the Icehouse: an architecture with Trino as the query engine and Iceberg as the table format, with the goal of bringing warehouse-like experiences to a truly open data lake foundation.

To understand first-hand how organizations have upgraded to a data lakehouse and migrated to Apache Iceberg, read about Kovi’s journey toward substantial cost savings, including 85% faster ad-hoc queries, a 75% reduction in AWS S3 GET costs, and 55% faster ETL jobs.