Announcing Starburst & Trino Support for Polaris Catalog for Apache Iceberg
More deployment options
Since introducing the Icehouse concept in February 2024, we’ve seen a growing number of organizations looking to adopt the architectural patterns of Trino and Apache Iceberg across their data environments. Their primary question has been: How will it fit within their existing architectures?
Today’s announcement of open-sourcing Polaris Catalog is the next step in that integration journey.
We want to share why we’re so excited to be a launch partner, a few things we’ve learned so far, and what is coming next regarding our integration with Polaris Catalog.
Why does Apache Iceberg need a catalog?
Before sharing our excitement, let’s discuss why a catalog is needed and what the Apache Iceberg REST Catalog specification is. While you may know Iceberg as a table format that allows for multiple engines to safely perform updates and deletes within a table, users of Iceberg need a way to keep track of the different tables.
Every table mutation in Iceberg (e.g., INSERT or DELETE) creates new metadata files representing the new state of the table. This metadata contains information such as the schema, partitioning info, and the list of data files that comprise the table. This metadata structure is fundamental to how Iceberg supports its optimistic concurrency model, which is an essential requirement for multi-engine interoperability. Therefore, the primary responsibility of an Iceberg catalog is to maintain a pointer to the location of the current metadata for a given table name. Check out our recommended blog series to learn more about the Iceberg internals.
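To make the pointer idea concrete, here is a minimal sketch in Python of a catalog that does nothing but track the current metadata location per table and swap it atomically. It is a toy illustration under stated assumptions, not how Polaris Catalog or any real Iceberg catalog is implemented: the class name ToyCatalog, the CommitConflict exception, the table name, and the example paths are all hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer moved the pointer first."""


class ToyCatalog:
    """Hypothetical catalog: table name -> location of current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        # Returns None if the table does not exist yet.
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        # Atomic compare-and-swap: move the pointer only if it still
        # points at the metadata the writer based its changes on.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table} changed since it was read")
            self._pointers[table] = new


catalog = ToyCatalog()
prefix = "s3://example-bucket/orders/metadata/"  # hypothetical location
catalog.commit("sales.orders", None, prefix + "v1.metadata.json")

# Two engines read the same starting state, then both try to commit.
base = catalog.current("sales.orders")
catalog.commit("sales.orders", base, prefix + "v2.metadata.json")  # first wins

try:
    catalog.commit("sales.orders", base, prefix + "v2-other.metadata.json")
    conflict = False
except CommitConflict:
    conflict = True  # second writer must re-read the table and retry
```

The second writer loses because the pointer no longer matches what it read. In this optimistic model, it would refresh its view of the table, reapply its change on top of the new state, and try the commit again.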
Originally, AWS Glue and the legacy Apache Hive Metastore were the most popular catalogs for Iceberg tables. More recently, Iceberg 0.14.0 introduced the REST catalog specification to define more precisely how an Iceberg catalog should behave. An Iceberg REST catalog can be implemented in any language, and can be proprietary or open source, as long as it adheres to the Iceberg REST OpenAPI specification. As the REST catalog has become the standardized catalog protocol, more open-source and commercial catalogs are becoming compatible with the specification. That brings us to our next question.
What is the Polaris Catalog?
The Polaris Catalog is an open-source catalog for Iceberg that implements Iceberg’s open REST Catalog specification. The goal of Polaris Catalog is to create a shared data layer that enables multiple engines to read and write to the same data sets. As of today, Polaris Catalog integrates with a wide variety of open-source and commercial query engines, including Trino, Apache Spark, Apache Flink, Starburst, and Snowflake.
In addition to allowing multiple engines to read and write to Iceberg tables, Polaris Catalog also provides an access control layer. Remember, Iceberg is a table format and does not provide access control for granting or denying users permission to perform operations on tables. Polaris Catalog fills this gap with access control policies that are enforced as users interact with Polaris Catalog from multiple engines. This can help ensure consistent data governance of Iceberg tables across engines.
How do Iceberg & Polaris Catalog Open Up the Cloud Data Warehouse?
Until today, cloud data warehouses have mostly been closed ecosystems that rely on proprietary data and file formats, only accessible using a vendor-provided processing engine. Over the past several years, we’ve seen customers push back against this trend, instead turning to more open architectures like the open data lakehouse, where data remains in open formats at the table and file level, accessible by a growing ecosystem of open-source and proprietary engines.
Spurred by customer demand, it has been encouraging to see more cloud data warehouse vendors pivot to provide support for open formats like Apache Iceberg. However, the challenge of who owns and has access to the metadata effectively made these systems proprietary, keeping customers “locked in.”
Now, with Polaris Catalog, cloud data warehouse customers can free their data with Iceberg and their metadata with Polaris, allowing other engines, like Trino and Starburst, to directly access previously locked-down data alongside the cloud data warehouse’s existing closed-source engine.
At Starburst, we refer to this ability to choose which engine to use as optionality. Cloud data warehouse customers will now be free to choose the best engine for the unique needs of each workload, guaranteeing the best price-performance at every level.
Starburst and Trino Integrations with Polaris Catalog Now Available
So, what are we excited about as a launch partner for Polaris Catalog? As a long-time supporter and integrator of Apache Iceberg, Starburst has been compatible with the Iceberg REST catalog specification since its introduction two years ago. This means that Starburst Enterprise and Trino both support the Polaris Catalog out of the box!
We are quickly working to add support for Polaris Catalog into Starburst Galaxy, our fully managed platform, and it is expected to land soon.
When should I use Starburst with Polaris Catalog?
For companies hitting scale or cost issues with their cloud data warehouse, adopting Starburst alongside the Polaris Catalog empowers teams to quickly support new workloads in a fast, cost-efficient manner.
For example, with Starburst running alongside Snowflake, you can:
- Easily explore and transform your data directly in the data lake
- Optimize your budget by choosing the best engine for each workload
- Adopt the leading Icehouse architecture without an expensive migration
You can get started with Trino or Starburst Enterprise by following the REST catalog documentation for the Iceberg connector. We will follow up with a getting started guide soon.
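In the meantime, the sketch below shows roughly what a Trino catalog properties file for an Iceberg REST catalog could contain, generated by a small Python helper. The helper name render_rest_catalog_properties, the URI, the warehouse name, and the credential are placeholders invented for illustration. The property names are those we understand the Trino Iceberg connector to use for REST catalogs, so verify them against the documentation for your Trino or Starburst Enterprise version before relying on them.

```python
def render_rest_catalog_properties(uri, warehouse, credential=None):
    """Return the text of a Trino catalog properties file (hypothetical helper)."""
    lines = [
        "connector.name=iceberg",
        "iceberg.catalog.type=rest",
        f"iceberg.rest-catalog.uri={uri}",
        f"iceberg.rest-catalog.warehouse={warehouse}",
    ]
    if credential is not None:
        # OAuth2 client credentials in the form "client_id:client_secret".
        lines += [
            "iceberg.rest-catalog.security=OAUTH2",
            f"iceberg.rest-catalog.oauth2.credential={credential}",
        ]
    return "\n".join(lines) + "\n"


# All values below are placeholders, not real endpoints or secrets.
properties = render_rest_catalog_properties(
    uri="https://polaris.example.com/api/catalog",
    warehouse="example_warehouse",
    credential="CLIENT_ID:CLIENT_SECRET",
)
print(properties)
```

Saving the output under the Trino catalog directory, for example as etc/catalog/polaris.properties, would expose the tables under a catalog named polaris. The file name and catalog name are again only illustrative.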