State of data catalogs: The battle for your metadata

  • Tom Nats

    Director of Customer Solutions

    Starburst

In this blog post, I will compare the state of data catalogs in Starburst Galaxy, Databricks and Snowflake.

First, Google Bard defines a data catalog as: a central repository of metadata that describes the data assets of an organization. It helps data users find, understand, and trust the data they need to make informed decisions.

We’ll go with that definition, but let’s add governance, which is essentially access control.

Data catalog vs. Catalog

Second, let’s define the difference between a data catalog and a “catalog” in the table format sense, which stores metadata about tables in a given table format. The two most popular table formats as of now are Apache Iceberg and Delta Lake.

Apache Iceberg

Iceberg stores metadata “next” to the table data files on object storage. There is typically a data directory and a metadata directory. A “catalog” in this instance is a placeholder for a table and the location of its latest snapshot .json file. For more information on Iceberg, please see this set of blog posts.
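To make the “placeholder” idea concrete, here is a minimal Python sketch of what an Iceberg-style catalog has to track: a table name mapped to the location of the current metadata .json file, swapped only if no other writer got there first. The class, the method names and the s3:// paths are all hypothetical and are not any real catalog’s API.

```python
from dataclasses import dataclass, field
from typing import Dict


class CommitConflict(Exception):
    """Raised when another writer moved the pointer first."""


@dataclass
class PointerCatalog:
    """Hypothetical in-memory catalog: table name -> current metadata file."""

    _pointers: Dict[str, str] = field(default_factory=dict)

    def register(self, table: str, metadata_location: str) -> None:
        self._pointers[table] = metadata_location

    def current_metadata(self, table: str) -> str:
        return self._pointers[table]

    def commit(self, table: str, expected: str, new: str) -> None:
        # Swap the pointer only if nobody else committed in between.
        if self._pointers.get(table) != expected:
            raise CommitConflict(table)
        self._pointers[table] = new


# Hypothetical table and file locations, for illustration only.
catalog = PointerCatalog()
catalog.register("sales.orders", "s3://bucket/orders/metadata/00000-a.metadata.json")
catalog.commit(
    "sales.orders",
    expected="s3://bucket/orders/metadata/00000-a.metadata.json",
    new="s3://bucket/orders/metadata/00001-b.metadata.json",
)
print(catalog.current_metadata("sales.orders"))
```

Everything else about the table (schema, snapshots, data file lists) lives in the files on object storage, which is why whoever owns this small pointer effectively owns the table.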

Delta Lake

In this table format, metadata about the table is kept “next” to the table, much like Iceberg. The main difference is that only the table name and the location of the table directory are stored in the catalog.
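As a rough Python sketch of why the catalog entry can be this small: given only the table directory, an engine can work out the current table version by listing the _delta_log directory, where commit files are named with a zero-padded 20-digit version number. The catalog dictionary, the table name and the file listing below are made up for illustration.

```python
import re
from typing import Dict, Iterable, Optional

# Delta commit files are named with a zero-padded 20-digit version, e.g.
# 00000000000000000002.json, inside the table's _delta_log directory.
_COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def latest_delta_version(log_file_names: Iterable[str]) -> Optional[int]:
    """Return the highest committed version found in a _delta_log listing."""
    versions = [
        int(m.group(1))
        for name in log_file_names
        if (m := _COMMIT_FILE.match(name))
    ]
    return max(versions) if versions else None


# Hypothetical catalog content: just a name and a table directory.
delta_catalog: Dict[str, str] = {"sales.orders": "s3://bucket/orders/"}

# Hypothetical listing of s3://bucket/orders/_delta_log/
listing = [
    "00000000000000000000.json",
    "00000000000000000001.json",
    "00000000000000000002.json",
    "00000000000000000002.checkpoint.parquet",
    "_last_checkpoint",
]

print(delta_catalog["sales.orders"], latest_delta_version(listing))
```

A real reader would also use checkpoints to avoid replaying every commit; the sketch ignores them and only shows that the version is discovered from storage, not from the catalog.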

Now that we’ve quickly covered what Iceberg and Delta catalogs contain (pretty simple, eh?), let’s talk data catalogs: how each vendor supports table format catalogs for Delta Lake and Iceberg, and whether they can play nicely with each other.

Databricks (Delta Lake only)

Databricks created the Unity Catalog, which only works with Delta Lake. It’s a fully featured catalog that includes governance (role-based access control). Metadata on objects is applied via “tags”. In the past, Databricks has supported external catalogs such as AWS Glue and Hive-compatible metastores. The problem with these repositories is that they only store object metadata, not additional information such as access control rules and descriptive metadata. Hence the creation of the Unity Catalog and the migration of their customers to it over the coming years.

Databricks is migrating its existing customers to Unity and, thankfully, they have kept to their open source roots by exposing a Metastore API, illustrated below. Although this is still in preview as of this writing, they are working with external engines such as Trino to make sure there is full interoperability between them. Cheers to Databricks on this!

To be clear, what is being exposed on this API is table information as well as the location of the Delta log as covered above. Any additional metadata, access rules, etc. are all stored in the core Unity product located inside the Databricks environment.

Third-party access control and cataloging tools such as Alation, Atlan and Immuta will be able to interact with Unity through an API. This is a common theme with all three products I’m covering today.

So, when a table gets created in Unity, that table is also available in the Metastore API seen above. Databricks also plans to support creation of objects from other engines such as Trino. Again, they are being a good citizen by keeping an open architecture.

Their support of Glue and generic Hive metastores is going away (they call them Legacy), which makes sense because they would like their customers to only use Unity. I would like them to provide hybrid support for things like Glue, but I understand their reluctance to do it. Fair enough.

Snowflake (Apache Iceberg only)

Snowflake has recently thrown its hat into this open ecosystem by providing support for Apache Iceberg. They have created their own REST-based managed Iceberg catalog, and it’s in private preview as of now. The latest demo from their conference showed how they can pull Iceberg table metadata from AWS Glue and “sync” it with their catalog. It was said in this video that they plan to automate this. To be clear though, their proprietary Iceberg catalog will be the single source of truth for a table version for their end users.

Snowflake doesn’t really have a data catalog like Starburst Gravity (covered below) or Databricks Unity Catalog, so we’re just talking about databases, schemas and tables. 

NOTE: As of this writing, Snowflake talks about how open Iceberg is and says they are adding this support to work with other engines, but they have denied Starburst entry into their private preview program numerous times, so we have been unable to begin work on supporting their Iceberg catalog. This also applies to any other external engines, not just Starburst/Trino. Unlike Databricks, they talk about openness, but that appears to be just marketing at this point. I hope they change their minds on this soon…

Let’s discuss the diagram above. First, you have the Snowflake-managed Iceberg catalog. This catalog holds the metadata for managed and unmanaged Iceberg tables, according to the limited blogs and videos available (see the note above; Snowflake won’t communicate with us in any fashion on this, so this is all that can be gleaned at this time). From there, other engines such as Trino will be able to get a list of tables and their metadata so the tables can be queried directly from a customer’s object storage. Additionally, Snowflake supports a “refresh” command to pull Iceberg table metadata from AWS Glue and maybe other REST-based Iceberg catalogs, according to this video. So now they control your data and Iceberg metadata?

So, as you can see, both Databricks and Snowflake, in spite of claims of how open they are, will own and control their customers’ Delta and Iceberg catalogs. Common catalogs such as AWS Glue, Hive-compatible metastores and other REST-based Iceberg catalogs will not be supported. It will be up to the other engines to support their custom catalogs, which hopefully will include the ability to write in addition to read.

Starburst Galaxy and Gravity

Starburst Galaxy is a fully managed data lake and query federation platform based on Trino, the open source MPP SQL query engine created at Facebook.

Starburst Gravity is what I like to call a “hybrid” data catalog that lives within Galaxy. Gravity connects to over 21 different data sources, catalogs each of them and allows additional metadata to be added, such as descriptions, tags, comments, etc.

Gravity highlights:

  • Data catalog – Pull in table metadata from over 21 sources and add additional, searchable metadata.
  • Data governance – RBAC and ABAC (role-based and attribute-based access control) along with single sign-on (SSO) and System for Cross-domain Identity Management (SCIM).
  • Data products – Natively curate, share, and govern data products. Data teams can easily create new data products by seamlessly joining data from your data lake and surrounding sources without the hassle of data movement.

Galaxy supports many different types of Delta Lake and Iceberg catalogs, as you can see in the diagram below. We believe our customers and the industry are tired of being locked in and having their data, and now their metadata, locked into a certain vendor.

As you can see above, Gravity connects to a variety of other catalogs, allowing those repositories to remain the single source of truth for their objects (tables and views). From here, Gravity allows additional metadata to be created on these objects and stored in the Galaxy ecosystem even though the objects themselves are located in an external catalog. Additionally, there is a new feature coming out where Gravity will keep a history of these objects even if they are located in another catalog. For example, if there is an Iceberg table located in AWS Glue, Gravity will keep history on that table even if it’s dropped or modified. Very cool!
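The overlay idea can be sketched in a few lines of Python. This is purely illustrative and is not how Gravity is implemented: the class, its methods and the "glue"/"orders" names are hypothetical. The point is only that the external catalog keeps ownership of the objects, while tags and a change history are stored separately and survive a drop upstream.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

ObjectKey = Tuple[str, str]  # (external catalog name, table or view name)


@dataclass
class OverlayCatalog:
    """Hypothetical overlay: external catalogs own the objects, we own the extras."""

    tags: Dict[ObjectKey, Set[str]] = field(default_factory=dict)
    history: Dict[ObjectKey, List[str]] = field(default_factory=dict)

    def add_tag(self, key: ObjectKey, tag: str) -> None:
        self.tags.setdefault(key, set()).add(tag)

    def record(self, key: ObjectKey, event: str) -> None:
        self.history.setdefault(key, []).append(event)

    def sync(self, catalog: str, seen_before: Set[str], seen_now: Set[str]) -> None:
        # Compare two listings of the external catalog and log what changed.
        for name in sorted(seen_now - seen_before):
            self.record((catalog, name), "created")
        for name in sorted(seen_before - seen_now):
            # The object is gone upstream, but its tags and history stay here.
            self.record((catalog, name), "dropped")


overlay = OverlayCatalog()
overlay.sync("glue", seen_before=set(), seen_now={"orders"})
overlay.add_tag(("glue", "orders"), "pii")
overlay.sync("glue", seen_before={"orders"}, seen_now=set())
print(overlay.history[("glue", "orders")], overlay.tags[("glue", "orders")])
```

Keying the overlay by (catalog, object name) rather than copying the table definition is the design choice that lets the external catalog stay the source of truth.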

Support for the Snowflake REST API is on our roadmap but, as mentioned above, from our vantage point it appears that Snowflake is not working with the community to make it easy for other engines to access their closed Iceberg tables…

Starburst isn’t interested in locking table metadata away in a proprietary vendor or solution. We have always believed in an open ecosystem, and that includes precious metadata like this.

Wrapping Up

In this article, I have attempted to summarize what all three vendors are doing in the data cataloging and open table format catalog areas. 

Let’s quickly review these again:

Databricks: Table metadata (Delta Lake and possibly Iceberg and Hudi with their UniForm feature) will be kept in Unity, but they are providing an interface for other engines to read from and write to this repository.

Snowflake: Iceberg table metadata will be kept in their Snowflake-managed Iceberg catalog. It remains to be seen what sort of operations they allow.

Starburst: Support for Hive, Iceberg, Delta Lake and Hudi in a variety of catalogs including AWS Glue, traditional Hive-compatible metastores and newer REST-based catalogs like Tabular!

This all may change in 2024, but the writing is on the wall: although vendors will support open file formats such as Parquet, ORC and Avro, table format metadata is highly coveted as the next frontier of lock-in. It will be tightly controlled and owned by some vendors, while other vendors will continue to offer truly open architectures ensuring no data (and now metadata) lock-in…