Demystifying data catalogs and data products

In the data space, two terms often make their way into conversations, sometimes independently and sometimes together — data catalogs and data products. You’ve probably heard the terms used in presentations at conferences, read about them in data strategy articles, or perhaps you’ve used the terms to find products that try to solve your data governance problems. 

If you’re nodding in agreement, but are not grasping the distinction, you’re in the right place. The relationship between data catalogs and data products is crucial, yet nuanced. Below, we’ll unravel these terms together and explore how the features of Starburst Gravity work together to bring value to customers struggling with data discoverability, consumption, and trust.

Data catalogs increase data literacy

Both data catalogs and data products support your organization’s metadata management strategy and foster data literacy. However, they also serve distinct purposes.

Data catalogs serve as an organization’s comprehensive, secure inventory of all data assets, mapping out what data is available and where it is located. While the specifics differ between data catalog products, the key features of a data catalog typically include:

  • Automated and manual management of technical, operational, and business metadata
  • Data governance features
  • Search capabilities
  • Collaboration tools
  • Data lineage functions
  • APIs and connectors to various data sources

Data catalogs augment data management by establishing dynamic directories on top of all existing data sources. They provide clarity and coherence while promoting security.

Consider the complexities of modern data ecosystems: a large number of data sources, varying data formats, and, by extension, the risk of establishing new silos or fortifying existing ones. Metadata management, connectors, and search capabilities make your data ecosystem easy to navigate, interpret, and access, without data consumers having to rely on data producers to learn how to use each source system of interest.

Further, with increased accessibility comes the question of trustworthiness: “How do I know whether I should use this data?” Integrations between metadata management and data quality tools let data consumers determine whether a data set is fit for use, while data lineage lets them view the provenance of the data and quickly understand the quality of what flows into the data set. Together, these provide the information needed to build trust in data and to quickly answer the fitness-for-use question.

Lastly, as regulatory compliance and data privacy become mandatory components of business operations, the role of data catalog governance becomes paramount. Data governance enables organizations to adhere to regulatory and data privacy rules while keeping this transparent to data consumers. By leveraging the governance features of data catalogs, organizations can ensure that data consumers have access to the wide array of data in the data ecosystem, but only to what each consumer is allowed to consume. To the data consumer, this manifests in many forms including, but not limited to, search results constrained to the data sets they have access to, automated masking of columns to prevent leakage of PII, and the inability to inadvertently modify data sets.
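
To make two of these manifestations concrete, here is a minimal Python sketch of policy-constrained search and column masking. The `Policy` class, the `search` and `read` functions, and the in-memory `catalog` dictionary are hypothetical names invented for illustration; they are not part of any Starburst API, and a real catalog would enforce these rules in its governance layer.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical access policy for one data consumer (illustrative only)."""
    readable_datasets: frozenset[str]
    masked_columns: frozenset[str] = frozenset()


def search(catalog: dict[str, list[dict]], term: str, policy: Policy) -> list[str]:
    """Return matching data set names, limited to those the consumer may read."""
    return sorted(
        name
        for name in catalog
        if term.lower() in name.lower() and name in policy.readable_datasets
    )


def read(catalog: dict[str, list[dict]], name: str, policy: Policy) -> list[dict]:
    """Return copies of the rows with masked columns redacted."""
    if name not in policy.readable_datasets:
        raise PermissionError(f"no access to {name}")
    return [
        {col: ("****" if col in policy.masked_columns else val) for col, val in row.items()}
        for row in catalog[name]
    ]


# Assumed sample data: one data set the consumer may read, one they may not.
catalog = {
    "sales.customers": [{"name": "Ada", "email": "ada@example.com"}],
    "hr.salaries": [{"name": "Ada", "salary": 1}],
}
policy = Policy(frozenset({"sales.customers"}), frozenset({"email"}))

print(search(catalog, "a", policy))
print(read(catalog, "sales.customers", policy))
```

Because `read` returns redacted copies, the consumer never holds a reference to the underlying rows, which loosely mirrors the "cannot inadvertently modify" guarantee described above.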

Data products increase data value

Having established the critical role data catalogs play in data democratization, we turn our attention to the other side of the coin – data products. A comprehensive inventory of data provided by a data catalog is essential, but is just one piece of the puzzle. 

For data to be useful, first and foremost, it must provide value to the business. Whereas a data catalog’s strength lies in its ability to inventory everything, the sheer volume it contains means that data consumers must sift through an amalgamation of data, reports, spreadsheets, and more. One may go so far as to consider a data product to be data within a data catalog that meets certain criteria. However, it should be emphasized that a data product is more than just a technical manifestation. So what is a data product?

Before getting into any specific entity, let us be very clear — a data product must provide value to its intended consumers.

3 characteristics of a data product: Structural, process, and functional

Here at Starburst, within the context of our Enterprise and Galaxy data platforms, we define a data product as a dataset that meets structural, process, and functional characteristics. Let’s break this down. 

1. Structural characteristics of a data product

First, a data product dataset is a package that consists of:

  • A schema
  • One or more tables, views, or materialized views
  • Rigorous business and operational metadata
  • A data owner
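
As a rough illustration of this package, the structural parts can be modeled as a small Python record with a completeness check. The `DataProduct` class and its field names are assumptions made for this sketch, not Starburst's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical model of the structural parts listed above."""
    schema: str
    datasets: list[str] = field(default_factory=list)  # tables, views, materialized views
    business_metadata: dict[str, str] = field(default_factory=dict)
    operational_metadata: dict[str, str] = field(default_factory=dict)
    owner: str = ""

    def missing_parts(self) -> list[str]:
        """Name every structural part that is still empty."""
        checks = {
            "schema": bool(self.schema),
            "datasets": bool(self.datasets),
            "business_metadata": bool(self.business_metadata),
            "operational_metadata": bool(self.operational_metadata),
            "owner": bool(self.owner),
        }
        return [part for part, present in checks.items() if not present]


draft = DataProduct(schema="sales", datasets=["customer_revenue"])
print(draft.missing_parts())
```

A check like this could gate publication, so that a dataset is not registered as a data product until it has metadata and an owner.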

2. Process characteristics: a set of actions taken in developing the dataset to derive quality and value

  • Data-as-a-product thinking: applying product management best practices to find, derive, and ensure the business value of the data
  • Software development lifecycle-style best practices that keep teams agile while ensuring quality output

3. Functional characteristics: criteria to be met to foster trust and encourage utilization

  • Discoverability. Data products are published in a registry that is easy to find, access, and search, so that consumers can discover and use data sets of high quality and high business value
  • Understandability. Once discovered, data products should provide robust business and technical documentation, as well as information about the underlying representation of the data, so that consumers can quickly comprehend the nature of the data set and decide whether to consume it
  • Trustworthiness. The previous characteristic helps answer the question “is this the correct data product?” This characteristic answers the question “should I use this data product right now?” Information such as profile, lineage, and other trust-validating metrics gives users what they need to make a decision on consumption
  • Standardization for accessibility. Uniform standards enable downstream consumers and data teams to access any data product consistently and quickly, in the same manner
  • Interoperability. Data products should interoperate with the tools, skill sets, and languages of the consumer’s choice. For example, a business analyst will prefer SQL, whereas a data scientist may prefer Python for day-to-day activities in the tools of their choice
  • Security. Data products should be governed such that they are secure, ensuring adherence to regulatory and organizational data security policies while also ensuring correct access to meet the needs of specific teams

When a data set goes through product processes, is placed into a container that meets the functional characteristics, and has business and operational metadata applied, the output is a data product that exists as a technical manifestation in a searchable catalog or registry.

The result is a body of data that is easy to find and access, far easier to interpret from both the business and technical perspectives, and distinguishable from much of the noise a data catalog may bring. This lowers the barrier to entry for consumption and, with the right feedback loop between the data consumer and the data product owner, helps drive further iterative value from the registered data products.

Enter Starburst Galaxy and Gravity

With Starburst Gravity, you can have your cake and eat it too.

Gravity is a universal discovery, governance, and sharing layer in Starburst Galaxy that enables the management of all data assets connected to Galaxy.

Gravity provides a holistic platform made up of the following capabilities, which work together:

  • Data Source & Product Cataloging
  • Universal Search
  • Centralized Data Governance
  • Data Product Creation & Management
  • Federated Queries

Data cataloging with metadata management in Gravity enables you to increase data literacy and accessibility across your data sources and data products. Data product creation and management enable data teams to create data products and register them in a centralized registry for data consumers to view and use. The major differentiator of the Gravity data products feature is its ability to leverage the power of Trino.

Most data product workflows require centralizing data, which adds significant process and technical overhead for data movement. Gravity data products can be created from data federated across multiple sources and, by leveraging logical views, without any data movement. Teams are empowered to use and manage the infrastructure of their choice while providing curated data for general, repeated use without relying on a centralized data team, which lets them deliver data of business value quickly and iteratively.
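
The idea of a logical view over several sources can be illustrated with a small, self-contained Python analogy. This sketch does not use Trino or Galaxy: two attached in-memory SQLite databases stand in for two independent source systems, and the table, view, and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two separate in-memory databases stand in for two independent sources.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO sales.orders VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)])

# The "data product" is a logical view joining both sources; no rows are copied.
# (In SQLite, only a TEMP view may reference tables in other attached databases.)
conn.execute(
    """
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.name AS name, SUM(o.amount) AS revenue
    FROM crm.customers AS c
    JOIN sales.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    """
)


def revenue():
    """Query the view; results always reflect the current source tables."""
    return conn.execute(
        "SELECT name, revenue FROM customer_revenue ORDER BY name"
    ).fetchall()


before = revenue()
# A change in one source is visible through the view without any reload step.
conn.execute("INSERT INTO sales.orders VALUES (2, 2.5)")
after = revenue()
print(before, after)
```

Because the view stores only the query, each team keeps ownership of its source tables, and consumers see one curated dataset. A federated engine applies the same principle across genuinely separate systems.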

When data cataloging and data product creation and management are combined with Gravity’s centralized data governance, regulatory and PII policies can be applied uniformly and consistently to all data across all clouds and regions, regardless of whether it is a data catalog entry or a data product entry. Because the governance configuration surface is simpler, this reduces both the overhead of compliance and the chance of errors.

The ever-expanding galaxy

In the coming months, Starburst Galaxy will be introducing two exciting new Gravity features that can be applied across both the data catalog and data products:

  • Data Quality
  • Data Lineage

Data lineage will provide visibility into the data flow from the upstream sources of a data product, enabling data consumers to more confidently judge the trustworthiness of the data flowing into it. This helps answer the question “does the provenance of the data product make sense?” It will also serve as an impact analysis tool, allowing data producers to quickly address data issues when they occur, or even to assess and mitigate risk before any schema changes are executed.
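
Both uses of lineage, provenance for consumers and impact analysis for producers, come down to walking a dependency graph in opposite directions. The following Python sketch assumes a hypothetical `LINEAGE` mapping from each dataset to the datasets it reads from; the dataset names are invented, and this is not how Gravity stores or exposes lineage.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it reads from.
LINEAGE = {
    "customer_revenue": ["crm.customers", "sales.orders"],
    "sales.orders": ["raw.order_events"],
    "churn_report": ["customer_revenue"],
}


def upstream(dataset, lineage=LINEAGE):
    """Provenance: every direct or indirect source of `dataset`."""
    seen, queue = set(), deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:  # guards against revisiting shared or cyclic sources
            seen.add(node)
            queue.extend(lineage.get(node, []))
    return seen


def downstream(dataset, lineage=LINEAGE):
    """Impact analysis: every dataset that directly or indirectly reads `dataset`."""
    return {name for name in lineage if dataset in upstream(name, lineage)}


print(sorted(upstream("churn_report")))
print(sorted(downstream("raw.order_events")))
```

A producer planning a schema change to `raw.order_events` could run `downstream` first to see which data products would be affected.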

As a step in our data SLA strategy, the introduction of data quality will allow data producers and consumers to collaborate on establishing and monitoring metrics that indicate fitness for use, for both the catalog and data products. Visibility into a data product’s quality will help consumers answer the question “can I use this data right now?” More importantly, this feature will enable data producers to monitor and react to data issues more quickly, often before consumers see them, helping to foster trust and confidence between data producers and consumers and ultimately encouraging further consumption of data.
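
What a fitness-for-use metric might look like can be sketched in a few lines of Python. The two metrics chosen here (column completeness and freshness), the function names, and the default thresholds are all assumptions for illustration; the actual metrics in Gravity's data quality feature are not described in this article.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, column):
    """Fraction of rows in which `column` is present and not None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is not None) / len(rows)


def is_fresh(last_updated, max_age, now=None):
    """True if the data set was updated within `max_age` of `now`."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age


def fit_for_use(rows, last_updated, *, required_column,
                min_completeness=0.95, max_age=timedelta(hours=24), now=None):
    """Combine both metrics; the default thresholds are illustrative assumptions."""
    return (
        completeness(rows, required_column) >= min_completeness
        and is_fresh(last_updated, max_age, now)
    )


# Assumed sample: one of four rows is missing an email; data is 12 hours old.
rows = [{"email": "a@example.com"}, {"email": None},
        {"email": "b@example.com"}, {"email": "c@example.com"}]
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
last_updated = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)

print(completeness(rows, "email"))
print(fit_for_use(rows, last_updated, required_column="email", now=now))
```

Producers and consumers would agree on the thresholds together, and a producer monitoring these values can react before a consumer runs into the problem.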

The synergy of data catalogs and data products

Data catalogs and data products overlap in the value that they provide, namely increasing data literacy and interpretability. But in the grand scheme of things, they occupy different spaces in your data strategy.

Data catalogs serve as the secure backbone for an organization’s full data ecosystem. They give all users democratized access to all data and assist in decision-making around consumption. This is also a double-edged sword, in that data consumers are faced with all of the data and data types within an organization, regardless of value.

Data products manifest as part of a data catalog, but are intended to deliver value quickly to the consumer by providing curated, valuable data with a high degree of accessibility and interpretability in a secure and consistent manner. Gravity data products take this a step further: by leveraging federated query in Starburst Galaxy, they minimize or eliminate data movement, freeing data teams from centralized data processes and maximizing agility.

The soon-to-be-introduced data quality and lineage features in Gravity will also leverage the power of Galaxy to provide observability into your data, regardless of source, cloud provider, or region, as long as it is connected to Galaxy. Exciting times lie ahead in fostering data value generation and consumption, and we hope you join our journey as we evolve our data platform offering.
