Data Catalog
What is the difference between a data catalog and metadata?
Data catalogs use metadata — data that describes other data — to make assets easier to understand and use. Metadata provides context that helps business intelligence analysts and other users evaluate data for quality, currency, and trustworthiness.
Catalogs can contain three types of metadata: technical, operational, and business.
Technical metadata describes a data asset’s structure, including any tables, columns, and rows. Users can see the types of data stored in the asset as well as its residency.
Operational metadata documents an asset’s creation and change history — its provenance and lineage. In addition, operational metadata can reveal who has accessed data and what they have done with it.
Business metadata adds context, such as scores or classifications, explaining data’s relevance to the business. An enterprise data catalog will include a business glossary that defines terms consistently across the organization.
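In a Trino-based platform like Starburst, some of this metadata can be inspected and curated directly in SQL. A minimal sketch, assuming a hypothetical sales catalog with a crm.customers table whose connector supports comments:

```sql
-- Technical metadata: list the table's columns and data types.
DESCRIBE sales.crm.customers;

-- Business metadata: attach plain-language descriptions the catalog
-- can surface to analysts. (Connector support for comments varies.)
COMMENT ON TABLE sales.crm.customers
    IS 'One row per active customer; refreshed nightly from the CRM';

COMMENT ON COLUMN sales.crm.customers.customer_id
    IS 'Canonical customer identifier used across all domains';
```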
Related reading: What’s the difference between a data catalog and data products?
What are the benefits of data catalogs?
Data catalogs play a crucial role in data management. Engineers rely on this information for data asset curation, user access control, and compliance with governance and regulatory requirements. End users depend on catalogs to find the right data for their projects. By making enterprise data more accessible, reliable, and productive, data catalogs directly contribute to better data-driven business outcomes.
Other benefits of data catalogs include:
Data discovery
Catalogs overcome one of the greatest challenges in the data-driven enterprise: simply finding the right data. Despite centralization efforts, modern enterprises still rely on a distributed mix of on-premises servers and cloud storage services controlled by multiple domains and geographies.
Using a data catalog is like looking for something on Amazon. When shopping, you navigate into departments and sections before filtering the hundreds of results by product attributes. Those attributes change depending on the type of product you’re looking for—books and dog beds have different metadata.
Data catalogs work the same way. This data shopping experience lets users discover the best data for their projects faster.
Metadata management
Besides providing a central reference, catalogs help manage the metadata itself. For example, data becomes hard to find and reconcile when one asset uses “customer ID number” and another uses “Customer ID No.” to describe the same data. Data stewards use catalogs to ensure assets apply metadata consistently.
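As a rough illustration, assuming a Trino-based catalog and a hypothetical finance.billing.invoices table, a data steward might standardize the offending column name and document it against the glossary:

```sql
-- Hypothetical cleanup: align the column name with the glossary term.
ALTER TABLE finance.billing.invoices
    RENAME COLUMN "Customer ID No" TO customer_id;

COMMENT ON COLUMN finance.billing.invoices.customer_id
    IS 'Canonical customer identifier (see business glossary)';
```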
Ultimately, effective metadata management reinforces a sense of trust among data users. They know that the metadata they use is correct and consistent and will let them quickly discover the data they need.
Data governance, security, privacy, and compliance
A catalog’s metadata includes every asset’s provenance, lineage, residency, and access history. This information is an essential component of data governance. Catalogs make it easier to support audits and monitor governance compliance.
A modern data catalog helps companies automate regulatory compliance. For example, data privacy regulations vary geographically. Provenance and residency metadata will let access control systems automatically grant, limit, or deny access to sensitive data based on the data’s origin and storage location.
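The role-based layer underneath such rules can be sketched in Trino SQL (all names hypothetical); the attribute-based logic that reacts to provenance and residency is applied by the platform’s governance engine rather than written as plain SQL:

```sql
-- Simplified sketch: restrict a role to the EU-resident copy of the data.
CREATE ROLE eu_analyst;

GRANT SELECT ON eu_lake.customers.profiles TO ROLE eu_analyst;

-- Assign the role to a (hypothetical) user.
GRANT eu_analyst TO USER alice;
```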
Improved data lineage and provenance
A catalog provides a reliable way to evaluate data’s origin and change history. For example, financial analysts evaluating historical trends need to know when the basis for calculating a product’s bill of materials costs changed. The catalog shows the analysts how their data integration workflows must account for the change.
Increased data accessibility
Without a catalog, discovery becomes a protracted exercise in trial and error. Even worse, inconsistently formatted data can go undetected until after the analysis returns poor results.
Catalogs make data users more productive: they spend less time exploring data sets, and when they find something promising, the catalog’s metadata lets them quickly evaluate whether the data is fit for purpose.
Accessibility is particularly important for big data initiatives such as machine learning or artificial intelligence projects. Data scientists can explore data sets across the company, confident that the catalogs contain consistent, accurate, and up-to-date metadata.
Data democratization
Data catalogs are critical resources for data-driven business cultures. Democratized data access empowers users throughout the organization to discover, extract, and analyze data to support more effective decision-making.
This empowerment requires a self-service model that ends people’s dependency on data team expertise. Well-managed catalogs put data at everyone’s fingertips, letting them drill down through data layers and search for the information they need as easily as if they were shopping online.
Enterprise data catalog: What is an example of a data catalog?
Catalogs provide the framework for unifying distributed data sources in Starburst’s data lake analytics platform. These catalogs surface the metadata consumers use to quickly discover the data they need.
When engineers configure a data source’s catalog, they specify a Starburst connector designed for that data source. Starburst offers connectors for most enterprise data sources, including Amazon S3, Azure Data Lake, Google Cloud Storage, MySQL, and Snowflake.
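In self-managed Trino-based deployments with dynamic catalog management enabled, a catalog can even be created in SQL; Starburst Galaxy configures catalogs through its UI instead. A sketch with placeholder connection values:

```sql
-- Hypothetical MySQL catalog; every connection value is a placeholder.
CREATE CATALOG sales_mysql USING mysql
WITH (
    "connection-url" = 'jdbc:mysql://mysql.example.com:3306',
    "connection-user" = 'etl_service',
    "connection-password" = 'replace-me'
);
```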
In addition to providing secure, encrypted communications between users and sources, these connectors perform the tasks traditionally handled by custom ETL pipelines (see the federated query sketch after this list):
- Translate data types.
- Translate raw data.
- Handle variances in the data source’s SQL implementation.
- Adapt to the data source’s API.
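For example, once a MySQL catalog and an S3-backed lake catalog are configured (hypothetical names below), one federated query can join them, with each connector translating types and dialects behind the scenes:

```sql
-- Hypothetical federated query across two catalogs.
SELECT c.customer_id,
       SUM(o.order_total) AS lifetime_value
FROM sales_mysql.crm.customers AS c
JOIN lake_s3.orders.order_facts AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id;
```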
Once configured, catalogs are accessible to any authorized user through the Starburst catalog explorer. This is the central hub data consumers and engineers use to browse through a catalog’s schemas, tables, views, and columns.
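The same browsing is available from any SQL client; for instance, with the hypothetical catalog above:

```sql
SHOW CATALOGS;
SHOW SCHEMAS FROM sales_mysql;
SHOW TABLES FROM sales_mysql.crm;
SHOW COLUMNS FROM sales_mysql.crm.customers;
```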
Automated data cataloging solution: Starburst Gravity
We introduced Starburst Gravity to address the challenges of catalog management. Gravity is a universal discovery, governance, and sharing layer in Starburst Galaxy that lets teams manage every data asset connected to Galaxy. While catalogs can make data more discoverable and productive, enterprise data teams often struggle with:
Fragmented data infrastructure: Data and data ownership are scattered across sources, domains, and geographies.
Inconsistent governance: Ensuring consistent, high-quality data in a fragmented storage infrastructure is difficult without consistent standards and definitions.
Inconsistent metadata: Data assets are not consistently documented, making it harder to understand the data’s context and lineage.
Starburst Gravity creates a unified access and governance layer that lets data teams manage their catalog through a single pane of glass.
Automatic data cataloging is how Gravity pulls your disparate data sources together. When engineers connect a data source, Gravity extracts its metadata, including schemas, tables, and descriptions.
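One way to see what that extracted metadata looks like is through the standard information_schema that every catalog exposes (catalog and schema names here are hypothetical):

```sql
-- Inspect the columns cataloged for one schema.
SELECT table_name, column_name, data_type
FROM sales_mysql.information_schema.columns
WHERE table_schema = 'crm'
ORDER BY table_name, ordinal_position;
```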
Gravity binds every data lake, data warehouse, relational database, and any other data source, no matter where in the world, or in which cloud, your data resides. Starburst’s data lake analytics platform uses Gravity to turn your enterprise-wide data assets into a powerful discovery, governance, and collaboration system.
Accelerating discovery
Starburst Gravity lets users discover data anywhere in the enterprise. In addition to the catalog explorer, Galaxy’s schema discovery reveals newly added tables or views in a data lake. Data teams use schema discovery to promptly update catalogs, so users always have access to current information.
Galaxy’s universal search engine surfaces any data object in any data source. Traditionally, offering this capability to every employee would impose significant compute costs and subject users to bandwidth and latency limits. However, Galaxy’s search engine parses cached descriptions in the data catalogs, allowing unlimited compute-free discovery.
Center of governance
Gravity also streamlines data governance across the enterprise. In addition to fine-grained access controls based on user roles and data attributes, Gravity’s access control tags simplify policy enforcement at scale.
For example, placing a pii tag on a table’s column identifies personal data that may be subject to data privacy laws. Automated security rules can then combine this tag with the data’s provenance and residency attributes and the user’s role to enforce compliance with GDPR and other privacy regulations.
Bringing people together
Gravity lets end users and data teams quickly curate data products to enhance decision-making. A Gravity data product combines tables or views from multiple sources into a schema to support self-service analytics. Engineers can publish the product for widespread use or limit access to certain stakeholders to protect sensitive data or maintain compliance.
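Conceptually, and with every name below hypothetical, a data product behaves like a curated schema of views over federated sources; Gravity assembles these through Galaxy’s UI rather than raw SQL:

```sql
-- Sketch of a 'customer 360' data product as a schema of views.
CREATE SCHEMA lake_s3.customer_360;

CREATE VIEW lake_s3.customer_360.lifetime_value AS
SELECT c.customer_id,
       c.region,
       SUM(o.order_total) AS lifetime_value
FROM sales_mysql.crm.customers AS c
JOIN lake_s3.orders.order_facts AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.region;

-- Limit access to specific stakeholders.
GRANT SELECT ON lake_s3.customer_360.lifetime_value TO ROLE finance_analyst;
```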
Gravity’s universal discovery, governance, and sharing layer fosters a more collaborative data culture. Deep and consistent metadata gives engineers, analysts, and business users a common language as they work together to extract actionable insights from enterprise data.