Governing data products at scale: Promoting collaboration between producers and consumers
Lauren Kim
Customer Marketing Intern
Starburst
In the recent Governing data products at scale session, Laurent Dresse, Chief Evangelist at DataGalaxy, explained how to promote collaboration between data product producers and consumers.
DataGalaxy is a data catalog, or data knowledge catalog, solution provider with over 170 customers worldwide, known for steady growth across eight years of innovation. Its SaaS platform is easy to start with, agile, and fast: 90% of customers implement their first use case within two months.
DataGalaxy’s solution is highly regarded on platforms like Gartner and G2 and is versatile: it can be deployed on Azure, AWS, GCP, or OVH in Europe. Its customers come from various industries, all facing similar data-related challenges.
Understanding data governance
In order to better govern data products at scale and promote collaboration between producers and consumers, it’s essential for these groups to have a comprehensive understanding of their data products. Data governance is less about technology and more about solving issues related to people, communication, and collaboration.
Effective governance requires producers and consumers to work together to define the characteristics of data products. This includes documenting their lifecycle status, quality expectations (e.g., error rates under one percent), security measures (e.g., access policies), and privacy dimensions (e.g., whether sensitive data needs to be anonymized or scrambled).
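As a rough illustration of how these agreed characteristics might be written down (the structure and field names below are assumptions for illustration, not DataGalaxy’s or Starburst’s actual schema), a data product definition could look like this:

```python
from dataclasses import dataclass, field

@dataclass
class DataProductSpec:
    """Illustrative set of governed characteristics for a data product (field names are assumptions)."""
    name: str
    lifecycle_status: str              # e.g. "in progress", "validated", "reviewed", "obsolete"
    max_error_rate: float              # quality expectation, e.g. 0.01 for "error rate under one percent"
    access_policy: str                 # security measure, e.g. which roles may query the product
    anonymize_sensitive_fields: bool   # privacy dimension: scramble or anonymize sensitive data
    legitimate_uses: list[str] = field(default_factory=list)  # the use cases the product is governed for

# Example: a risk-management data product agreed on by producers and consumers.
risk_product = DataProductSpec(
    name="default_risk_analytics",
    lifecycle_status="in progress",
    max_error_rate=0.01,
    access_policy="restricted:risk-analysts",
    anonymize_sensitive_fields=True,
    legitimate_uses=["risk management reporting"],
)
print(risk_product)
```

Capturing the definition in one place like this is what lets a self-service platform expose the same facts to producers and consumers alike.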
Ethics in AI and organizational challenges in data governance
Additionally, ethics in AI revolves around the data that feeds it and the data products themselves. It’s crucial to determine the legitimate use of data products, which forms the basis of their governance. By clearly exposing this information on a self-service platform, consumers can understand the purpose and usage guidelines of data products.
This understanding extends beyond producers and consumers to the entire organization. Daily challenges for data producers and consumers include managing data products and navigating potential friction between different organizational parts, such as producers, business units, and other departments.
To manage data products at scale, it’s essential to standardize processes and operations. Producers face challenges in understanding the use case for each data product, which requires constant inquiry into the purpose and intended consumers. This necessitates interaction with stakeholders to define use cases.
The role of effective communication and collaboration
Effective communication is crucial for success. Producers must identify data sources (e.g., Snowflake, Oracle, AWS) and targets (e.g., BI dashboards, reports) to build the appropriate product. Finally, it’s important to ensure that this information is accessible to the organization in a self-service manner, addressing the needs of both producers and consumers.
Consumers need to understand the context of data products, which involves easily finding them on a self-service platform and comprehending their content and context. This requires a thoughtful process and interaction with data product owners or producers, creating a communication loop.
Scaling the data-driven decision journey hinges on the human aspect, ensuring collaboration in defining and governing data products. Successful scaling involves defining data products based on requirements, exporting this information to a self-service platform, and repeating the process with business stakeholders, domain owners, and data product owners.
A scalable approach to defining and building data products
Although there isn’t a one-size-fits-all solution, the suggested approach is a strong starting point for a data product strategy with proven scalability. Defining a data product follows a WYSIWYG (what you see is what you get) approach, in which producers and consumers collaboratively identify and select data sources from an existing data catalog.
This process begins by drawing out the data product, identifying necessary data sources, and then building the data product pipeline. This hands-on, visual approach helps ensure that all relevant information is effectively integrated and utilized.
With Starburst, the process involves detailing how information is extracted from various sources and domains, how pipelines are created, and how data products and underlying datasets are generated. This documentation is essential for involving the consumer part of the organization to define business use cases.
For example, a consumer might request a risk analysis dashboard showing delinquent payments or FICO scores. This approach ensures a comprehensive understanding of data flow, from extraction and transformation to storage and consumption. The dynamic nature of the data catalog means updates are reflected in real time, enhancing collaboration and the scalability of the data product pipeline.
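As a minimal sketch of how a consumer might read from such a data product once it is published on Starburst, the example below uses the open-source Trino Python client; the host, catalog, schema, and the delinquent_payments dataset and its columns are hypothetical, chosen only to mirror the risk-dashboard example above.

```python
import trino  # open-source Trino DBAPI client, commonly used to query Starburst clusters

# Hypothetical connection details; host, catalog, schema, and user are assumptions.
conn = trino.dbapi.connect(
    host="starburst.example.com",
    port=443,
    http_scheme="https",
    user="risk_analyst",
    catalog="data_products",
    schema="risk",
)

cur = conn.cursor()
# Hypothetical dataset and columns illustrating the delinquent-payments / FICO-score use case.
cur.execute("""
    SELECT customer_id, fico_score, days_delinquent, outstanding_balance
    FROM delinquent_payments
    WHERE days_delinquent > 90
    ORDER BY outstanding_balance DESC
    LIMIT 100
""")
for row in cur.fetchall():
    print(row)
```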
Comprehensive governance and integration of data products
Governing data products after their creation is crucial. This involves using a data catalog solution to describe the data product in detail, including its lifecycle status (e.g., in progress, validated, reviewed, obsolete) and the specific use case, such as risk management reporting. Governance also requires incorporating policies on ethics, access, security, quality, privacy, and data contracts. For example, a data contract might stipulate an error rate of less than one percent. Effective governance combines technical knowledge from data producers, sources, and pipelines with the business context, use case, and governance dimensions to ensure comprehensive management of data products.
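To make the data-contract clause concrete, here is a minimal, illustrative check, not a DataGalaxy or Starburst feature, that compares an observed error rate against the contracted one-percent threshold:

```python
def check_error_rate_contract(error_rows: int, total_rows: int, max_error_rate: float = 0.01) -> bool:
    """Return True if the observed error rate satisfies the contracted threshold (e.g. under 1%)."""
    if total_rows == 0:
        return True  # an empty delivery trivially has no errors
    observed = error_rows / total_rows
    print(f"observed error rate: {observed:.2%} (contract allows up to {max_error_rate:.0%})")
    return observed <= max_error_rate

# Example: 37 invalid rows out of 5,000 delivered is within contract; 80 is a breach.
assert check_error_rate_contract(error_rows=37, total_rows=5_000)
assert not check_error_rate_contract(error_rows=80, total_rows=5_000)
```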
This now leads to the two-way integration with Starburst, where information documented in the data catalog is pushed to Starburst to generate the data product on their platform. Once developed, updates from Starburst are sent back to the data catalog. This integration makes the data catalog a central self-service resource for searching and understanding data products and their content.
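The session does not detail the mechanics of this sync, but its shape can be sketched as two small steps: push the documented definition to the query platform, then pull build and status updates back into the catalog. The REST endpoints and payload fields below are purely hypothetical and stand in for whatever APIs the two products expose.

```python
import requests

CATALOG_API = "https://catalog.example.com/api"      # hypothetical data catalog endpoint
STARBURST_API = "https://starburst.example.com/api"  # hypothetical Starburst endpoint

def push_definition_to_starburst(product_id: str) -> None:
    """Push a data product definition documented in the catalog to the query platform."""
    definition = requests.get(f"{CATALOG_API}/data-products/{product_id}").json()
    requests.post(f"{STARBURST_API}/data-products", json=definition).raise_for_status()

def pull_updates_into_catalog(product_id: str) -> None:
    """Bring build status, refresh dates, and quality metrics back into the catalog."""
    status = requests.get(f"{STARBURST_API}/data-products/{product_id}/status").json()
    requests.patch(f"{CATALOG_API}/data-products/{product_id}", json=status).raise_for_status()
```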
A search engine within the data catalog, similar to Google, allows users to search for specific data products such as “risk.” The system then displays insights, indicating available products such as recently generated default risk analytics. Users can contextualize these data products by viewing detailed information, including datasets and specific fields like credit card information.
This approach enables consumers to understand the structure and operational details of data products, such as refresh dates and lifecycle status, ensuring that controls and other governance aspects are comprehensively documented.
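As a toy illustration of this search-and-contextualize flow (not DataGalaxy’s actual search engine or data model), a keyword lookup over catalog entries might look like this:

```python
# Toy in-memory catalog; the entry fields are illustrative, not DataGalaxy's data model.
catalog = [
    {
        "name": "default_risk_analytics",
        "description": "Risk analysis data product covering delinquent payments and FICO scores",
        "lifecycle_status": "validated",
        "datasets": ["delinquent_payments", "credit_card_information"],
        "last_refreshed": "2024-04-12",
    },
    {
        "name": "marketing_campaign_results",
        "description": "Campaign performance metrics for the marketing domain",
        "lifecycle_status": "in progress",
        "datasets": ["campaign_clicks"],
        "last_refreshed": "2024-04-10",
    },
]

def search_catalog(keyword: str) -> list[dict]:
    """Return catalog entries whose name or description mentions the keyword."""
    kw = keyword.lower()
    return [e for e in catalog if kw in e["name"].lower() or kw in e["description"].lower()]

for product in search_catalog("risk"):
    # Consumers contextualize a hit by inspecting its datasets, refresh date, and lifecycle status.
    print(product["name"], product["lifecycle_status"], product["datasets"], product["last_refreshed"])
```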
Monitoring and evolving data products through quality scores and policy visualization
In discussing data quality and policy visualization within the data catalog, monitoring and evolving data products over time is crucial. Data quality scores retrieved from Starburst inform adjustments, such as implementing additional controls to meet expectations like a 1% error rate. Policies defined through collaboration between consumers and producers ensure consistent adherence.
Data products, like any product, evolve—starting with version one and progressing based on quality metrics, usage, and consumer feedback to potentially reach version two. This evolution is centrally tracked and accessible in the data catalog for all stakeholders.
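The session does not prescribe versioning rules, but the idea that quality metrics and feedback drive the move from version one to version two can be sketched as a simple rule; the threshold and inputs below are assumptions for illustration only.

```python
def propose_next_version(current_version: int, quality_score: float, open_feedback_items: int) -> int:
    """Illustrative rule: cut a new version once quality is stable and consumer feedback is addressed."""
    if quality_score >= 0.99 and open_feedback_items == 0:
        return current_version + 1   # e.g. version 1 -> version 2
    return current_version           # keep iterating on the current version

print(propose_next_version(current_version=1, quality_score=0.995, open_feedback_items=0))  # 2
print(propose_next_version(current_version=1, quality_score=0.97, open_feedback_items=3))   # 1
```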
Communicating technical details and ensuring visibility of data products
After producers and consumers have finalized their data product with documentation on governance, quality, and contracts, the challenge is to communicate these technical details effectively to non-technical audiences within the organization. Visual representations that explain the data product comprehensively help ensure clarity across departments.
This approach involves outlining domains, data product rules, and business assets to demonstrate the macro view and integration of the data product within enterprise processes. By providing business definitions and technical details, such as data sources like Snowflake or AWS, the goal is to facilitate understanding and integration of the data product across the organization.
Exposing and explaining data products within an organization is like showcasing luxury jewelry in a high-end shop on Fifth Avenue: without visibility, potential customers won’t know what’s available.
DataGalaxy emphasizes making data products visible and understandable throughout the organization through visual representations. The key takeaway is that by defining and building data products effectively within their platform, DataGalaxy facilitates scalability and organization-wide adoption.
For more information, watch Laurent’s complete presentation from Data Universe 2024 below: