As the end-to-end open lakehouse platform, Starburst must ensure that data downtime issues can be mitigated proactively or resolved as quickly as possible. Failing to do so puts significant strain on the business, whether it's data teams being called to action at 2 a.m. or leaders unable to make timely, informed decisions – or any decisions at all, for that matter.
Because the accessibility and usability of data at scale are crucial to the successful operation of businesses, we introduced initial capabilities around data profiling, quality, and lineage in Starburst Gravity – our universal discovery, governance, and sharing layer in Starburst Galaxy. These capabilities help data teams understand the flow and shape of their data more quickly, and the feedback we received was, “We want more.”
Today, we’re excited to announce the public preview of three major enhancements in Galaxy that enable data teams to reduce data downtime:
- Column Lineage
- SQL-Based Data Quality Checks
- Schema Change Monitoring & Comparison
With today’s announcement, our vision is to provide our customers with out-of-the-box capabilities that increase visibility and simplify observability while continuing to enable the use of task-specific observability tools when needed.
Let’s dive in to explore the details.
Extending Table Lineage to Column Lineage (Public Preview)
Gravity previously introduced table lineage to view pipelines and data flow within Galaxy. This provides quick visibility into all assets that need to be considered for impact analysis, troubleshooting, or schema evolution. However, it still required users to dig into code to really understand the data flow across their ecosystem.
Today we introduce column lineage, which captures and visualizes data flow between all columns. The richness of information enables teams to carry out more thorough impact analysis before any changes are made to data pipelines, with a significant reduction in tedious code reviews. The same benefit is afforded for troubleshooting data as well – origins of data quality or accessibility issues are quickly traced back to the upstream source.
Further, with the ability to carry out concrete impact analysis, data teams can identify and communicate with the owners of affected data assets before data issues manifest. Lineage is captured automatically and in real time when queries in Galaxy result in data flow – no additional configuration is necessary. You only need to wait until pipelines kick off automatically or are triggered manually.
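As an illustration, a transformation like the one below (all table and column names are hypothetical) would be captured: Gravity records that orders_summary.total_spend derives from orders.price and orders.quantity, and that customer_id flows through unchanged.

```sql
-- Hypothetical CTAS: running this in Galaxy produces table- and
-- column-level lineage from lake.raw.orders to lake.analytics.orders_summary.
CREATE TABLE lake.analytics.orders_summary AS
SELECT
    customer_id,
    SUM(price * quantity) AS total_spend
FROM lake.raw.orders
GROUP BY customer_id;
```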
To access the lineage feature, follow these simple steps:
- Enable View all data lineage account privilege for your role
- Navigate to a table-level entity in the catalog that sent or received data from another table-level entity
- Click on the Lineage tab to view table lineage
- Enable the Show Columns toggle, then the Column Lineage toggle to view column lineage
Extending Data Profiling & Quality with SQL-Based Data Quality Checks (Public Preview)
Data profiling was previously introduced in the Gravity catalog, enabling users to generate, view, and set rules for table statistics on Iceberg, Delta Lake, and Hive tables. The per-column statistics consist of:
- Number of rows
- Null percentage
- Number of unique values
- Min / max values
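Conceptually, these are the same statistics Trino collects with ANALYZE and surfaces with SHOW STATS; a minimal sketch using a hypothetical table name:

```sql
-- Collect statistics for a hypothetical table.
ANALYZE lake.analytics.orders;

-- Inspect row count, null fraction, distinct-value count,
-- and low/high values per column.
SHOW STATS FOR lake.analytics.orders;
```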
While rules around table statistics provide a degree of value in defining service level objectives (SLOs) for data, decisions around consumption often require checks that go beyond the table-level statistics used for query performance optimization.
Today, we introduce the ability to author data quality rules in SQL. This enables data teams to work more effectively and comprehensively with data consumers, ensuring everyone has visibility into the health of the data where it matters most. Powered by the full flexibility of SQL and enhanced Trino, data teams can author table statistics checks, duplicate checks, and lookup table checks – a small sampling of use cases. Further, as this feature does not rely on ANALYZE capabilities, SQL data quality checks can be authored for table-level entities on any data source.
The extended data quality capabilities are available without the need for additional configuration. Access control settings do apply.
To use the new data quality feature, follow these steps:
- Navigate to a table-level entity, then click on the Quality tab
- Select a cluster to run quality check queries on
- Click Create quality check (or Applied rules, then Create quality check) and ensure Use custom SQL quality check is selected
- Click on Test SQL to enter a query editor
- Author a query that evaluates to a boolean – see templates below
- Hint: Any SQL statement can be written and tested, but to be saved, the statement must evaluate to a boolean. A useful pattern is to author a query that returns rows when a quality check fails, then wrap it in a CTE that evaluates to TRUE if no rows are returned or FALSE if one or more rows are returned
- After you are satisfied, save the quality check – the individual check can now be run manually
This step can be repeated to create multiple checks that can be executed manually all at once or scheduled via the existing job scheduling function.
Example templates:
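As a minimal sketch of the pattern from the hint above, here are a duplicate check and a lookup table check, each wrapping a “failure rows” query in a CTE that evaluates to a single boolean. All table and column names are hypothetical:

```sql
-- Duplicate check: passes (TRUE) when no customer_id appears more than once.
WITH failures AS (
    SELECT customer_id
    FROM lake.analytics.customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1
)
SELECT COUNT(*) = 0 FROM failures;

-- Lookup table check: passes (TRUE) when every order references
-- a known customer.
WITH failures AS (
    SELECT o.customer_id
    FROM lake.analytics.orders o
    LEFT JOIN lake.analytics.customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
)
SELECT COUNT(*) = 0 FROM failures;
```

Each statement returns exactly one boolean value, so either can be saved as a custom SQL quality check.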
You have three options for subsequently running data quality checks:
- Manually executing a single data quality check from the authored rule
- Manually executing all data quality checks via the table-level control
- Scheduling checks as a cron job via the table-level job scheduler
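As a sketch, assuming the job scheduler accepts standard five-field cron expressions (confirm the exact format in the Galaxy documentation), an expression such as `0 6 * * *` would run the checks daily at 06:00 UTC.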
Schema Change Notifications & History (Public Preview)
Schema changes, particularly schema drift, have the potential to impact downstream pipelines and assets, resulting in stale data, broken dashboards, and frustrated stakeholders. Further, limited visibility into those changes hampers troubleshooting efforts, often necessitating investigation of log files and code to determine what happened.
Today we introduce a set of features that give data teams greater visibility into schema changes, consisting of:
- Schema change notifications
- Schema change logs
- Daily schema snapshot comparisons
With schema change notifications, roles can be subscribed to schema or table-level entities to receive near real-time notifications whenever a potentially disruptive schema change occurs, including, but not limited to, column renames, type changes, and table drops. Notifications are sent within Galaxy with a link to a log of all changes for a particular schema. Once the changes are reviewed, users can leverage lineage to carry out an impact assessment, determining which downstream assets, such as dashboards, reports, and applications, are affected, how they are affected, and whom to communicate with.
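For example, DDL statements like the following (table and column names are hypothetical) represent the kinds of potentially disruptive changes that would trigger a notification:

```sql
-- Hypothetical schema changes that downstream assets may not tolerate:
ALTER TABLE lake.analytics.orders RENAME COLUMN price TO unit_price;
ALTER TABLE lake.analytics.orders ALTER COLUMN quantity SET DATA TYPE bigint;
DROP TABLE lake.analytics.orders_staging;
```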
In addition, Gravity captures daily snapshots of the schema, enabling data teams to easily resolve the common situation of “my dashboard was working a few days ago, but not today.” Data teams can carry out differential comparisons of the schema on consecutive or non-consecutive days to determine potential changes that may be impacting the data consumers.
You will need to wait until schema changes occur to see any records. To get started with these features:
- Navigate to a schema or table-level entity you have access to in the Gravity catalog
- Click on the kebab (⋮) button
- Select View notifications to configure one or more roles to receive notifications
- Select View history to see the schema snapshots or change history
What’s next?
These new features are in public preview and available for you to explore after signing up and connecting a catalog to your free account! Important: Some features will only display information if transformations or schema changes have been executed.
A look ahead
As Gravity functionality improves and provides increasing value, we also need to ensure that it scales to meet growing consumption. To this end, we're exploring dashboard-like capabilities to aggregate and review data issues at scale, as well as integrations that improve notification and data issue management with the tools you love.
For those who leverage Starburst as part of a broader data ecosystem, we realize that integrating observability information and processes with the rest of that ecosystem is just as important as ensuring your lakehouse powered by Starburst runs efficiently and effectively. To that end, we are also exploring how to integrate our capabilities and metadata with other solutions in the metadata management category.
We hope to be able to share more in the near future!
What are some next steps you can take?
Below are three ways you can continue your journey to accelerate data access at your company:
- Sign up for Starburst Galaxy and connect a catalog to your free account
- Automate the Icehouse: Our fully-managed open lakehouse platform
- Follow us on YouTube, LinkedIn, and X (Twitter)