Data Observability
Pillars of data observability
Adopting a data observability framework lets data engineering teams implement the technologies and processes needed to maximize the quality of enterprise data assets.
One such framework consists of the following five pillars of data observability:
1. Data quality
High-quality data is essential to data-driven decision-making. Data analysts depend on access to quality data sources to create the dashboards and other products company leadership relies upon to manage the business. Without high-quality inputs, data scientists waste time preparing data for machine learning and artificial intelligence projects.
Observability’s data quality pillar describes the state of an organization’s data sets. Data quality monitoring systems generate statistics about the contents of every data table, including:
- Volume: How many rows and columns are in the table?
- Timeliness: How recently and frequently are data tables refreshed?
- Range: What are the maximum and minimum values in each table?
- Errors: How many NULL values appear in the data set?
By identifying data quality issues quickly, observability tools jumpstart the resolution process.
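To make these statistics concrete, here is a minimal profiling sketch in Python. It assumes the table has been loaded into a pandas DataFrame and that a `loaded_at` column (an illustrative name, not part of any specific product) tracks refresh times.

```python
# Minimal data-profiling sketch: computes the four statistics described above
# for a table loaded into a pandas DataFrame. Column names such as "loaded_at"
# are illustrative assumptions.
import pandas as pd


def profile_table(df: pd.DataFrame, refreshed_at_column=None) -> dict:
    profile = {
        # Volume: how many rows and columns are in the table?
        "row_count": len(df),
        "column_count": df.shape[1],
        # Errors: how many NULL values appear in each column?
        "null_counts": df.isna().sum().to_dict(),
        # Range: minimum and maximum values for each numeric column.
        "ranges": {
            col: (df[col].min(), df[col].max())
            for col in df.select_dtypes(include="number").columns
        },
    }
    # Timeliness: how recently was the table refreshed?
    if refreshed_at_column and refreshed_at_column in df.columns:
        profile["last_refreshed"] = pd.to_datetime(df[refreshed_at_column]).max()
    return profile


if __name__ == "__main__":
    orders = pd.DataFrame(
        {
            "order_id": [1, 2, 3],
            "amount": [19.99, None, 42.50],
            "loaded_at": ["2024-01-01", "2024-01-02", "2024-01-03"],
        }
    )
    print(profile_table(orders, refreshed_at_column="loaded_at"))
```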
2. Data governance
Data governance is the set of policies that aligns data management with business strategy. These policies include the criteria observability systems use to evaluate data quality statistics. Do the values within a table fall within an acceptable range? Is the table’s volume larger or smaller than expected? Observability rules based on governance standards will quickly identify data quality issues.
Observability policies can also identify secondary effects of poor data quality. For example, a change to the organization of data tables could signal an underlying issue. Data observability solutions will monitor for schema changes, generate alerts, and identify who made the change. An investigation can then determine what underlying issues drove the change and whether the revised schema impacts other workflows.
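As an illustration of how governance criteria can drive observability checks, the following hedged sketch evaluates profiling output (such as the `profile_table()` result above) against simple volume and range rules and surfaces schema changes. The rule structure, thresholds, and field names are assumptions made for the example.

```python
# Hedged sketch: governance policies expressed as simple rules evaluated
# against profiling statistics, plus a basic schema-change check.
def evaluate_rules(profile: dict, rules: dict) -> list[str]:
    violations = []
    # Volume rule: is the table larger or smaller than expected?
    min_rows, max_rows = rules.get("row_count_range", (0, float("inf")))
    if not (min_rows <= profile["row_count"] <= max_rows):
        violations.append(f"row_count {profile['row_count']} outside {min_rows}-{max_rows}")
    # Range rules: do the values within each column fall in an acceptable range?
    for col, (lo, hi) in rules.get("value_ranges", {}).items():
        observed_lo, observed_hi = profile["ranges"].get(col, (lo, hi))
        if observed_lo < lo or observed_hi > hi:
            violations.append(f"{col} values outside acceptable range {lo}-{hi}")
    return violations


def detect_schema_change(previous_columns: set, current_columns: set) -> dict:
    # Surfaces added and removed columns so an alert can name exactly what changed.
    return {
        "added": sorted(current_columns - previous_columns),
        "removed": sorted(previous_columns - current_columns),
    }
```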
3. Data lineage
Data lineage visualizes the metadata generated as data flows through data architectures from its source to each downstream data product and repository. This metadata not only describes where and when data travels but also how pipelines transform the data along the way. Data lineage plays a critical role in troubleshooting data quality issues, discovering their root causes, and understanding their downstream impacts.
Data engineers use lineage tools to conduct a root cause analysis that identifies where an issue appears in the data lifecycle. In addition, they can see any dependencies further downstream that the bad data could impact. By quickly identifying the source and impact of an issue, data lineage contributes to a quick resolution and minimizes data downtime.
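The sketch below illustrates the idea with a toy lineage graph: edges point from upstream assets to the assets built from them, so walking the graph forward yields the downstream blast radius of an issue, while walking it backward yields candidate root causes. The asset names are hypothetical.

```python
# Hedged sketch: lineage as a directed graph of table/pipeline dependencies,
# traversed in both directions for impact analysis and root cause analysis.
from collections import deque

# Edges point from an upstream asset to the assets built from it.
lineage = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_orders"],
    "fct_orders": ["revenue_dashboard", "churn_model"],
}


def downstream_impact(start: str, edges: dict) -> set:
    """Every asset that could be affected by bad data appearing at `start`."""
    impacted, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


def upstream_sources(target: str, edges: dict) -> set:
    """Candidate root causes: every asset the affected product depends on."""
    reversed_edges = {}
    for parent, children in edges.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream_impact(target, reversed_edges)


print(downstream_impact("stg_orders", lineage))       # fct_orders, revenue_dashboard, churn_model
print(upstream_sources("revenue_dashboard", lineage))  # fct_orders, stg_orders, raw_orders
```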
4. Data reliability and data pipelines
Data integration depends upon extract, transform, and load (ETL) and extract, load, and transform (ELT) pipelines to connect data warehouses, data lakes, and other repositories with their sources. However, the proliferation of sources, repositories, and data products has turned pipeline maintenance into a significant burden for data management teams. As data reliability declines, consumers lose trust in their data products, undermining business decision-making.
Data observability platforms let engineers manage data health across hundreds or thousands of pipelines. With enhanced visibility and scalable automation, data teams can evaluate the quality of every pipeline’s inputs and outputs as well as the quality of transformations within the pipeline. Streamlining pipeline maintenance improves data reliability and reduces data downtime.
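One lightweight way to picture this is a wrapper that checks a pipeline step's input and output before passing data along, as in the hedged Python sketch below. The transform, threshold, and column name are illustrative assumptions, not a description of any specific platform.

```python
# Hedged sketch: wrapping a pipeline transformation with simple input and
# output checks so quality problems surface at the step where they appear.
import pandas as pd


def checked_step(transform, max_null_fraction: float = 0.05):
    def wrapper(df: pd.DataFrame) -> pd.DataFrame:
        if df.empty:
            raise ValueError("pipeline step received an empty input")
        result = transform(df)
        null_fraction = result.isna().mean().max()  # worst column's share of NULLs
        if null_fraction > max_null_fraction:
            raise ValueError(f"output null fraction {null_fraction:.1%} exceeds threshold")
        return result
    return wrapper


@checked_step
def normalize_amounts(df: pd.DataFrame) -> pd.DataFrame:
    # Example transform: coerce amounts to numeric; invalid values become NULL.
    out = df.copy()
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out
```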
5. Data monitoring
This fifth pillar of data observability comprises the data monitoring systems that enable rapid anomaly detection and mitigation. Profiling through machine learning models allows these systems to distinguish good data from bad. Automated notifications and data health dashboards surface issues and help data teams prioritize their responses.
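A minimal sketch of the idea: learn what a metric normally looks like from its history and alert when a new observation drifts too far from it. Here a simple z-score on daily row counts stands in for the machine learning profiling a production system would use; the numbers and threshold are illustrative.

```python
# Hedged anomaly-detection sketch: compare the latest metric value against the
# mean and spread of its recent history.
import statistics


def is_anomalous(history, latest: float, threshold: float = 3.0) -> bool:
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # avoid division by zero on flat history
    return abs(latest - mean) / stdev > threshold


daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_890]
if is_anomalous(daily_row_counts, latest=4_300):
    # In a real system this would notify the on-call engineer or update a
    # data health dashboard rather than print.
    print("ALERT: row count dropped far below its normal range")
```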
What data observability is not
The term “data observability” is useful for understanding the parallels between this emerging DataOps technique and the more established DevOps practice of application observability. Developers use logs, metrics, and traces to monitor the state of applications and microservices and resolve issues quickly. Although analogous, data observability is not the same practice: it focuses on the quality and reliability of an enterprise’s data rather than on application health.
In addition, some traditional data management practices, such as data testing and pipeline monitoring, resemble data observability. Where they fall short is in providing holistic, scalable methods for managing data quality.
As unit testing lets DevOps teams evaluate code, data testing lets DataOps teams assess the impacts data issues may have downstream. By its nature, data testing can only evaluate known issues. The practice cannot predict or identify unknown causes of poor data quality.
Scope is another way testing differs from observability. Engineers typically conduct tests while developing a pipeline or a step within that pipeline. Their testing only partially assesses the potential impact of data issues within the context of the entire data ecosystem. Observability provides scalable, end-to-end visibility into data quality throughout the data lifecycle.
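For contrast, the hedged sketch below shows what a data test typically looks like: each check encodes a known expectation about one output, and anything nobody thought to write a check for passes silently. Column names are hypothetical.

```python
# Hedged sketch of data testing: each check encodes a known expectation about
# a pipeline output, much like a unit test assertion.
import pandas as pd


def run_data_tests(orders: pd.DataFrame) -> list[str]:
    failures = []
    if orders["customer_id"].isna().any():
        failures.append("orders.customer_id contains NULLs")
    if (orders["amount"] < 0).any():
        failures.append("orders.amount contains negative values")
    # A silent duplication of rows, a late refresh, or a skewed distribution
    # would pass these tests because nobody anticipated them.
    return failures
```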
Another traditional data management practice with similarities to data observability is pipeline monitoring. This technique evaluates whether a pipeline performs its extract, load, and transform functions correctly. However, pipeline monitoring tools may not always assess data quality throughout the pipeline. A perfectly functioning pipeline will output poor-quality results from poor-quality inputs. Unlike this traditional approach, observability systems monitor the data entering, flowing through, and exiting pipelines for quality issues.
Why is data observability important?
Data has become the engine of modern business, which places a premium on data quality. Beyond the impact of chronically poor data quality on decision-making, each quality incident further disrupts the business.
Data downtime, the period between the occurrence of an issue and its resolution, can render data products unusable. Furthermore, downtime consumes a data management team’s limited resources, reducing its ability to support business objectives.
With data volumes and complexity increasingly outpacing traditional data management tools, enterprises have adopted more scalable observability practices to improve the quality of their data systems.
How is observability done today?
Data observability is an emerging practice with varied modes of implementation. Initiatives may adopt any combination of the following approaches.
Internal development
Given the emerging nature of data observability practices, existing solutions may not meet the needs of large organizations with complex data architectures. These companies often develop their own observability systems to monitor and address data quality issues. While this approach provides control over the observability system’s capabilities, it requires a long-term commitment to development and maintenance.
System integration
Third-party data observability tools may not provide complete end-to-end solutions to a company’s unique data quality needs. Integrating tools from multiple vendors can fill those gaps and complement the company’s existing data infrastructure. However, system integration adds its own complexities and expenses.
Data platform extension
Some data platforms and cloud services offer built-in observability features that work seamlessly within their solutions without additional development or system integration. These features may not be as comprehensive as those offered by specialized observability solutions providers, yet they may sufficiently enhance data quality with minimal effort.
Starburst data observability tools
Starburst Galaxy is a modern data lake analytics solution that federates disparate data sources within a virtual access layer. Starburst Gravity adds a management layer that streamlines discovery, access control, and governance. Gravity now includes data observability features that let data teams:
- Evaluate the state of data lake tables.
- Define pass/fail data quality rules.
- Identify and investigate data quality issues.
Gravity’s data profiling features generate statistics about the tables in Iceberg, Delta, and Hive data lakes. Engineers get one-click insights into data volumes, NULL and unique values, and data value ranges.
Gravity’s data quality criteria automate the evaluation of data profiling statistics. Data teams can define single or multi-column rules, and Gravity will generate alerts with each profiling event.
Gravity’s data lineage capabilities provide the tools engineers need to investigate these events. Visualizations of data flows within a data ecosystem map transformations leading up to and cascading downstream from a quality issue.
Individually, these features enhance data team productivity. Taken together, Gravity’s data observability capabilities empower holistic data quality monitoring and issue resolution workflows.