Data Lineage

Data lineage involves detailed metadata management that records and visualizes the journey of data elements: from their sources, through the transformations they undergo in various systems, to their final consumption points.

Data is the lifeblood of successful organizations. It powers operations, from manufacturing to customer service. Strategic decision-making is only possible with the insights companies can plumb from the depths of their data lakes. The question every business analyst, every executive, and every director must ask is, how reliable is our data?

Enter data lineage. These data management processes chart the flow of data from the point of generation to the dashboards, reports, and analyses that inform decision-making across the organization. Data lineage helps ensure the quality of data stored in the company’s data warehouses and provides the context business users need to use data correctly.

What is data lineage?

Detailed metadata lies at the center of data lineage systems, tracking data and its transformations through the company’s information pipeline. Visualization tools and other methods display where data elements come from, how one system after another transforms them, and where data is ultimately consumed.

Having this global view of enterprise data flows provides traceability and context. Data lineage tools can trace errors to their root cause. For example, an engineer can investigate where a new ETL pipeline went wrong. Also, business users can make better data sourcing and processing choices by placing data in context — where it came from and how it was generated.
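To make root-cause tracing concrete, here is a minimal sketch that treats lineage as a graph and walks it upstream from a broken report. The dataset names and the `UPSTREAM` mapping are hypothetical, and a real lineage tool would collect this metadata automatically rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage metadata: each dataset maps to the datasets it is built from.
UPSTREAM = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}

def trace_upstream(dataset, upstream):
    """Return every dataset that feeds `dataset`, nearest ancestors first."""
    seen = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen

# An engineer investigating a wrong number on the dashboard gets an ordered
# list of places to check, starting with the closest inputs.
ancestors = trace_upstream("revenue_dashboard", UPSTREAM)
```

The breadth-first order is a design choice: it lists the nearest inputs first, which is usually where an investigation starts.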

The sheer volume of data flowing through modern companies every day, combined with the many uses data-centric organizations have for that data, drives complexity and risk.

One of the biggest risks arising from this complexity is regulatory non-compliance. Data going to the wrong place or being used inappropriately could have significant consequences. For example, violating data privacy or data protection regulations like the European Union’s General Data Protection Regulation (GDPR) can result in hefty fines. Without accessible explanations of data flows, demonstrating compliance to auditors requires enormous investments in time, money, and resources.

Understanding data lineage (and data provenance)

Provenance and lineage are distinct but closely related terms. The former describes something’s origin, while the latter describes the journey to its destination.

In data management, provenance is the metadata that describes where data was generated and its structure, format, and other characteristics. Lineage is the metadata generated as this raw data flows from its source, through intermediate databases, into data warehouses or data lakes, and onwards to business reports and analysis. Lineage also includes metadata describing the data transformations throughout this data lifecycle.
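One way to picture the distinction is as two kinds of records: a single provenance record describing the origin, and a growing list of lineage steps describing the journey. The sketch below assumes hypothetical class names, field names, and dataset names; it is not a schema from any particular lineage product.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Provenance:
    """Origin metadata: where the data was generated and what it looks like."""
    source_system: str
    data_format: str
    columns: tuple

@dataclass(frozen=True)
class LineageStep:
    """One hop in the journey: where data came from, what changed, where it went."""
    source: str
    transformation: str
    target: str

@dataclass
class DatasetHistory:
    provenance: Provenance
    steps: list = field(default_factory=list)

    def record(self, source, transformation, target):
        self.steps.append(LineageStep(source, transformation, target))

    def journey(self):
        """Dataset names in the order the data passed through them."""
        if not self.steps:
            return []
        return [self.steps[0].source] + [step.target for step in self.steps]

history = DatasetHistory(Provenance("pos_terminals", "csv", ("order_id", "amount", "sold_at")))
history.record("raw.orders", "deduplicate on order_id", "staging.orders")
history.record("staging.orders", "aggregate amount by day", "warehouse.daily_sales")
history.record("warehouse.daily_sales", "filter to current quarter", "reports.quarterly_sales")
path = history.journey()
```

The provenance record is written once and never changes, while the list of steps grows every time the data moves or is transformed.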

Data lineage vs. data governance

The set of policies and processes that define how an organization manages data is called data governance. By following these rules, companies ensure their data’s quality and security. Data lineage is a means of evaluating data governance. Lineage tools let businesses confirm that where data resides, how it is changed, and where it flows all comply with governance rules.

Data lineage vs. data classification

Data classification assigns labels to data based on some criteria. For example, a company may create security classifications to distinguish between public, confidential, and regulated data. Data lineage can enhance classification by identifying where each type of data is being used.
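As a rough illustration of lineage enhancing classification, the sketch below pushes labels assigned at the source down through a lineage graph, so every derived dataset shows which classes of data it contains. The dataset names are hypothetical, and the rule that a derived dataset inherits every label of its inputs is a conservative assumption, not something the classification scheme above prescribes.

```python
# Hypothetical downstream edges: dataset -> datasets derived from it.
DOWNSTREAM = {
    "crm.customers": ["staging.customers"],
    "web.page_views": ["staging.sessions"],
    "staging.customers": ["mart.customer_360"],
    "staging.sessions": ["mart.customer_360", "mart.traffic_report"],
}

# Labels assigned at the source by the classification process.
SOURCE_LABELS = {
    "crm.customers": {"confidential"},
    "web.page_views": {"public"},
}

def propagate_labels(downstream, source_labels):
    """Assumed rule: a derived dataset inherits every label of its inputs."""
    labels = {name: set(tags) for name, tags in source_labels.items()}
    pending = list(source_labels)
    while pending:
        current = pending.pop()
        for child in downstream.get(current, []):
            before = labels.setdefault(child, set())
            merged = before | labels[current]
            if merged != before:
                labels[child] = merged
                pending.append(child)  # revisit children when labels grow
    return labels

labels = propagate_labels(DOWNSTREAM, SOURCE_LABELS)

# Where is confidential data being used?
confidential = sorted(name for name, tags in labels.items() if "confidential" in tags)
```

In this example `mart.customer_360` ends up carrying both labels because it combines customer and session data, while `mart.traffic_report` stays public.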

Data lineage use cases

Implementing a data lineage solution supports the entire organization. These initiatives support anyone managing, analyzing, or using data. Among the most important use cases are: auditing, security, and compliance; data migrations; data quality; and data-driven decision-making.

Auditing, security, and compliance

Data lineage’s ability to deliver granular views of how data flows through an organization is essential to its security and regulatory compliance programs.

Consider a company that serves the US healthcare industry. HIPAA regulations require it to protect patient information from unauthorized access. But many data systems depend on access to databases that hold medical records. Compliance managers can use data lineage tools to verify that protected information is not leaking beyond secure systems.
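A simplified version of that verification might look like the sketch below, which scans recorded data flows for protected fields arriving at a system outside the secure set. The system names, field names, and flow records are all hypothetical, and the check illustrates the idea only; it is not a statement of what HIPAA requires.

```python
# Hypothetical inventory: systems approved to hold protected health information.
SECURE_SYSTEMS = {"ehr_db", "claims_warehouse"}

# Fields treated as protected for this illustration.
PROTECTED_FIELDS = {"patient_name", "diagnosis_code"}

# Hypothetical lineage records: (source system, target system, fields moved).
FLOWS = [
    ("ehr_db", "claims_warehouse", {"patient_name", "diagnosis_code"}),
    ("claims_warehouse", "marketing_crm", {"patient_name", "zip_code"}),
    ("claims_warehouse", "public_stats", {"zip_code"}),
]

def find_leaks(flows, secure, protected):
    """Return flows that carry protected fields to a system outside the secure set."""
    leaks = []
    for source, target, fields in flows:
        exposed = fields & protected
        if target not in secure and exposed:
            leaks.append((source, target, sorted(exposed)))
    return leaks

leaks = find_leaks(FLOWS, SECURE_SYSTEMS, PROTECTED_FIELDS)
```

Here the only flagged flow is the one sending `patient_name` to `marketing_crm`; the flow to `public_stats` passes because it carries no protected fields.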

Any company doing business in Europe faces similar data privacy demands from GDPR. Data lineage solutions can reduce compliance risk by identifying opportunities for data minimization across the company’s data assets.

Migrations

System migrations are fraught with data challenges. Which data matters and which is irrelevant? What dependencies need to be replicated? What impacts could the new system have on business users?

Data migration projects informed by data lineage run much more smoothly. These solutions identify where the new system needs to source data, whether deleting obsolete data is possible, and how to prevent disruptions to users of the new system.
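The sketch below shows one way lineage metadata could answer the first two of those questions. It starts from the tables people still use, walks their dependencies, and treats anything never reached as a candidate for retirement. The table names are hypothetical, and the list of active consumers is assumed to come from something like access logs.

```python
# Hypothetical lineage for a legacy warehouse: table -> tables it reads from.
READS_FROM = {
    "legacy.sales_report": ["legacy.orders", "legacy.customers"],
    "legacy.churn_model_input": ["legacy.customers"],
    "legacy.orders": [],
    "legacy.customers": [],
    "legacy.old_promo_codes": [],
}

# Tables that users or applications still query directly (assumed known).
ACTIVE_CONSUMERS = {"legacy.sales_report", "legacy.churn_model_input"}

def migration_plan(reads_from, active):
    """Split tables into those the new system must provide and obsolete candidates."""
    required = set()
    stack = list(active)
    while stack:
        table = stack.pop()
        if table in required:
            continue
        required.add(table)
        stack.extend(reads_from.get(table, []))  # follow dependencies upstream
    obsolete = set(reads_from) - required
    return sorted(required), sorted(obsolete)

required, obsolete = migration_plan(READS_FROM, ACTIVE_CONSUMERS)
```

A table landing in `obsolete` only means nothing in the recorded lineage depends on it; someone should still confirm that before deleting data.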

Data quality

Data lineage plays a significant role in data quality. Data engineers can explore how and where data transformations occur, identify changes and who made them, and verify that data is managed within the company’s governance policies.

When errors do occur, data engineers can follow, step by step, how and where the error entered the data lifecycle.

Data-driven decision-making

Without a data lineage system, business intelligence analysts and data scientists spend time on data discovery and processing tasks that would be better spent generating insights to support the business.

Data quality improvements give business users confidence in the datasets they plan to use. Data mapping and context speed the data discovery process, helping these stakeholders find the right data elements for their work.

As better data management improves analytical quality, the company’s leadership can make better, data-driven decisions.

The lifecycle of data: How lineage impacts data management

From one end of the data lifecycle to another, data lineage improves the quality and integrity of the data driving daily business decisions.

1. Data catalogs

Data systems generate metadata documenting every time a record moves or changes. This is the raw data that contributes to data lineage. A best practice for any end-to-end lineage solution is to use automated data catalogs to gather this metadata wherever it appears.

2. Data ingestion, data pipelines, ETL

As systems generate or ingest data, everything flows into data assets across the IT infrastructure. Extract, transform, load (ETL) and extract, load, transform (ELT) pipelines convert raw data, sample real-time data streams, and query databases to fill a data warehouse or data lake. At each stage, automated data lineage tools verify data moves from the right source to the right location in the right manner.

Column-level alterations in databases, differing transformation processes, and other variations can break ETL and ELT pipelines. Data lineage helps developers maintain these data pipelines by identifying changes that could compromise pipeline output.
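To illustrate how column-level lineage can flag a risky change before it breaks a pipeline, the sketch below compares two schema versions and then follows column-level edges to find every downstream column that depends on what changed. The column names, types, and edge map are hypothetical.

```python
from collections import deque

# Hypothetical column-level lineage: source column -> columns computed from it.
COLUMN_EDGES = {
    "orders.amount": ["orders_clean.amount_usd"],
    "orders.currency": ["orders_clean.amount_usd"],
    "orders_clean.amount_usd": ["daily_revenue.total_usd"],
    "orders.customer_id": ["orders_clean.customer_id"],
}

def schema_changes(old, new):
    """Columns dropped or retyped between two schema versions ({column: type})."""
    return sorted(col for col in old if new.get(col) != old[col])

def affected_columns(changed, edges):
    """Return every downstream column that depends on a changed column."""
    affected = set()
    queue = deque(changed)
    while queue:
        col = queue.popleft()
        for child in edges.get(col, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

old_schema = {"orders.amount": "decimal", "orders.currency": "char(3)", "orders.customer_id": "bigint"}
new_schema = {"orders.amount": "decimal", "orders.currency": "varchar", "orders.customer_id": "bigint"}

changed = schema_changes(old_schema, new_schema)
impacted = affected_columns(changed, COLUMN_EDGES)
```

Retyping `orders.currency` touches `orders_clean.amount_usd` and, through it, `daily_revenue.total_usd`, so the owners of both can be warned before the change ships.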

3. Data processing

Data integration and processing produce higher-order datasets. Whether these datasets are reliable or useful depends on the data quality entering the system. Likewise, the integrity and accuracy of the system’s output depend on whether the data is processed correctly in compliance with governance rules. Data lineage solutions provide useful checks to improve data processing quality at each step.

4. Data warehouse and data lake

Data lineage relieves many of the headaches IT teams suffer as they manage data warehouses and data lakes. Clear visibility into each data element’s provenance and context simplifies data mapping. As mentioned earlier, creating and maintaining the pipelines requires less effort. And data lineage supports compliance efforts by tracking where data goes and how it is used.

5. Querying of data

Whether it’s data scientists developing big data machine learning algorithms or business intelligence analysts running reports for department heads, every SQL query to the data warehouse could create a new dataset and extend the data’s lineage further.

Automated data lineage solutions ensure queries happen within governance and compliance policies. Should a violation occur, data lineage tools can facilitate audits by identifying the incident’s who, what, when, and how.
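As a toy illustration of how a single query extends lineage, the sketch below pulls source and target tables out of a simple CREATE TABLE ... AS SELECT statement and turns them into lineage edges. The regular expressions are deliberately naive and the table names are hypothetical; production tools rely on a real SQL parser, since patterns like these miss subqueries, common table expressions, and quoted identifiers.

```python
import re

# Naive patterns, for illustration only.
TARGET = re.compile(r"create\s+table\s+([\w.]+)\s+as\b", re.IGNORECASE)
SOURCES = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

def lineage_edges(sql):
    """Return (source, target) pairs for a simple CREATE TABLE ... AS SELECT."""
    target = TARGET.search(sql)
    if target is None:
        return []  # the query does not create a new dataset
    sources = sorted(set(SOURCES.findall(sql)))
    return [(source, target.group(1)) for source in sources]

query = """
CREATE TABLE mart.customer_orders AS
SELECT c.customer_id, SUM(o.amount) AS total
FROM warehouse.customers c
JOIN warehouse.orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id
"""

edges = lineage_edges(query)
```

Each returned pair is a new edge in the lineage graph: the derived table `mart.customer_orders` now depends on both warehouse tables.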

6. Data consumers

A company’s data exists in an interconnected system of systems. A change in one place can ripple upstream and downstream to impact stakeholders throughout the organization.

Data lineage solutions provide the visibility needed to conduct an impact analysis of a proposed update. This analysis documents how the initiative will affect data consumers. For example, downstream systems expect to receive data with specific formats. Lineage lets the IT team identify and account for these dependencies before implementing the updated system.
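One small piece of such an impact analysis can be sketched as a check of a proposed schema against what each downstream consumer expects to receive. The consumer names, columns, and types below are hypothetical, and real expectations would come from lineage metadata or data contracts rather than a hard-coded dictionary.

```python
# Hypothetical expectations: each consumer lists the columns and types it needs.
CONSUMER_EXPECTATIONS = {
    "finance_dashboard": {"order_id": "bigint", "amount": "decimal"},
    "churn_model": {"customer_id": "bigint", "signup_date": "date"},
}

def breaking_changes(proposed_schema, expectations):
    """Map each affected consumer to the columns the proposed schema would break."""
    problems = {}
    for consumer, needed in expectations.items():
        issues = []
        for column, expected_type in needed.items():
            actual = proposed_schema.get(column)
            if actual is None:
                issues.append(f"{column}: missing")
            elif actual != expected_type:
                issues.append(f"{column}: expected {expected_type}, got {actual}")
        if issues:
            problems[consumer] = issues
    return problems

# Proposed update: customer_id becomes a string and signup_date is dropped.
proposed = {"order_id": "bigint", "amount": "decimal", "customer_id": "varchar"}
report = breaking_changes(proposed, CONSUMER_EXPECTATIONS)
```

In this example only `churn_model` is affected, which tells the IT team exactly which stakeholder to involve before the update goes live.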

Better data yields better analysis for better decisions

Starburst Galaxy and the Starburst Enterprise Platform accelerate time to insight by unlocking the data once confined to silos in your organization. Our solution also supports your data governance, security, and compliance efforts. For example, integration with Apache Atlas lets you track sensitive data so you can implement appropriate controls wherever the data goes.

Discover how Starburst enables data-driven decisions by signing up for a free Starburst Galaxy starter account.