Data Quality

High-quality data powers many important analytics dashboards, data products, and AI/ML models, which is why the concept of “garbage in, garbage out” is more important than ever: real-world, data-driven insights are only as good as the data behind them.

In a recent report, Forbes and KPMG found that 84% of CEOs were concerned about the quality of the data on which they based their decisions. Today, data quality remains a top executive concern for those striving to build a data-first organization. The bottom line: with unreliable data, businesses in every industry, from financial services and healthcare to telecommunications and supply chain, have a far more difficult time deriving business value from their data. The truth is that while it’s easy to embrace the idea of being data-driven, it’s far more difficult to execute on that commitment and meet company key performance indicators (KPIs).

More often than not, when key stakeholders receive a report, they’ll raise concerns like these:

  • Data is missing from a critical report.
  • Is the data accurate?
  • The data wasn’t quite what they requested.
  • Does the data set include everything it needs?
  • There’s duplicate data.
  • The data isn’t consistent.
  • How recently has the data been updated?
  • Bad data: the numbers are inaccurate.
  • Does the data comply with the company’s formatting and other standards?

Experienced data executives and business leaders will only adopt data-driven solutions if they trust and have confidence in the data. What does this look like? They will likely review their dashboards and reports for consistency with other data sources. In essence, this close attention to detail is manual data validation.

    Let’s take a closer look at why data quality remains a top concern for many executives (and data engineers) today.

    Why is data quality important?

Data analysts, data scientists, and data engineers responsible for producing data-driven reports play an important role in data-driven organizations. Engineers build data pipelines, and analysts run queries to produce reports for business decision makers. Data quality is also essential to AI and machine learning initiatives.

As such, they’re managing and ensuring data integrity, trustworthiness, uniqueness, accessibility, and usability at every stage of the process, at scale: across complex data pipelines, before data ingestion, and during data analysis. Data quality powers many dependencies and even determines the health of your data lifecycle. Single-use data typically has a shorter lifecycle: it’s marked as stale more quickly, which eventually leads to it being archived or deleted. To prolong the life of data, increase its reusability, which is itself a sign of trustworthiness, confidence, and quality.

    Data quality management challenges for data engineers

Data engineers face constant challenges driven by the rate at which their organizations generate and acquire data. They must scale infrastructure to keep pace and curate an ever-growing catalog of ETL pipelines. Crucially, managing data quality across its multiple dimensions presents data engineers with a double challenge.

First, data quality is complex. Data engineers must understand a data element’s domain context as well as how data consumers will use it. They must conduct data quality assessments across the five dimensions of data quality. In addition, engineers must consider governance topics such as access control, data privacy, regulation, and security, which are becoming critical data quality criteria. Once data profiling is complete, data cleansing makes the data fit for purpose before it’s made accessible to data consumers.
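As an illustration, here is a minimal data profiling sketch in SQL. It assumes a hypothetical sales.orders table with order_id, customer_id, order_total, and updated_at columns; the exact columns and thresholds would come from the data element’s domain context rather than from any fixed standard.

```sql
-- Minimal profiling sketch for a hypothetical sales.orders table.
-- Column names and checks are illustrative assumptions, not a fixed standard.
SELECT
  count(*)                                         AS row_count,
  count(*) - count(customer_id)                    AS missing_customer_ids, -- completeness
  count(*) - count(DISTINCT order_id)              AS duplicate_order_ids,  -- uniqueness
  sum(CASE WHEN order_total < 0 THEN 1 ELSE 0 END) AS negative_totals,      -- accuracy
  max(updated_at)                                  AS most_recent_update    -- timeliness
FROM sales.orders;
```

Profiling output like this feeds the assessment: any non-zero counts in the middle columns flag records that need cleansing before the data is published to consumers.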

    The second challenge is that data quality management is mission-critical. With enterprises adopting data-driven cultures, the quality of the company’s data determines the quality of business processes and decision-making throughout the organization. Data consumers must know, without a shadow of a doubt, that they can rely on the data they use. So the pressure is on data engineers to get things right.

    How data governance and data quality work together

    As companies adopt distributed data architectures, domain owners take responsibility for the quality of the data sources they maintain. Each domain has the closest understanding of the data, what quality means in their context, and how best to manage data quality.

Data governance ensures the domains follow consistent data quality policies. Governance standards won’t dictate how a domain handles data entry, maintains data records, or keeps a data set current, but they will require the domain to measure and report timeliness so users in other domains understand how to use the data.
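For example, a domain might publish a simple freshness metric like the sketch below. It assumes a hypothetical sales.orders table with an updated_at timestamp and uses Trino/Starburst SQL syntax; the table, column, and reporting unit are illustrative.

```sql
-- Hypothetical freshness metric: how stale is the newest record?
-- Consumers in other domains can use this to judge whether the data
-- is current enough for their purposes.
SELECT
  max(updated_at)                                        AS last_update,
  date_diff('hour', max(updated_at), current_timestamp)  AS hours_since_update
FROM sales.orders;
```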

    What are the 5 factors of data quality?

    Every organization must develop its own criteria for what data quality means as part of the governance process to prevent data quality issues. Generally, these can be categorized into the following five qualities of data: accuracy, completeness, reliability, relevance, and timeliness.

    #1 Accuracy: data that can be used as a reliable source of information

Good data quality entails reliable and accurate information. Accurate data limits bad data: it isn’t riddled with errors and won’t be misinterpreted. The information is delivered clearly and completely, so users won’t mishandle it.

    #2 Completeness: data that is available to extract maximum value

Ensure that information is comprehensive, accessible, and available to employees. With inconsistent or inaccurate data, organizations can’t use the information to its fullest potential.

#3 Reliability: data that doesn’t contradict other sources, minimizing the margin of error

Data shouldn’t contradict other data sources. If overlapping sources conflict, they create inefficiencies or, worse, make becoming data-driven far more difficult.
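One way to surface such contradictions is a cross-source reconciliation query. The sketch below assumes two hypothetical sources of the same customer data (a data lake table and an operational database) exposed as separate catalogs in a federated query engine such as Starburst; all catalog, schema, table, and column names are illustrative.

```sql
-- Hypothetical reconciliation: compare daily customer counts between two sources.
-- Any rows returned represent days on which the sources contradict each other.
WITH warehouse AS (
  SELECT signup_date, count(*) AS customers
  FROM lake.crm.customers
  GROUP BY signup_date
),
operational AS (
  SELECT signup_date, count(*) AS customers
  FROM postgres.public.customers
  GROUP BY signup_date
)
SELECT
  w.signup_date,
  w.customers AS warehouse_customers,
  o.customers AS operational_customers
FROM warehouse w
JOIN operational o ON w.signup_date = o.signup_date
WHERE w.customers <> o.customers;
```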

    #4 Relevance: data that is applicable to the organization’s needs and objectives

Understand the objective and significance of the data being collected. Collecting relevant information decreases costs and increases profitability for organizations, while irrelevant data only adds noise.

    #5 Timeliness: up-to-date data generates relevant data-driven reports and applications

    Maintaining updated data is critical for modern applications and reporting. Obsolete data makes data-driven reports untrustworthy and unreliable.

    3 Current data management trends bring renewed interest in data quality

    Enterprises are outgrowing the limitations of traditional data warehouses as the era of Big Data accelerates. To optimize their complex data environment, these companies are adopting new data management concepts such as the data mesh, data products, and modern data lakes.

    Data quality becomes even more critical as these trends take hold. Let’s look at each of these trends, the relevance of data quality, and the role of Starburst’s data lake analytics platform.

    #1 Data Mesh empowers data owners and data stewards with high-quality data

Data Mesh is a decentralized, sociotechnical approach to managing and accessing data at scale. Rather than relying on data silos, the organizational culture shifts toward data management supervised by domain owners, since they are closest to the business. Domain and data product owners oversee data quality, so data quality is no longer an isolated, independent process; it’s nurtured by domain owners and upvoted by other data consumers because they have confidence in the data product.

The aspect of Data Mesh most relevant to data quality is its fourth pillar: federated computational governance. When shifting from a centralized to a decentralized architecture, governance is imperative to balance domain autonomy with global requirements. Governance ensures that data is given to users on a need-to-know basis, so the data complies with required regulations while remaining high quality, reliable, and accessible to authorized end users.

Data Mesh distributes this responsibility through the federated governance model by applying global policies across the domains. Computational governance is the automated enforcement that maintains the balance between interoperability and global standards: each domain adheres to a shared set of practices and standards to apply governance.
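As a simplified illustration, a global access policy might be expressed as role-based grants that every domain applies to its data products. The sketch below uses SQL role and grant statements of the kind supported by engines such as Trino/Starburst; the role and table names are hypothetical, and real deployments often manage these policies through a dedicated access control system instead.

```sql
-- Hypothetical global policy: analysts may read published data products,
-- but only domain stewards may modify them.
CREATE ROLE analyst;
CREATE ROLE sales_steward;

GRANT SELECT ON sales.orders TO ROLE analyst;                        -- need-to-know read access
GRANT SELECT, INSERT, UPDATE, DELETE ON sales.orders TO ROLE sales_steward;
```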

    Modern technologies like Starburst’s analytics platform support the Data Mesh’s federated architecture. For example, by creating a single point of access to enterprise data, Starburst eliminates data duplication — a critical goal of many governance systems concerned with customer data and regulatory compliance. Standards-based integrations with lineage products, data catalogs, and other governance tools let data teams use Starburst to manage their governance efforts.

    #2 Benefits of a modern data lake (analytics) and how it can improve data quality

    While the data mesh is still an emerging architecture, data lakes are much further along the adoption curve. Data lakes eliminate many of the costs and frustrations that come with data warehouse management. They serve as a single repository for all structured and unstructured data the company ingests. The data lake simply stores raw data without any consideration for how the data will be used. This philosophy implies there’s no need to clean, transform, or otherwise process the incoming data — all that gets addressed in the pipelines feeding data assets.

    In practice, data quality always matters. A data lake generates several benefits that ultimately depend on the quality of its data, including:

    Single point of access

    Data consumers in every domain rely on the data lake to conduct their analysis. The more data must be pre-processed, the longer it takes to get the insights they need.

    Single point of governance

    Data lakes centralize data collection while decentralizing data analysis. Addressing quality at the edge requires enforcing governance within data products or as consumers use the product. Addressing quality at ingestion reduces governance overhead.

    Advanced capabilities

Data products can deliver warehouse-like capabilities by drawing directly from the data lake. Building those capabilities on poor data quality (i.e., a data swamp) imposes a development and maintenance burden on data teams and their consumers.

    Self-serve data lake analysis

Another advantage of filling data lakes with high-quality data is the ability to push self-serve analysis throughout the organization. Business intelligence analysts are not data engineers. They can’t cleanse data themselves, so they must trust that the data they use meets the company’s data quality standards. Engineers must implement data quality measures whenever they ingest new data into the lake.

Starburst’s SQL query tools let data consumers tap into the data lake directly, without the data team’s help. Drawing on a quality data lake, these analysts can speed time to insight and support more agile, effective business decisions.
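As an example of what self-serve analysis can look like, here is a sketch of a single federated query an analyst might run through Starburst’s SQL engine, joining lake data with an operational database without first building a pipeline; all catalog, schema, table, and column names are hypothetical.

```sql
-- Hypothetical self-serve analysis: revenue by customer segment,
-- joining data-lake order history with CRM data at its source.
SELECT
  c.segment,
  sum(o.order_total) AS revenue
FROM lake.sales.orders AS o
JOIN postgres.crm.customers AS c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01'
GROUP BY c.segment
ORDER BY revenue DESC;
```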

    #3 Data products, the future of data consumption

Whether or not an organization is ready for the advanced concepts of data lakes and data meshes, treating data as a product is a trend with universal appeal.

    When data engineers and data scientists adopt a product management mindset, they focus on their customers to create data products that meet particular use cases. Data teams get input from these stakeholders to define the data product’s specifications and development schedule. Then they create standards-based designs that support a self-service usage model, allowing people to repeatedly use the data product without help from the data team.

    Traditionally, much of the work that goes into designing a data asset is spent searching for, transforming, and consolidating various data sources into a data set specific to that asset.

Starburst streamlines that process. By directly accessing data on demand, it eliminates the ELT pipelines and other infrastructure that data teams spend so much time maintaining. Rather than creating a one-off data set that duplicates data, Starburst lets you leave data at its source and query it closer to real time.

    Starburst also makes it easier for data consumers to discover and use their company’s data products thanks to its built-in workflows, consumption layer, self-service platform, and SQL query engine.