Companies use reference data, also called static or lookup data, to provide context through categorization. A common set of values, identifiers, and semantic conventions simplifies data sharing by ensuring everyone looks at data the same way.
Examples of reference data
Before continuing, here are a few examples that should help clarify what types of data we’re talking about.
Externally defined reference data
Wherever possible, companies will adopt reference datasets defined by government agencies or standards-setting bodies.
Examples of externally defined reference data include:
- Country codes
- Postal codes
- Industry codes
Externally defined reference data facilitates working with other organizations by allowing systems to exchange similarly coded data. This standardization minimizes risks created when an organization must interpret and convert data it receives from another organization.
For instance, a manufacturer sending shipping information to its logistics provider needs to know that its products destined for Finland aren’t going to Fiji. The two companies eliminate the risk of miscommunication and expensive shipping errors by using the country codes defined in the ISO 3166-1 alpha-2 standard.
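As a rough illustration, the sketch below (in Python, with a hypothetical `validate_destination` helper and only a handful of ISO 3166-1 alpha-2 codes) shows how a shipping system might check destinations against a shared country-code reference set.

```python
# Hypothetical excerpt of an ISO 3166-1 alpha-2 reference dataset.
# A real system would load the full, officially maintained list.
ISO_COUNTRY_CODES = {
    "FI": "Finland",
    "FJ": "Fiji",
    "DE": "Germany",
    "US": "United States",
}

def validate_destination(country_code: str) -> str:
    """Reject shipment records whose destination is not a known ISO code."""
    code = country_code.strip().upper()
    if code not in ISO_COUNTRY_CODES:
        raise ValueError(f"Unknown country code: {country_code!r}")
    return code

# Both the manufacturer and the logistics provider resolve "FI" to Finland,
# so a shipment bound for Finland is never routed to Fiji ("FJ").
print(ISO_COUNTRY_CODES[validate_destination("fi")])  # Finland
```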
Internally defined reference data
Some types of reference data only make sense within a company. These fields tend to describe the company’s organization and how it structures its data for decision-making.
Examples of internally defined reference data include:
- Transaction codes
- Product codes
- Financial hierarchies and cost centers
- Location identifiers
- Customer segments
- Pricing
Consider retailers that source products from multiple suppliers. Each supplier has unique approaches to model numbering. A retailer will define product codes, called stock-keeping units (SKUs), that uniquely identify a product regardless of its source. SKUs let the retailer’s financial, e-commerce, store, logistics, and support operations look at the same products in the same way.
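A minimal sketch of what that internal reference data might look like appears below; the SKUs, supplier names, and model numbers are invented for illustration.

```python
# Hypothetical internal SKU reference data: each retailer-defined SKU maps a
# supplier's own model number to one canonical identifier shared by finance,
# e-commerce, stores, logistics, and support.
SKU_REFERENCE = {
    "SKU-100234": {"supplier": "Acme Audio", "supplier_model": "AX-9 Rev B",
                   "description": "Wireless headphones, black"},
    "SKU-100235": {"supplier": "Bolt Audio", "supplier_model": "BT-HP-01",
                   "description": "Wireless headphones, black"},
}

def sku_for_supplier_model(supplier: str, model: str) -> str:
    """Translate a supplier's model number into the retailer's SKU."""
    for sku, record in SKU_REFERENCE.items():
        if record["supplier"] == supplier and record["supplier_model"] == model:
            return sku
    raise KeyError(f"No SKU registered for {supplier} model {model}")

print(sku_for_supplier_model("Acme Audio", "AX-9 Rev B"))  # SKU-100234
```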
Static vs. dynamic reference data
Reference data becomes the basis for pipelines and other data applications that benefit from categorization’s consistency and interoperability. Ideally, these categories remain static because any change can ripple through downstream applications. For example, campaign automation and other marketing processes that depend on reference datasets for customer segmentation may require extensive reworking should the company change its segmentation strategy.
Postal codes and other externally defined reference datasets tend not to change from year to year. On the other hand, internally defined reference data may be more dynamic. A company with an active merger and acquisition strategy, for example, would see frequent changes to location identifiers and other reference data due to organizational restructuring.
As a result, managing reference data is never a one-and-done process.
Terms related to reference data
Reference data management (RDM)
Reference data management is the practice of defining and maintaining reference datasets. Although relatively static compared to business data, reference data’s fundamental role means any changes produce disproportionate impacts. Besides potentially disrupting business workflows, inaccurate reference data could compromise data governance and compliance.
Reference data stewards and custodians
A data steward is responsible for deciding what reference data their company should use, choosing the appropriate values for a reference dataset, and defining how the organization should use it.
The data custodian is an information technology role responsible for safely storing and transferring reference datasets.
Mappings
Mapping one reference dataset to another can improve operations. For example, e-commerce companies shouldn’t expect consumers to know that BA is the ISO country code for Bosnia and Herzegovina, not Barbados. Populating drop-down lists with a secondary reference dataset of human-readable country names mapped to ISO codes creates a better user experience.
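Here is one way that mapping might look in practice, sketched in Python with an illustrative subset of codes and a hypothetical `dropdown_options` helper.

```python
# Primary reference dataset: ISO 3166-1 alpha-2 codes (illustrative subset).
# Secondary reference dataset: human-readable display names mapped to those codes.
COUNTRY_DISPLAY_NAMES = {
    "BA": "Bosnia and Herzegovina",
    "BB": "Barbados",
    "FI": "Finland",
}

def dropdown_options() -> list[tuple[str, str]]:
    """Build (value, label) pairs for a checkout country selector,
    sorted by the label shoppers actually read."""
    return sorted(COUNTRY_DISPLAY_NAMES.items(), key=lambda item: item[1])

for code, name in dropdown_options():
    print(f'<option value="{code}">{name}</option>')
```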
Private reference data
In the wrong hands, some internally defined reference datasets can reveal insights into a company’s structure and processes that would inform cyberattacks or industrial espionage. Classifying these datasets as “private” allows for more rigorous data protections. Levels of classification let security teams distinguish between internal-only datasets, datasets shareable with third parties, and public datasets.
Master data management vs. reference data
Master data is another way to bring context to business data, in this case by describing business entities such as customers or suppliers.
Unlike reference data, master data is dynamic, but its changes do not directly impact business processes. For example, adding a new supplier does not change supplier management workflows, while recategorizing suppliers would affect reporting and other data applications.
Like reference data management, master data management (MDM) encourages consistency and accuracy in business activities. For example, all departments should reference the supplier master data rather than maintaining their own supplier databases.
Transactional data vs. reference data
Transactional data describes exchanges and events. Invoices, payments, network activity, and temperature readings are a few examples of the many business transactions companies record.
Online transactional processing (OLTP) systems will use reference datasets in their validation workflows. For example, a financial institution’s reference datasets would define acceptable transaction sizes for money transfers, allowing OLTP systems to flag violations for further investigation.
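The sketch below illustrates the idea; the transaction codes and limits are hypothetical rather than taken from any real institution’s reference data.

```python
# Hypothetical reference dataset of acceptable transfer sizes by transaction code.
# In a real OLTP system this would live in a governed reference table.
TRANSFER_LIMITS = {
    "DOMESTIC_WIRE":      {"min": 0.01, "max": 250_000.00},
    "INTERNATIONAL_WIRE": {"min": 0.01, "max": 100_000.00},
}

def flag_for_review(transaction_code: str, amount: float) -> bool:
    """Return True when a transfer falls outside the reference limits
    and should be routed for further investigation."""
    limits = TRANSFER_LIMITS.get(transaction_code)
    if limits is None:
        return True  # unknown transaction codes are always investigated
    return not (limits["min"] <= amount <= limits["max"])

print(flag_for_review("DOMESTIC_WIRE", 5_000.00))        # False
print(flag_for_review("INTERNATIONAL_WIRE", 2_000_000))  # True
```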
Metadata vs. reference data
Reference data helps categorize business data, while metadata describes business data. Examples of metadata include table sizes, maximum values, provenance, and lineage.
Metadata management ensures the consistency and quality of the company’s metadata. Query engines, for example, rely on metadata to improve performance, especially when operating at the scale of big data.
Analytical data vs. reference data
Reference data brings deeper context to analytics by making it easier to connect disparate data sources. Because every domain, geography, and business function will use one set of common reference data values, combining multiple datasets requires less processing and fewer assumptions. Business intelligence analysts can uncover deeper insights faster to support data-driven decision-making.
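For example, if two domains report against the same customer-segment codes, combining their figures is a simple join. The sketch below assumes pandas is available and uses invented segment codes and revenue numbers.

```python
import pandas as pd

# Two domains report against the same customer-segment reference codes,
# so combining them is a join rather than a reconciliation project.
ecommerce = pd.DataFrame({"segment_code": ["ENT", "SMB"], "online_revenue": [1_200_000, 300_000]})
retail    = pd.DataFrame({"segment_code": ["ENT", "SMB"], "store_revenue":  [800_000, 450_000]})
segments  = pd.DataFrame({"segment_code": ["ENT", "SMB"], "segment_name": ["Enterprise", "Small business"]})

combined = ecommerce.merge(retail, on="segment_code").merge(segments, on="segment_code")
print(combined[["segment_name", "online_revenue", "store_revenue"]])
```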
Reference data and data integration
Data integration becomes a growing challenge as companies adopt increasingly distributed information architectures. Data teams in centralized architectures own every data source. They control design, implementation, and governance. Managed well, a centralized architecture uses reference data to streamline the integration of different datasets.
In distributed architectures, data ownership shifts to domains and external service providers where design and governance decisions get made. Two closely related datasets living in different parts of the organization will incorporate different assumptions, quality standards, and other factors. The central data team no longer has the domain expertise to understand these sources. As a result, integration becomes a more protracted process requiring the participation of multiple groups.
Data management and governance best practices can reduce this friction by, for instance, creating universally accepted reference datasets. As long as domains use reference data consistently, data engineers can develop more effective ETL pipelines that evaluate and transform data before it lands in a data warehouse. However, decentralized ownership still imposes coordination penalties that throttle analytics.
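A transform step in such a pipeline might look something like the following sketch, where the segment codes, aliases, and field names are illustrative.

```python
# Sketch of an ETL transform step that enforces a shared reference dataset
# before records land in the warehouse.
VALID_SEGMENTS = {"ENT", "SMB", "CONSUMER"}
SEGMENT_ALIASES = {"ENTERPRISE": "ENT", "SMALL_BUSINESS": "SMB"}  # legacy domain values

def transform(record: dict) -> dict:
    """Normalize a domain's segment value to the governed reference code,
    rejecting records that cannot be mapped."""
    raw = record["customer_segment"].strip().upper()
    code = SEGMENT_ALIASES.get(raw, raw)
    if code not in VALID_SEGMENTS:
        raise ValueError(f"Unrecognized customer segment: {record['customer_segment']!r}")
    return {**record, "customer_segment": code}

print(transform({"customer_id": 42, "customer_segment": "enterprise"}))
# {'customer_id': 42, 'customer_segment': 'ENT'}
```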
Managing reference data with data virtualization
Data virtualization underpins a new approach to big data analytics — the modern data lakehouse. Recognizing that a single repository can never contain all the data a company needs to analyze, the modern data lakehouse connects and abstracts all enterprise data sources within a unified virtual access layer.
Virtualization eliminates the friction imposed by integration pipelines in distributed data architectures. Instead of consolidating data into a warehouse for future analysis, the modern data lakehouse integrates data at query runtime. From the user’s perspective, all the data appears to be in one place even though it resides on systems distributed across the enterprise’s IT infrastructure.
Starburst’s modern data lakehouse analytics solution virtualizes data infrastructure and makes it accessible through a single interface without requiring extensive data movement or replication. Building on Trino’s open-source SQL query engine, Starburst queries can pull data from any source or push down queries to run where the data lives, thus balancing performance and cost.
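As a rough illustration of query-time federation, the sketch below uses the open-source trino Python client; the coordinator address, catalogs, schemas, and table names are placeholders and would differ in a real deployment.

```python
import trino  # open-source Trino Python client (pip install trino)

# Connect to a Trino/Starburst coordinator; host, port, and user are placeholders.
conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# A single federated query joins warehouse sales (hive catalog) with a governed
# country-code reference table (postgresql catalog) at query runtime, with no
# pipeline copying either dataset first. Catalog and table names are assumptions.
cur.execute("""
    SELECT c.country_name, SUM(s.order_total) AS revenue
    FROM hive.sales.orders AS s
    JOIN postgresql.reference.country_codes AS c
      ON s.country_code = c.iso_alpha2
    GROUP BY c.country_name
""")
for row in cur.fetchall():
    print(row)
```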