
Single Source of Truth (SSOT)

A single source of truth (SSOT) is a centralized location of master data for an organization’s decision-making processes. Theoretically, a data warehouse or other repository would create consistency and understanding based on a canonical set of data values.

In practice, the diversity and volume of enterprise data make achieving an SSOT impossible.

For decades, organizations have sought to create a single source of truth that can feed spreadsheets, dashboards, and business intelligence apps to make better, more consistent, data-driven decisions. However, that goal has proven elusive. This article will examine SSOT’s idealistic goals and harsh realities as well as how federating data architectures into a single source of access comes closest to that unified data paradise.

SSOT in Data Management

An SSOT’s value to analytics is an intuitive concept. When stakeholders across the organization use the same piece of information, fully understanding its provenance and underlying assumptions, they are less likely to miscommunicate. For example, Sales, Finance, and Compensation teams must use the same data points to calculate performance metrics, report revenue, and pay commissions.

A single source of master data would benefit other areas of the business, particularly data management.

Data Quality and Accuracy

In theory, an SSOT simplifies data quality management. Data integration pipelines only need to touch data once to address discrepancies, inconsistencies, and other quality issues. Clean, accurate, and consistent data reduces engineering teams’ workloads. For example, when the SSOT’s data formats and values meet users’ expectations, engineers spend little time developing additional processing pipelines.

In practice, ingestion processes are more complex and users’ expectations are less consistent. Data sources can change schemas without notice, forcing engineering teams to spend considerable time maintaining pipelines into the data warehouse. Data consumers’ needs invariably demand additional processing, adding to engineers’ already busy schedules. Furthermore, meeting data quality standards takes time, so decision-makers often end up working with stale, outdated data.
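To make this concrete, a pipeline’s quality gate often comes down to a handful of SQL checks run after each load. The sketch below shows what such a check might look like in Trino-style SQL; the catalog, schema, table, and column names are illustrative assumptions, not part of any particular deployment.

```sql
-- Hypothetical data quality checks run after ingesting an orders table.
-- Catalog, schema, table, and column names are illustrative.
SELECT
    count(*)                            AS total_rows,
    count(*) - count(order_id)          AS missing_order_ids,
    count(*) - count(DISTINCT order_id) AS duplicate_order_ids,
    count_if(order_total < 0)           AS negative_totals
FROM warehouse.sales.orders;
```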

Centralized Data Management

Centralizing data management within an SSOT should simplify workflows by creating a standard reference for all users and applications in the organization. The complexities of heterogeneous data architectures get addressed at ingestion. Once in the SSOT, data is ready to use with a shared understanding of provenance, lineage, formatting, and other properties. This reliability would let data teams apply API-based software development practices and automations to streamline data flows throughout the organization.

However, this consolidation has never been possible. The predefined schemas of data warehouses worked when most data sources produced structured data, but business insights are now more likely to come from analyses of unstructured data that warehouses cannot handle. Data lakes promised to hold any type of data at internet scale, but enterprise data’s volume, velocity, and variety make storing everything in one place cost-prohibitive.

As a result, developing and maintaining complex pipelines continues to be as much an art as a science. Engineers must correctly interpret data consumers’ requirements (assuming, of course, consumers truly know their requirements) and then build models to blend data from the warehouse and other sources. This labor-intensive process iterates as consumers refine their requirements in response to each dataset.

Regulatory Compliance

Improved data governance and regulatory compliance is another SSOT promise that never quite pans out. A central repository for all data allows for a central governance regime controlling access. An SSOT would clearly define the location, sensitivity, and regulatory exposure of every piece of information it stores. Governance systems would then assign permissions limiting data access to only authorized users. For example, governance policies would apply GDPR-compliant privacy policies to customer data collected in the European Union.

An SSOT would also simplify compliance monitoring and auditing since data would be accessible in only one system that generates access and activity logs to describe the organization’s compliance status.

In reality, governance and compliance are not that easy. Data ownership in mid-to-large enterprises resides within central data teams, business domains, and geographical regions. Organizational resistance, geographical limits, and performance constraints work against monolithic systems. Real-world governance requires coordination between centralized and decentralized structures.

Centralized and decentralized data strategies can co-exist

Although the platonic ideal of a single source of truth is impossible to achieve, the benefits it promises are worth pursuing. Strategies can balance the advantages of centralized data with the realities of decentralized infrastructure and ownership to streamline workflows, optimize data engineering practices, and enhance analytics.

Data warehouses vs. data lakes

Most companies’ first attempt to create a single source of truth is a data warehouse. As discussed earlier, warehouses cannot store the variety of data businesses today use to drive decision-making. The warehouse’s inflexible predefined schemas resist evolving with changing business priorities. Finally, many data warehouse implementations depend on proprietary solutions that lock companies into expensive, difficult-to-scale technologies.

Data lakes promised to address data warehouse weaknesses by creating a central repository of all types of data running on commodity cloud storage services. However, the data lake’s rudimentary metadata and lack of analytics required a patchwork of additional technologies. Secondary data warehouses would copy data from the data lake to make analytics accessible — creating multiple sources of “truth” in the process. 

Data meshes decentralize data and governance

An alternate path is the data mesh, a relatively new architecture that pushes data ownership outwards to the business domains. These domains treat their data as products for use by the rest of the organization. Published APIs make the domain’s data discoverable and document properties like lineage and quality. A self-serve model lets data consumers integrate these APIs to query data where it lives without complex integration pipelines.

Governance in a data mesh occurs through an automated, federated model. Central governance teams and domain owners share policy creation responsibilities. Enforcement happens where data lives in the domain as users access data through the API.

Implementing data mesh initiatives requires challenging transformations, which makes buy-in from all stakeholders essential. Furthermore, data meshes are unique to an organization and need significant custom development to implement.

Open data lakehouse architecture

An open data lakehouse is a more straightforward way of balancing centralized and decentralized data. This architecture combines the flexibility and scalability of a data lake with the analytics performance of a data warehouse and the federated structure of a data mesh.

Running on the cloud-based object storage of a service like Microsoft Azure Blob Storage, open data lakehouses leverage the synergy between Apache Iceberg’s open table format and Trino’s open-source massively parallel processing SQL query engine. Iceberg provides the metadata Trino needs to run complex queries at petabyte scales. In addition, Trino’s connectors integrate disparate enterprise data sources into a federated architecture of centralized and decentralized data.
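As a rough illustration of how the pieces fit together, the statement below sketches an Iceberg table defined through Trino’s Iceberg connector. The catalog, schema, and column names are assumptions for the example; the format and partitioning transforms shown are standard Iceberg table options.

```sql
-- A minimal sketch of an Iceberg table in a lakehouse catalog backed by
-- cloud object storage. Catalog, schema, and column names are illustrative.
CREATE TABLE lakehouse.web.page_views (
    view_time  timestamp(6) with time zone,
    user_id    bigint,
    url        varchar,
    region     varchar
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(view_time)', 'region']
);
```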

Open data lakehouses give engineering teams the best of both worlds. They can consolidate critical, frequently accessed data into the lakehouse’s object storage. At the same time, all other data remains instantly accessible where it lives in the wider data landscape.

Here’s how Trino and Starburst establish the lakehouse as the single source of truth (SSOT)

Starburst’s Trino-based open data lakehouse platform becomes a virtual single source of truth for data across the enterprise.

Data Integration

Starburst’s enhanced Trino connectors let companies integrate enterprise data sources like relational databases, data lakes, warehouses, and data streams within a virtual access layer. Data silos vanish as once-isolated systems become visible and queryable through that single layer.

Schema discovery and other automated features generate the metadata needed to create a consistent user experience and make data discoverable through Starburst’s unified interface.
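As a hedged sketch of what that unified interface looks like in practice, the statements below browse federated sources with standard Trino commands; the catalog, schema, and table names are hypothetical.

```sql
-- Browsing federated sources through one SQL interface.
-- Catalog, schema, and table names are hypothetical.
SHOW CATALOGS;
SHOW SCHEMAS FROM crm_postgres;
SHOW TABLES FROM crm_postgres.public;
DESCRIBE crm_postgres.public.customers;
```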

Near Real-Time Queries

From fraud detection to patient monitoring to cybersecurity, the accelerating speed of business decision-making requires nearly instant access to data. The volume of small, unstructured data files streaming from real-time systems overwhelms traditional batch-based ETL pipelines.

However, Starburst can ingest streaming data into an open data lakehouse. Streaming ingestion makes data from Kafka-compliant topics accessible to near real-time applications and lets users query the most up-to-date operational data.
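Assuming a catalog that exposes a Kafka topic as a table, with the message schema already mapped to columns, a near real-time query might look like the sketch below; all names are illustrative.

```sql
-- Hypothetical near real-time query over a Kafka-backed table.
-- Catalog, schema, table, and column names are illustrative and assume
-- the topic's message schema has been mapped to columns.
SELECT event_time, device_id, status
FROM kafka.telemetry.device_events
WHERE event_time > current_timestamp - INTERVAL '5' MINUTE
ORDER BY event_time DESC;
```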

Federated Queries

By integrating a company’s disparate data sources, Starburst’s connectors let users run a single query that returns data from multiple systems. Federated queries simplify data ingestion pipelines and reduce engineering teams’ maintenance burdens.
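A federated query can be as simple as a join across catalogs. The sketch below assumes a lakehouse catalog holding orders and a PostgreSQL catalog holding customers; the names are illustrative.

```sql
-- One query, two systems: joining lakehouse orders with an operational
-- customer database. Catalog, schema, and table names are illustrative.
SELECT
    c.region,
    sum(o.order_total) AS revenue
FROM lakehouse.sales.orders AS o
JOIN crm_postgres.public.customers AS c
    ON o.customer_id = c.customer_id
GROUP BY c.region
ORDER BY revenue DESC;
```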

Thanks to the accessibility of Trino’s ANSI-standard query engine, Starburst lets users access data from varied sources with the tools they already know. They no longer need engineering help for every analytics project, which frees engineers to support more complex work.

Data Governance and Security

Even though the open data lakehouse leaves much of the data in other systems, Starburst’s virtual access layer and rich metadata let data teams implement fine-grained access controls that enforce governance and regulatory compliance across the organization. For example, European customer data can remain on European servers, only authorized users in Europe can access an individual’s protected data, and users in the United States may receive only aggregated data.
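Exact mechanisms vary, and Starburst also offers built-in role-based policies, row filters, and column masks, but one hedged way to express the aggregation rule in plain SQL is a restricted view like the sketch below; every name, including the role, is hypothetical.

```sql
-- Hypothetical aggregated view for users outside the EU. Individual-level
-- records stay behind the view; catalog, table, and role names are illustrative.
CREATE VIEW lakehouse.reporting.eu_customer_summary AS
SELECT country, count(*) AS customer_count
FROM eu_postgres.public.customers
GROUP BY country;

GRANT SELECT ON lakehouse.reporting.eu_customer_summary TO ROLE us_analyst;
```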

Performance and Scalability

A Starburst open data lakehouse decouples compute from storage, making data architectures easier to scale while balancing performance and cost. Federation cuts storage costs by minimizing data movement and duplication. Significant amounts of data remain at the source, making object storage expansion more predictable.

Starburst’s performance optimizations accelerate complex queries and reduce network overhead. Pushdown and other features run queries where the data lives to minimize data motion. Features like smart indexing and materialized views reduce the compute demands of everyday data use.
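For instance, a frequently run aggregate can be precomputed with a materialized view. The sketch below uses Trino’s materialized view syntax against hypothetical lakehouse tables; whether it can be created depends on the underlying catalog (the Iceberg connector supports it).

```sql
-- Hedged example: precomputing a daily revenue aggregate so routine
-- dashboards avoid rescanning raw orders. Names are illustrative.
CREATE MATERIALIZED VIEW lakehouse.analytics.daily_revenue AS
SELECT
    date_trunc('day', order_time) AS order_day,
    sum(order_total)              AS revenue
FROM lakehouse.sales.orders
GROUP BY 1;
```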
