Centralized Data

Centralized data consolidates information from every source into a single repository that becomes the company’s single source of truth.

What is the difference between centralized data and distributed data?

Centralized data is the antithesis of distributed data. The former funnels data from every source into one location, while the latter leaves data at the source.

Centralized models grant corporate data teams authority over data management. In decentralized models, corporate groups set strategies and policies but leave data management to the lines of business (or domains).

A centralized repository is supposed to be the organization’s authoritative information source, but users must go through data teams for most requests. Decentralized systems let business users request access directly from the line of business that knows the data best and understands what it can be used for.

Centralized and decentralized data strategies can co-exist

Centralize your most critical, everyday analytics data: the data behind your daily dashboards and reports. These workloads don’t change often, so it makes sense to build pipelines that pull this data together reliably and cheaply.

Decentralize everything else. Why keep moving bits around when your boss only asks for something remotely related once in a blue moon? You’re better off saving time, effort, and cost without sacrificing access (if you have the right tool).

That’s the beauty of an open data lakehouse and the topic Justin Borgman discussed with Eric Kavanagh on the latest episode of the Inside Analysis podcast. They covered a few other hot topics:

  1. The data landscape is never going to get simpler. As Matt Turck can attest, the data landscape gets more complicated every year; a pure centralization strategy would require ever more complex data pipelines ad infinitum.
  2. Keeping your architecture open allows you to stay agile and adaptable to the next big innovation (see point 1 above).
  3. Trino (and Starburst) gives you what I call the “top half of the database” — just the powerful SQL engine without the underlying storage, so you can run it on whatever data you want.
  4. For all the centralized stuff, we think Iceberg has established itself as the industry standard, and Galaxy is the easiest way to get started with an Icehouse architecture (a minimal sketch follows this list).
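
To make point 4 concrete, here is a minimal sketch of the Icehouse pattern in Trino SQL: the engine creates an Iceberg table in a lake catalog and populates it from another catalog. The catalog, schema, and table names are hypothetical assumptions, not a prescribed setup, and the example assumes an Iceberg-backed catalog is already configured.

```sql
-- Hypothetical Icehouse sketch: Trino SQL over an Iceberg-backed lake catalog.
CREATE SCHEMA IF NOT EXISTS iceberg_lake.analytics;

CREATE TABLE iceberg_lake.analytics.daily_revenue (
    report_date DATE,
    region      VARCHAR,
    revenue     DECIMAL(18, 2)
)
WITH (partitioning = ARRAY['report_date']);

-- Populate the Iceberg table from a source exposed through another catalog.
-- This illustrates the "top half of the database": one SQL engine reads the
-- source and writes the lake table, with no separate storage tier of its own.
INSERT INTO iceberg_lake.analytics.daily_revenue
SELECT order_date, region, SUM(amount)
FROM postgres_sales.public.orders
GROUP BY order_date, region;
```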

What is the benefit of centralized data?

In theory, centralization would improve a company’s data usage. Running everything in one place should give the company more control over inputs, costs, and outcomes. Centralized data’s theoretical benefits included:

Data management

A centralized data strategy was meant to be easier to manage. Corporate data teams controlled infrastructure initiatives, allowing them to align data storage with business processes and analytical workflows.

Performance and latency

Consolidating data allowed enterprises to invest efficiently in technology to optimize storage and computing performance.

Rather than waiting to retrieve data from the periphery, corporate users could get their data over the company’s low-latency networks and purpose-built data technology.

Data integrity and quality control

Corporate data teams developed the pipelines for ingesting, cleaning, and preparing data entering the centralized system. Managing pipelines in one place eliminated inefficiencies and redundancies within the data infrastructure.

Data consistency

With total control, IT departments could ensure the consistency of all data in the central repository. Data pipelines ingested, cleaned, and prepared data so users could trust that similar data would share standard formats, labels, and other properties.

Organizational consistency

Giving every user access to consistent data reduced the risk that conflicting data would cause poor business decisions. For example, sales and finance departments would use the same revenue metrics in their reports. Centralized organizational structures responsible for data management enabled hyper-specialization in data technologies and made those skills available to all lines of business.

Streamlined collaboration

Giving business users clean, accurate, and consistent data aligned everyone in the organization. Collaboration around actionable insights became easier when everyone used the same numbers.

What are the disadvantages of centralized data?

In practice, centralization and its benefits were more an aspiration than a reality. Companies could never fully centralize their data storage. Business needs always demanded domain-level storage. As a result, centralized enterprises have always faced enormous challenges.

Loss of context

The finance, marketing, and risk functions of a financial services organization each think about a customer in fundamentally different ways, with different sets of relevant attributes. Attempting to model these different contexts of cross-domain entities in a single system is difficult and becomes more challenging over time.

Data ownership, roles, and responsibilities

Centralization was meant to bring ownership and control under one roof. However, it also created a disconnect between the application engineers creating the data, the data engineers managing it, and the lines of business consuming it. Conflicting priorities led to inefficiencies that inhibited business growth.

The cycle of doom

Centralization also created divisions and strife between executives and their organization’s data teams. Consider what typically happened when executives needed data to support a decision:

1) Executives (or their analysts) create a data request.

2) The request joins dozens of others in the task queue.

3) Data engineers design a pipeline to fulfill the request.

4) Data engineers use the pipeline to provision data into a central location.

Even in the best-case scenario where the data was useful, it could take weeks to fulfill a decision-maker’s request — months for complex requests or when the data team was busy.

In other scenarios, the delivered data would be outdated or fail to meet the decision-maker’s expectations. Executives re-submitted their requests, demanding a better result. This request-wait-repeat cycle compromised the business in several ways.

Data teams did not have infinite resources, so reworking failed requests pulled their limited time away from new tasks. Growing backlogs worsened the data team’s support for the rest of the organization.

Decision-making became rigid, and the business became less responsive to a dynamic competitive environment. Decision-makers would find creative solutions to this problem by accessing shadow data sources. However, short-term “solutions” like this created long-term data governance problems.

Shadow IT

One consequence of the cycle of doom is the creation of department-level single sources of truth, where lines of business manage and govern data outside the purview of the data governance function to meet their own business requirements and timelines.

How Starburst helps with centralized data

True centralization never existed. Enterprise data was always distributed between the core and the periphery.

Today’s data architects recognize this reality and embrace distributed data.

Starburst empowers this decentralized strategy. Rather than struggling to create a single source of truth, companies use Starburst to give their users a single point of access to every data source.

Fast query engines drive data-driven cultures

Starburst enables lightning-fast retrieval and data analytics at scale. Built on the open source Trino project’s massively parallel query engine, Starburst adds performance optimizations that can deliver ten-to-twentyfold improvements on complex queries.

By creating a virtual data layer over a decentralized data storage infrastructure, Starburst eliminates many frustrations users experience in centralized systems.

Analysts can use the SQL tools they already know to directly access data without relying on data pipelines. This direct access lets decision-makers and analysts get the data they need without waiting for data engineers.
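
As a hedged illustration of that self-serve model, the query below joins a data lake table with an operational database table through two Starburst catalogs in a single ANSI SQL statement, with no intermediate pipeline. The catalog and table names are hypothetical.

```sql
-- Self-serve federated query sketch: one statement spans two catalogs
-- (a lake catalog and an operational database catalog), so no pipeline
-- is needed to stage the data first. Names are illustrative only.
SELECT c.customer_segment,
       COUNT(*)           AS orders,
       SUM(o.order_total) AS revenue
FROM lake.sales.orders AS o
JOIN crm_db.public.customers AS c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01'
GROUP BY c.customer_segment
ORDER BY revenue DESC;
```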

Since this self-serve model lets data teams spend less time on simple requests, engineers can devote more time to complex big data projects that drive business growth.

Connectivity and interoperability

Starburst brings data to center stage by pushing complexity into the background. For example, the holistic user interface masks the complexity of integrating disparate data sources.

Starburst offers over fifty connectors to enterprise-scale data systems, from data lakes to real-time data streams. These connectors expose each source as a Starburst catalog, translating data types and automatically handling source-to-source variations in SQL implementations.
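
The practical effect is that every source answers to the same three-part naming scheme (catalog.schema.table) and the same SQL dialect. The snippet below is a sketch with hypothetical catalog names; each catalog would be backed by a different connector.

```sql
-- List the catalogs configured in the cluster (output depends on your setup).
SHOW CATALOGS;

-- The same SQL works regardless of what sits behind each catalog.
-- Catalog, schema, and table names below are hypothetical.
SELECT * FROM lake.sales.orders LIMIT 10;        -- object storage / Iceberg
SELECT * FROM crm_db.public.customers LIMIT 10;  -- relational database
SELECT * FROM clickstream.web.events LIMIT 10;   -- streaming source exposed as a table
```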

Abstracting the complexity of disparate data sources severs the chains binding compute and storage. Domains can decide how best to manage their data stores without impacting the rest of the organization. Data teams can optimize compute to deliver affordable performance at scale.

Adaptability to regulatory requirements

Modern enterprises store data all over the world, which catches them in a tangled web of global data security, privacy, and sovereignty regulations.

ISO/IEC 27001 certified and SOC 2 Type II compliant, we understand what our customers go through to meet regulatory requirements. The Starburst data lake analytics platform includes security and governance capabilities that streamline regulatory compliance.

Create fine-grained role-based authorization policies with Starburst’s built-in access controls or through connectors for Apache Ranger, Privacera, or Immuta.

Query Starburst’s activity and change logs to monitor data access, flag unusual behavior, and conduct compliance audits.
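
As a rough sketch of how these capabilities can look in practice, the statements below use standard SQL role and grant syntax, plus a hypothetical audit-log query. All role, catalog, and table names are assumptions for illustration; fine-grained row- and column-level policies are typically defined through Starburst’s built-in access control or the integrations named above rather than hand-written SQL.

```sql
-- Role-based access sketch using standard SQL statements, assuming role
-- management is enabled. Role and table names are hypothetical.
CREATE ROLE finance_analyst;
GRANT SELECT ON lake.sales.orders TO ROLE finance_analyst;

-- Compliance monitoring sketch: this assumes audit events are exposed as a
-- queryable table; the table and column names here are hypothetical.
SELECT query_user, catalog_name, table_name, query_start
FROM audit.events.query_log
WHERE query_start >= current_date - INTERVAL '7' DAY
  AND table_name = 'orders';
```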

Balancing centralized and distributed data

Just as enterprises never fully achieved a centralized data architecture, most companies will not fully decentralize. Data warehouses and data lake technologies will continue to exist, with data lakes continuing to grow in popularity. They will never be one-stop shops for all data, but data lakes will play a critical role in decentralized data architectures.

Using Starburst to bring data together in a lake, where regulations allow and efficiencies can be exploited, is a powerful capability. Data engineers can decide when to build and maintain pipelines, which data to integrate and from which sources, and when to federate access to remote data sources and data lakes.

Embracing the decentralized data mesh

Starburst helps evolve your data infrastructure from a rigid, underperforming centralized model to a modern, distributed data mesh. Domains best understand the context and purpose of the data they generate. A mesh lets domains own data while providing a framework for sharing it with other domains.

Corporate data teams still play a role, guiding domains through high-level standards for data quality and interoperability.

By embracing the distributed architecture of a data mesh, companies can deploy a self-serve access model that empowers data-driven decision-making everywhere in the organization.

Fast, secure, and controlled access gives users the data they need to support KPIs and other business goals. By supporting better, faster decisions, Starburst creates a competitive advantage in a dynamic marketplace.