How to do data centralization the right way
Is data centralization the new black?
For a long time, data centralization was the default option for big data projects. In fact, it was often the only option for teams building data warehouses and data lakes with data integration and automation workflows. This monolithic approach often created more problems than it solved. Against this backdrop, data decentralization grew in popularity as companies sought to unburden data engineering teams of their mounting backlogs. Unfortunately, full data decentralization is difficult to implement without the right tools, and it requires the right organizational approach.
The truth is more nuanced. While data centralization can be the best choice for some use cases, what’s needed is a flexible, modern data platform that allows you to have the best of both worlds.
In this article, we’ll look at how to do data centralization the right way. This means taking a balanced approach between decentralization and centralization. We’ll look at how to build a data architecture that gives you the flexibility to implement it in an iterative, evolutionary fashion so you can reap its benefits without stumbling across its roadblocks.
Why data centralization still matters
Let’s start from the top. How do you achieve data centralization, and what does that do for your data strategy?
Data centralization is a data architectural strategy that involves copying data from its original location into a single, central data source. This centralized system then becomes the single source of truth for both data analytics and Artificial Intelligence (AI) workloads.
To achieve this, data engineers construct data pipelines that extract, load, and transform data from different sources, often harmonizing heterogeneous data structures into a single schema as they go. These pipelines run on a regular schedule to keep the central store fresh, cleaning and transforming data along the way.
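As a rough illustration, here is a minimal sketch of such a pipeline in Python, using a CSV export and a SQLite database as stand-ins for a real source system and central store. The file paths, column names, and table are hypothetical, not tied to any particular product.

```python
# Minimal extract-transform-load sketch: a hypothetical CSV export is
# harmonized into one schema and loaded into a SQLite "central store".
import csv
import sqlite3
from datetime import datetime, timezone

SOURCE_FILE = "exports/orders.csv"    # hypothetical source extract
CENTRAL_DB = "central_warehouse.db"   # stand-in for the central store

def run_pipeline() -> None:
    # Extract: read the raw rows from the source export.
    with open(SOURCE_FILE, newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: harmonize heterogeneous structures into a single schema.
    normalized = [
        (
            row["order_id"],
            row.get("customer_id") or row.get("cust_id"),  # reconcile naming
            float(row["amount"]),
            datetime.now(timezone.utc).isoformat(),        # load timestamp
        )
        for row in rows
    ]

    # Load: upsert into the central table so scheduled reruns stay idempotent.
    with sqlite3.connect(CENTRAL_DB) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS orders (
                   order_id TEXT PRIMARY KEY,
                   customer_id TEXT,
                   amount REAL,
                   loaded_at TEXT
               )"""
        )
        conn.executemany(
            "INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)",
            normalized,
        )

if __name__ == "__main__":
    run_pipeline()  # in practice, triggered by a scheduler or orchestrator
```

In a real deployment, the central store would be a warehouse or lakehouse rather than SQLite, and a scheduler or orchestrator would run the job on whatever cadence your freshness requirements demand.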
The benefits and challenges of centralization
Although it’s not always easy to achieve, when approached in moderation, data centralization provides numerous benefits, including:
- Faster queries. Even with excellent connectivity, it’s hard to compete with the speed that comes from having all your data in one physical location. Using some centralization in your data stack makes cross-table and cross-system joins faster, reducing the latency inherent in federated queries and creating opportunities for optimization (see the sketch after this list). This need has only increased with the rise of real-time analytics.
- Discoverability. Despite the difficulties of centralizing all your data, having some data held in a single source of truth makes things simpler. Specifically, it improves data discoverability and data access by giving data users a single location to search, reducing the risk of creating data silos.
- Collaboration. Depending on the distributed data architecture you’re using, you might be storing the same data set in multiple locations, which raises the risk of introducing discrepancies. For example, two teams might calculate monthly sales using different logic or data structures. Having an official source of truth for a dataset makes collaboration more manageable and improves decision-making, because version control issues can’t creep in. Modern lakehouse table formats like Apache Iceberg help manage this even in distributed systems: they can support as much decentralization as you want, but they can also support centralization where it makes sense. Either way, the degree of centralization is a choice you control, not a rigid constraint imposed upon you.
- Better governance. Centralization reduces the risk of rogue data sets. When data isn’t heavily siloed, teams can more easily detect and resolve security and compliance issues. Centralizing the datasets where governance matters most can make sense, and a data lakehouse can help you do so without the problems of past centralization efforts.
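To make the query-speed point concrete, here is a minimal sketch using the Trino Python client that contrasts a federated cross-system join with the same join on co-located data. The coordinator host and the catalog, schema, and table names (postgres_crm, lakehouse, customers, sales) are hypothetical.

```python
# Contrast a federated join with a join over co-located data via Trino.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator address
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# Federated: the join spans two systems, so rows from the operational
# database travel over the network at query time.
cur.execute("""
    SELECT c.region, sum(s.amount) AS revenue
    FROM postgres_crm.public.customers AS c
    JOIN lakehouse.analytics.sales AS s ON s.customer_id = c.id
    GROUP BY c.region
""")
print(cur.fetchall())

# Centralized: once customers has been landed in the lakehouse, the same
# join runs against co-located data and can be optimized locally.
cur.execute("""
    SELECT c.region, sum(s.amount) AS revenue
    FROM lakehouse.analytics.customers AS c
    JOIN lakehouse.analytics.sales AS s ON s.customer_id = c.id
    GROUP BY c.region
""")
print(cur.fetchall())
```

Both queries return the same answer; the difference is where the data lives when the join runs, which is exactly the trade-off the rest of this article is about.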
How data centralization got a bad reputation
If data centralization is so beneficial, why did data teams turn away from it? The answer is that centralization became a decree rather than an architectural choice. This led to a “Centralizing Everything” mindset – a strategy that ends up succumbing to serious pitfalls:
- The Cycle of Doom: Taking centralization as the default policy means a centralized data team has to invest significant time and money moving data before anything gets done. In most cases, this results in large engineering backlogs, delaying new data onboarding projects for weeks or even months. This directly impacts operational efficiency, particularly for large enterprises whose data may have complex requirements around access controls and the regulatory handling of customer data.
Despite this, changing the scope of a centralization project has not historically been an option. Once the process begins, organizations have to keep moving everything toward a single, central location, which does not make for agile projects. In the meantime, rather than wait, many teams are forced to build their own solutions within their respective data domains. The result is an explosion of expensive, redundant, and ungoverned “Shadow IT” projects across the company. In this sense, too much centralization can harm your business processes and actually reduce your ability to make clear business decisions.
- Indiscriminate centralization is wasteful. Not all centralization makes sense for your data or your business, but the nature of traditional data technologies drove an all-in approach. Ask yourself: are your stakeholders even using the data you’re spending so much time and money to centralize? If not, you are spending large amounts of money for minimal benefit. Centralization by default costs engineering time and ongoing storage and compute as you hit scalability problems. That sends your data costs soaring, hurts the business’s bottom line, and makes your entire data strategy less cost-effective.
- Can create disconnects. When executed poorly, centralization removes responsibility for a data set from its domain owners. The people most familiar with the data, and with how it should be shaped and maintained, become disconnected from the decisions needed to make it successful. Data engineers who are unfamiliar with that domain can inadvertently make choices that hurt data quality, for example by using incorrect logic or outdated data sources, easily leading to duplication and redundancy. All of this shows the damage the wrong kind of centralization can cause.
How to do data centralization the right way
It’s time to approach centralized data in a new way, one that brings choice back into the mix. What’s needed is a new approach to data architecture—one that allows you to balance decentralization vs. centralization on a case-by-case basis.
Let’s call this “Centralization 2.0.” It’s an approach to data management that recognizes that some critical data sets work best when they’re:
- Standardized
- Easily discoverable
- Co-located with related data
Achieving this requires a centralization architecture that lets you choose which datasets to centralize and which to decentralize using data federation. That, in turn, requires an approach to centralization that doesn’t force you to buy the whole farm when all you need is a few cattle.
Generally, you’ll pursue one of three options for this flexible approach to data centralization. Let’s look at them one by one.
- Data warehouse
- Data lake
- Data lakehouse
Data warehouse
The modern data warehouse, with its separation of storage and compute, is a highly performant technology when analyzing large amounts of structured data. For that reason, it’s been the primary technology driving data centralization for decades.
Unfortunately, data warehouse centralization has also been the main driver of many failed centralization projects. There are a few reasons for this. First, requests for centralization get funneled through a single data engineering team that acts as the keeper of the keys. This leads to all the problems surrounding the never-ending projects that we discussed earlier.
Additionally, data warehouses store only one type of data: structured data held in specific models like a star schema. The warehouse’s rigid, predefined schema makes reusing data for other purposes difficult.
It also means you can’t leverage semi-structured or unstructured data without transformation. In other words, a data warehouse can’t truly serve as an organization’s single source of truth.
This is where data lake and data lakehouse architectures come into play.
Data lake
A data lake uses either cloud object storage, local object storage, or the Hadoop Distributed File System (HDFS). In each case, it houses large amounts of raw data cost-effectively, no matter where it’s generated in the business.
Most data lake implementations have historically been powered by the Apache Hive table format, which enables SQL-like queries on data stored in AWS S3, Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS). The big benefit of a data lake compared to a data warehouse is that you can access all of your data regardless of structure, including tables, JSON files, documents, audio, and video. A data lake is truly a one-stop shop for an organization’s centralized data needs.
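As a small illustration, here is a sketch of registering existing Parquet files in object storage as a Hive table and querying them with SQL through the Trino Python client. The catalog, schema, table, and S3 path are hypothetical.

```python
# Map Parquet files in object storage to a SQL-queryable Hive table.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# Register the files as an external table in the Hive catalog.
cur.execute("""
    CREATE TABLE IF NOT EXISTS hive.web.page_views (
        user_id varchar,
        url varchar,
        viewed_at timestamp
    )
    WITH (
        external_location = 's3://example-data-lake/page_views/',
        format = 'PARQUET'
    )
""")
cur.fetchall()  # wait for the DDL to complete

# The raw files can now be queried with ordinary SQL.
cur.execute("SELECT url, count(*) FROM hive.web.page_views GROUP BY url")
print(cur.fetchall())
```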
Unfortunately, for every pro with the data lake, there’s a con that comes along with it:
| Area | Advantage | Disadvantage |
| --- | --- | --- |
| Cost | Uses low-cost object storage like AWS S3; separates storage and compute for maximum savings. | Storage costs will increase over time as data volumes continue to grow and you store and access more unstructured data. |
| Flexibility | Supports semi-structured and unstructured data; enables on-demand formatting for various use cases. | It is harder to govern semi-structured and unstructured data, as there isn’t a good way to discover what’s in your lake or who has access to it; no version control. |
| Performance | May improve performance by providing convenient access to data regardless of its format. | Slow query speeds compared to newer data storage formats due to limitations in Hive, including inherent architectural inefficiencies when handling data updates/deletions and slow file list operations. |
Data lakehouse
A data lakehouse is an evolution of the data lake designed to address some of its core constraints, including data centralization. Rather than using Hive, data lakehouses use advanced open table formats such as Apache Iceberg, Delta Lake, and Hudi, as well as open file formats like Parquet and ORC.
As a result, data lakehouses surpass data lakes in several key respects:
- Better support for high-velocity data. Data lakehouses perform particularly well on data that undergoes regular updates and deletes, making high-velocity data a realistic candidate for centralization for the first time (see the sketch after this list).
- Better overall performance. Data lakehouses use columnar, read-optimized file formats alongside open table formats. This lets them leverage high-performance features such as indexing and caching, delivering lower latency, better operational efficiency, and superior performance for centralized, physically co-located data.
- Better governance. Open table formats enable the storage of a richer set of metadata, which allows for improved data governance and enhanced features like time travel and partition evolution. This allows you to adapt your central repository as it changes over time.
- Better data management. Lakehouses simplify the management of data of all types, including storing, analyzing, and querying data in different structures. In this sense, a data lakehouse simplifies your data pipelines, making it a more appealing option for centralization.
- Support for advanced analytics and AI. A data lakehouse can deliver better performance than a data lake for both analytics and AI / machine learning use cases. This is especially true for high-velocity data that requires frequent updates and deletes, or where data volumes are large or subject to specific regulatory requirements. In these cases, a centralized data lakehouse better supports both business intelligence (BI) tools and AI workflows.
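For a concrete feel of what open table formats make possible, here is a sketch of row-level changes and time travel on an Iceberg table, issued through the Trino Python client. The catalog, schema, table names, and timestamp are hypothetical.

```python
# Row-level changes and time travel on an Iceberg table via Trino.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# Updates and deletes that classic Hive-style data lakes handle poorly.
cur.execute("DELETE FROM iceberg.analytics.orders WHERE status = 'cancelled'")
cur.fetchall()  # wait for the statement to complete
cur.execute("UPDATE iceberg.analytics.orders SET status = 'shipped' WHERE order_id = '42'")
cur.fetchall()

# Time travel: query the table as it looked before those changes.
cur.execute("""
    SELECT count(*)
    FROM iceberg.analytics.orders
    FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'
""")
print(cur.fetchall())
```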
Taking an evolutionary approach to data centralization
It’s time to rethink the role of data centralization in light of data lakehouses. By expanding the scope of use cases, improving performance, and enhancing support for governance, data lakehouses do centralization differently. With a data lakehouse, you can centralize without making centralization the default.
In the past, centralization was the automatic starting point for all data projects. That led data teams to boil the ocean, an impossible feat whose endgame quickly became the proverbial Death March project.
But the solution isn’t to abandon centralization altogether. It’s to embrace a hybrid architecture that enables both decentralized and centralized data scenarios – with centralization happening on your terms, in line with your business needs.
This flexible architecture allows you to take a DevOps approach to centralization. Instead of centralizing “Everything, Everywhere, All at Once,” you can plan smaller projects that bring subsets of data into your data lakehouse piece by piece.
How do you get started? We recommend the phased approach below to get centralization right.
Start small
Instead of centralizing your entire data ecosystem automatically, start with something simple. For example, if you’re a Snowflake user, you could begin by connecting a single Snowflake instance and accessing it via Starburst. Starburst can maintain a Snowflake Catalog Metastore that supports data federation while lowering Snowflake compute costs. This means you can start saving money with Starburst immediately while keeping your data where it is at first.
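As an illustration, here is a minimal sketch of querying an existing Snowflake table through a Starburst or Trino catalog with the Trino Python client, assuming a catalog named snowflake has already been configured. The host, schema, and table names are hypothetical.

```python
# Query a Snowflake table in place through a configured catalog.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",  # hypothetical coordinator address
    port=8080,
    user="analyst",
    catalog="snowflake",           # assumed pre-configured catalog
    schema="sales",
)
cur = conn.cursor()

# The data stays in Snowflake; only the query results move.
cur.execute("SELECT region, sum(amount) FROM orders GROUP BY region")
print(cur.fetchall())
```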
Explore your data
Once that’s in place, start connecting other data sources and review the impact that has on cost. Then, experiment with drawing data from multiple systems with federated queries, seeing what works, what doesn’t, and where you might need to transform or fix data before centralizing it.
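Here is a small sketch of that exploration step with the Trino Python client, listing the connected catalogs and peeking at their tables through information_schema. The host name is hypothetical.

```python
# Discover what is available across connected sources before centralizing.
import trino

conn = trino.dbapi.connect(host="starburst.example.com", port=8080, user="analyst")
cur = conn.cursor()

# List every connected source.
cur.execute("SHOW CATALOGS")
catalogs = [row[0] for row in cur.fetchall()]
print(catalogs)

# Peek at the first few tables in each catalog.
for catalog in catalogs:
    cur.execute(
        f"SELECT table_schema, table_name FROM {catalog}.information_schema.tables"
    )
    print(catalog, cur.fetchall()[:5])
```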
As part of this process, identify your highest-value use cases for centralization: the datasets that are most central to your business and most frequently accessed across the company. Centralization takes time and money, so it isn’t appropriate for everything, but when done correctly, it can bring big rewards. By tackling these first, you’ll get the biggest bang for your buck.
Expand your scope
As your project progresses, monitor metrics such as data usage, performance, and cluster capacity. Compare centralized to decentralized performance to verify whether you’re getting the expected benefits. Use trial and error to see what works and what doesn’t.
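One rough way to run that comparison is to time the same query in its federated and centralized forms, as in the sketch below using the Trino Python client. The host, catalogs, and tables are hypothetical, and in production you would lean on the platform’s own query metrics rather than wall-clock timing.

```python
# Rough latency comparison between a federated and a centralized join.
import time
import trino

conn = trino.dbapi.connect(host="starburst.example.com", port=8080, user="analyst")
cur = conn.cursor()

def time_query(sql: str) -> float:
    start = time.perf_counter()
    cur.execute(sql)
    cur.fetchall()  # drain results so the full query cost is measured
    return time.perf_counter() - start

federated = time_query(
    "SELECT count(*) FROM postgres_crm.public.customers c "
    "JOIN lakehouse.analytics.sales s ON s.customer_id = c.id"
)
centralized = time_query(
    "SELECT count(*) FROM lakehouse.analytics.customers c "
    "JOIN lakehouse.analytics.sales s ON s.customer_id = c.id"
)
print(f"federated: {federated:.2f}s, centralized: {centralized:.2f}s")
```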
After you vet the initial use case, repeat the process. Apply the same trial-and-error approach with each new data set, especially as you support new scenarios. For example, you could move from centralizing cloud-hosted to on-premises data sets if you determine it could save you money in the long run. It’s up to you, and that’s the point. How fast you go—and how much you centralize vs. federate—is up to you and your business.
How Starburst helps you do data centralization the right way
Some data centralization can boost the performance, data security, data governance, and cross-team collaboration of your most critical data use cases. The right way to apply centralization is thoughtfully and carefully, and that requires a data architecture in which you can pursue it through small, rapid iterations.
We created Starburst as the driver for this flexible data architecture. With Starburst’s Open Hybrid Data Lakehouse built on Trino, you can run decentralized and centralized workloads efficiently on massive data sets, mixing and matching for the best blend of performance and cost – all at your pace and on your terms.
Starburst’s Icehouse architecture combines the performance benefits of Trino with the security and data management benefits of the Iceberg table format. It’s a unified solution for managing data regardless of where it lives.
With a Starburst Icehouse architecture, you can centralize as much or as little of your data as you want. You’ll want to move a lot once you see the benefits! But that’s up to you. Unlike other data lakehouse architectures, with Starburst, that decision remains firmly in your hands.