Everything I knew about enterprise data warehousing is now failing me
Isaac Obezo
Staff Data Engineer
Starburst
In my first blog post, I showed how to use Starburst to create resilient data pipelines. I also explained how I shifted away from traditional enterprise data warehouses (EDWs) to avoid their often fragile ETL pipelines.
In this post, I will expand on the EDW as a topic and share lessons that I have learned in my time working with cloud data warehouses. I will also explore the ways that EDWs sometimes hold organizations back and explain how a fresh, decentralized approach to enterprise data can fulfill the promise of modern data analytics.
What are the benefits of an enterprise data warehouse (EDW)?
Data centralization around an EDW has traditionally been the bread and butter of data engineering. Businesses thrive on data, good and bad, and across the industry we data professionals spend most of our careers building pipelines to centralize it. You can think of this as the standard model.
Flaws in the enterprise data warehouse (EDW) model
But this model has flaws. One of the biggest challenges facing the EDW model is the organic evolution of data over time. Data typically originates from many different systems within the organization. Organizations also create data in different regions, involve different domain experts, and manage and maintain this data in varied ways. In general, data is broad. It can come from any source, including business data, historical data, real-time data, transactional data, operational data, and unstructured data. It can also be created anywhere: in apps, on IoT devices, on premises, in the cloud, and in any geography.
Single source of truth is unrealistic
Providing business users with a single source of truth is idealistic. Sure, there is a shared point of reference, and it tends to be pipelined into a central repository. The goal is to enable analysts to build insights and support data-driven decision making across the organization. And certainly, data warehousing architectures have historically been the right solution, first with on-premises providers like Teradata and then with cloud-based data warehouses like Snowflake.
Warehousing has worked, has been the standard, and has enabled businesses to become more data driven. But the unstructured data that is available, and often goes unseen or unused (so-called dark data), hides the chaos that exists below the surface.
Challenges of building and maintaining an enterprise data warehouse
Like most data engineers, I have seen first-hand how challenging it is to build, let alone maintain, an enterprise data warehouse. One post is not enough to cover all the challenges, so for now, I’ll focus on these three: unpredictable costs, data complexity, and access control.
1. Unpredictable enterprise data warehouse costs
On-premises data warehouse costs
One key issue facing enterprise data warehouses is unpredictable costs. This is particularly true for on-premises data warehouses. These systems are generally proprietary platforms that bundle functionality, storage, and compute. They involve a fixed infrastructure investment, which becomes hard to change when the need arises. Instead, organizations are forced to either accept the constraints or spend money on idle capacity just in case they need it.
In my experience, building an on-premises data warehouse is a constant struggle to balance capacity planning against budget and costs. You usually plan a year or two ahead, and I have never been prescient enough to get it right: we would fall short on capacity just as our on-premises warehouse took off.
Cloud data warehouse costs
Cloud data warehouses solve one cost problem but introduce another. Specifically, these solutions solve the scalability problem, offering nearly unlimited compute and storage.
However, this approach introduces a new problem: with unlimited compute and storage comes unlimited cost. Did I spend too much on compute, on queries, or on storage? It can be a struggle to understand where to rein in spending.
If you do find a more affordable and predictable data warehouse platform, you still face the arduous and expensive task of justifying the migration, decoupling from your existing systems, and moving to the new platform. Migration costs alone can make vendor lock-in a very real possibility. Many times, decoupling simply is not realistic.
2. Data complexity overwhelms centralized data warehouses
Centralization presents its own difficulties. Even with the promise of cloud warehousing, data scalability can be limited. Storage capacity cannot always keep pace with data volumes or velocities, and it often takes extra work at ingestion just to keep data and pipelines manageable and affordable.
Data warehouses struggle to change
Data warehouses also tend to struggle with scale when business demands change. Workloads are becoming more complex, especially as businesses continue to adopt artificial intelligence and machine learning. Large, complex transformation jobs are testing the performance limits of warehouses, cloud and on-premises alike.
Data itself is often decentralized
Data itself is also becoming more complicated: more diverse, more varied, more informative and detailed. Business decisions are no longer based only on structured, well-formed, schema-ready data. Warehouses were not really designed for semi-structured data such as logs, clickstreams, IoT streams, and other non-“normal” data.
Data quality comes at a cost
Maintaining data quality in the face of this complexity adds to data warehousing costs. New data requires new pipelines, pipelines require testing, and the data itself requires quality checks and gates (a sketch of such a gate follows). Data projects are rarely straightforward or easy, and every pipeline needs a great amount of care and feeding to make sure the warehouse holds quality data.
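To give a sense of what those gates look like, here is a minimal sketch in Python. The table name, the checks, and the run_scalar() helper are hypothetical; run_scalar() simply stands in for whatever query client your warehouse exposes.

```python
# A minimal, hypothetical quality gate: the table name, checks, and the
# run_scalar() callable are illustrative, not tied to any specific warehouse.
from typing import Callable


def quality_gate(run_scalar: Callable[[str], int]) -> None:
    """Fail the load if basic expectations about the staged data do not hold.

    run_scalar is any function that executes a SQL statement and returns a
    single integer, e.g. a thin wrapper around your warehouse's client.
    """
    # Gate 1: the load actually produced rows.
    if run_scalar("SELECT count(*) FROM orders_staging") == 0:
        raise ValueError("quality gate failed: orders_staging is empty")

    # Gate 2: the business key is always present.
    null_keys = run_scalar(
        "SELECT count(*) FROM orders_staging WHERE order_id IS NULL"
    )
    if null_keys > 0:
        raise ValueError(f"quality gate failed: {null_keys} rows missing order_id")
```

Every new source tends to accumulate a few more gates like these, and each one is more pipeline code to test and maintain.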
3. Managing enterprise data warehouse access and control at scale
Access control is the third challenge. One of the main goals in centralizing data in a data warehouse is to make that data more accessible to business users. But how do you democratize access at scale? Specifically, how do you give business users at different levels just enough access to run the analytics they need, while still addressing security concerns?
And more importantly, how do you scale access while also complying with data privacy and sovereignty regulations? Access policies for each data set may change depending on who the user is, where they are, and whether the data falls under the rules of Texas, California, France, or anywhere else.
EDW access presents a logistical problem
Access quickly becomes a logistical nightmare. Data warehouses enable some integration with existing infrastructure, but frequently it isn’t enough. To close the gaps, data engineers typically work with other teams, such as data governance teams, who help build access controls that work with the warehouse model. However, these access controls require a lot of maintenance and add to the logistical backlog.
Access control can obscure data visibility
Securing data, controlling access, and providing data to end users is a balance of give and take. But the real access controls are often left behind on the source systems as we move data through the pipeline to the warehouse. In this sense, pipelining data into the EDW can obscure the original intent of its access controls.
These are by no means the only data warehouse challenges we face, but they are consistent across companies and industries.
4. Limitations of a data warehouse architecture
As data engineers, we spend a lot of time, effort, and resources resolving data and data warehouse challenges. When this happens, it is easy to be lulled into the comfort of thinking that the data system you work on just works. Similarly, it can be reassuring to believe that the challenges you face are negligible, or can be resolved easily. My journey with Starburst has been an awakening to how many of these challenges are built into the centralized data warehousing model.
How centralization hurts the data warehouse model
There is an incongruity at the heart of the EDW model, and it involves centralization. Centralization is the cornerstone of every data warehouse model, yet centralizing enterprise data is an impossible task, because enterprise data is fundamentally decentralized.
Let me explain.
Data is generated in a decentralized way
Disparate data is everywhere, and it is generated by everything: every system, every application. From manufacturing to HR, from domestic and overseas offices, data’s exponential growth is a constant force. Companies store data on servers, personal devices, and cloud platforms, as well as in SaaS applications. The list goes on, and we are often building systems to catch up with some data source that is already pushing data.
Data centralization can never capture all the data
However, this view does not tell the whole story. In many cases, large amounts of data never enter a data warehouse in the first place. In other cases, data is added to the warehouse but never used. These gaps arise for many reasons. In my experience, the most common is that the data in question was never asked for in the first place; in other cases, it was asked for but forgotten. So we are constantly integrating new sources into the warehouse.
Some data should not be centralized
Some data you cannot, or should not, centralize at all. There are always legacy systems, there may be legal or ethical constraints, and there are many other reasons why data cannot be warehoused.
Data warehouses cannot cope with the endless flow of raw data from real-time sources. You have to hope you made the right assumptions at the start of building the pipeline, or when you sampled and aggregated the data. And in many cases, compliance with data sovereignty regulations prevents you from moving or copying data at all.
Centralization is one of the main frustrations in data engineering
When you’re running a centralized data architecture on top of an inherently decentralized information ecosystem, you get all the challenges that make data engineering so frustrating. Data warehousing requires planning, pipeline development, and constant maintenance. It is no surprise that many are seeking alternatives to the traditional data warehouse and pipeline models.
5. Decentralized data with Starburst makes data more accessible
What I’ve learned with Starburst is that there is a better model for data than a central warehouse. Starburst enables you to build a data analytics infrastructure that reflects the nature and topology of your existing data infrastructure.
The decentralized Starburst approach
This approach introduces an abstraction layer that lets you bring together data sources spread across internal organizations, regions, and business units, and across different on-premises or cloud storage platforms, relational databases, data lakes, and data lakehouses. Over time, if you want to shift workloads from more expensive data warehouses to less expensive data lakes and lakehouses, you can do that too. You can even adopt a Starburst Icehouse architecture built on Apache Iceberg.
Instead of moving data into a warehouse, the data remains in its original location with its original attributes and characteristics. With Starburst, we get a single interface, a single source of truth, and a single point of access for engineers, analysts, and data products. As a result of this federated architecture, the challenges we discussed earlier disappear.
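To make that single interface concrete, here is a minimal sketch using the open-source trino Python client (Starburst is built on Trino). The host, catalogs, schemas, and table names are hypothetical placeholders; the point is that one ANSI SQL statement can join data that still lives in two different systems.

```python
# A minimal sketch of a federated query through a single Starburst/Trino
# endpoint. Host, catalogs, schemas, and tables below are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",  # hypothetical coordinator endpoint
    port=443,
    http_scheme="https",
    user="analyst",  # add authentication as your cluster requires
)
cur = conn.cursor()

# One ANSI SQL statement joins a table in a data lake catalog with a table
# in an operational database catalog; neither data set is copied or moved.
cur.execute("""
    SELECT c.region, sum(o.total_price) AS revenue
    FROM lake.sales.orders AS o
    JOIN postgres.crm.customers AS c
      ON o.customer_id = c.customer_id
    GROUP BY c.region
""")
for region, revenue in cur.fetchall():
    print(region, revenue)
```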
Decentralization saves money
Decentralization is more cost-efficient because it allows us to separate storage from compute. We tend to invest in storage, where growth is more predictable and easier to optimize. With storage and compute decoupled, rather than bundled into a single proprietary platform, Starburst lets you query, transform, and process big data more affordably on Amazon’s, Microsoft’s, or Google’s scalable cloud platforms. Its more than fifty enterprise-class connectors seamlessly unify data from different sources and eliminate the data warehouse’s pipeline development and maintenance costs.
Using data federation successfully
Using Starburst to build a federated architecture simplifies your data management system. Data volume and velocity are less of a challenge when historical and real-time data sets are just as accessible. Data products seamlessly integrate structured, unstructured, and semi-structured sources. Complex workloads take less effort to prepare and run.
My biggest revelation has been seeing how decentralization and abstraction transform data access. Starburst’s connectors automatically deal with each source’s unique take on SQL. Instead of learning a new dialect for each system, analysts can use their existing tools and SQL skills to access any data source directly, with less reliance on a data engineering team.
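As a small illustration, the sketch below (again with hypothetical names) runs the same ANSI SQL text against two different catalogs; only the catalog name changes, because the connectors handle each backend’s dialect.

```python
# The same ANSI SQL, pointed at two different backends simply by swapping the
# catalog name; catalog, schema, and table names here are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",  # hypothetical coordinator endpoint
    port=443,
    http_scheme="https",
    user="analyst",
)
cur = conn.cursor()

QUERY = (
    "SELECT count(*) FROM {catalog}.sales.orders "
    "WHERE order_date >= DATE '2024-01-01'"
)

for catalog in ("lake", "postgres"):
    cur.execute(QUERY.format(catalog=catalog))
    print(catalog, cur.fetchone()[0])
```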
Democratizing data access
Don’t get me wrong: as a data engineer, I take great pride in working with people and unlocking the data that decision-makers and business users need. I am not inclined to be a gatekeeper, and if being a data superhero means I also have to gatekeep routine requests, I am not doing my job well. My goal is to build for data users and enable data-driven organizations. There is plenty of work to be done, and unblocking people efficiently helps me in the long run.
Democratizing access in this federated model isn’t the compliance nightmare you’d think it’d be. Starburst’s single point of access becomes a single point of access control, where you can turn role-based and attribute-based rules into fine-grained controls at scale. Data access can be modeled to meet any compliance requirement: data stays protected under a least-privilege model.
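As one hedged example of what that single point of control can look like, the sketch below issues role-based grants as plain SQL through the same endpoint. It assumes a catalog whose connector supports SQL role management, and the role, table, and user names are purely illustrative; Starburst’s built-in access control adds attribute-based and fine-grained rules beyond what standard GRANT statements express.

```python
# A hypothetical sketch of role-based grants issued through the single access
# point. Assumes the target catalog supports SQL role management; all names
# are illustrative only.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",  # hypothetical coordinator endpoint
    port=443,
    http_scheme="https",
    user="admin",
)
cur = conn.cursor()

# Define a role once, grant it least-privilege read access, and hand it to an
# analyst; every catalog reached through this endpoint is governed in one place.
for statement in (
    "CREATE ROLE sales_analyst IN lake",
    "GRANT SELECT ON lake.sales.orders TO ROLE sales_analyst",
    "GRANT sales_analyst TO USER alice IN lake",
):
    cur.execute(statement)
    cur.fetchall()  # draining the result ensures each statement completed
```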
Don’t rely on the data warehouse for everything. Decentralization is inevitable as you grow.
Working on data warehouses taught me lessons that are still with me today. But it wasn’t until working with my team at Starburst that I could put those lessons into perspective.
All the hard work it took to keep pipelines working and get projects across the finish line shared a root cause that I couldn’t see at the time. We were trying to force a centralized model on a fundamentally decentralized, and uncentralizable, system. Our model didn’t match reality. Too often my job was maintenance, troubleshooting, and “hot-fixing”. I was pulled from problem to problem, unable to be more strategic or build better data systems, because we data engineers spend so much of our time just keeping our systems afloat.
Perhaps the pattern I’ve observed is even bigger than centralized data warehousing.
The way I see it, there is a parallel between software development, which has decentralized its code and processes, and data engineering. Software went from a highly centralized process to a peer-to-peer, networked one; the internet and networked systems led the way.
Bottom line: decentralization is inevitable, and that is how I would approach data management today.
Now that I have a different perspective, I see decentralization as the better way to reach the goal data warehouses were supposed to achieve. Starburst’s federated approach to data analytics lets you embrace the complexity of modern data without reducing it. It lets you use the right storage for the moment, whether that is a data warehouse, a data lake, or a data lakehouse. This makes data easier to manage, access, protect, and use to generate business insights.