Data Mesh: Embracing Decentralized Data Paradigms
Many data and analytics practitioners have heard about this socio-technical paradigm shift, Data Mesh, and would like to learn more. But before describing what Data Mesh is, it’s helpful to review how we got here and the challenges this evolving paradigm solves:
Too much data from too many sources.
As businesses embrace full digital transformation, the exponentially growing volume of data pouring in is staggering. Data points flow in dynamically, from around the globe, at a granularity never before contemplated.
While operational data has always been used as an analytical tool to drive business decisions, we’re now seeing BI and AI providing game-changing insights, driven by always-on transactional data for marketing, sales, and product development teams as well.
These customer-facing teams can finally know what works and what doesn’t, instantly, and based on every action taken by customers. It’s an extraordinary power to have, but the amount of data they have to work with is challenging to collect, store, organize, curate, and query.
Data lakes: no longer the single source of truth.
No one is going to argue that siloed data is a good thing. In fact, hundreds of startup companies have emerged, offering solutions to break open those silos.
But while the goal over the past decade has been the unification of data sources into a single data repository to yield a “single source of truth,” that repository suddenly, amazingly, feels like yesterday’s strategy. Why? Because it introduces several limitations while that single source swells day by day:
- Large-scale enterprise data management is challenging and messy. In particular, it’s a struggle to integrate live, flowing data into static or historical data.
- Moving data in and out of a data lake from edge sources, and managing its storage once it arrives, consumes time and resources and is very expensive. Bottlenecks become more frequent, and business agility declines.
- A single, aggregated collection of data cannot easily comply with data residency and privacy regulations that vary from country to country; data governance is going to be geographically diverse in a global organization, whereas the data lake and its hardware likely are not.
- Finally, and often the most painful aspect of a bloated data lake: query overhead doesn’t scale. As more users query the same repository, add sources, or manipulate what’s there, response times slow down. This assumes, of course, that the data lake incorporates true data virtualization to seamlessly allow anyone with permission to connect to any data source or platform, an important concern according to our recent survey.
There’s too much work for a centralized data team.
Enterprise data teams are caught between data providers and data users. Serving disparate business domains, each with its own complex, changing, and ad-hoc requests, is taking its toll.
Centralizing control and access among a limited number of users means that even minor requests often take their place in a long queue, delayed until the data team can assess the request, create the pipeline, and provide the data for analytics-based insights. Because the data team often resides in an ivory tower, far from the business users (geographically and tactically), the struggle is even greater. Expecting a centralized team to understand and curate data at scale becomes untenable as organizations grow and complexity increases.
How do we maintain the benefits of a centralized, standardized data lake while introducing scalability and access that currently don’t exist? How can we allow the folks who understand the data best to curate the data for downstream users and accelerate analytics? Can there be such a thing as a “distributed data lake”?
What is Data Mesh?
A Data Mesh is a decentralized approach to data management, where the data itself remains within the business domain that has collected it, cleaned it, and now manages it. Those domains are responsible for creating “data products,” which are curated, discoverable, secure, and fully described groups of related datasets, ready for consumption by downstream users outside the domain.
This may sound as if the data is now being siloed within the domain, but there’s one critical difference: SQL clients across the entire organization can query these data products through a distributed query engine. That engine sits on top of the privately owned, coherent business data and can access and unify it for interoperability, rather than storing it centrally.
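To make the idea concrete, here is a minimal sketch in Python. It uses the standard library's sqlite3 module purely as a stand-in for a distributed query engine: two separate in-memory databases play the role of domain-owned stores, and a single SQL statement joins across them without copying either one into a central repository. The domain names (sales, marketing), the tables, and the rows are all hypothetical, and a real mesh would put a federated engine over remote sources rather than SQLite.

```python
import sqlite3


def build_mesh_connection() -> sqlite3.Connection:
    """Create one connection with two attached, independent databases.

    Each attached database stands in for a store owned by one domain.
    """
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates a separate in-memory database under its own schema name.
    conn.execute("ATTACH DATABASE ':memory:' AS sales")
    conn.execute("ATTACH DATABASE ':memory:' AS marketing")

    # Hypothetical data product owned by the sales domain.
    conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
    # Hypothetical data product owned by the marketing domain.
    conn.execute(
        "CREATE TABLE marketing.campaign_touches (customer_id INTEGER, campaign TEXT)"
    )

    conn.executemany(
        "INSERT INTO sales.orders VALUES (?, ?)",
        [(1, 120.0), (2, 80.0), (1, 40.0)],
    )
    conn.executemany(
        "INSERT INTO marketing.campaign_touches VALUES (?, ?)",
        [(1, "spring_launch"), (2, "newsletter")],
    )
    conn.commit()
    return conn


def revenue_by_campaign(conn: sqlite3.Connection) -> dict:
    """Join the two domain-owned tables in one query and total revenue per campaign."""
    rows = conn.execute(
        """
        SELECT m.campaign, SUM(s.amount) AS revenue
        FROM marketing.campaign_touches AS m
        JOIN sales.orders AS s ON s.customer_id = m.customer_id
        GROUP BY m.campaign
        ORDER BY m.campaign
        """
    ).fetchall()
    return {campaign: revenue for campaign, revenue in rows}


if __name__ == "__main__":
    print(revenue_by_campaign(build_mesh_connection()))
```

The point of the sketch is the shape of the query: the consumer addresses each data product by its domain-qualified name, and the unification happens at query time rather than in a shared store.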
In other words, the data is now democratized. The result is “data as a product,” a standardized offering, available to anyone with permission. It’s secure, in compliance with local regulations, and now considerably more scalable. As additional data products are created and new domains come online, they follow the Data Mesh pattern and provide additional value to the downstream users. With the Data Mesh architecture, the domains creating the data are empowered to own decisions about which data is provided as a product.
Data Mesh empowers data consumers
The upshot here is automated, comprehensive, instant analytics at scale. Data scientists and data consumers with less expertise and training will now be able to access business data, conducting their own analysis focused on their own business needs. This self-service data infrastructure strategy represents, for the first time, a people-centric plan for data management: a faster and more effective way to get answers without taxing a DevOps team.
Zhamak Dehghani, Director of Emerging Technologies at Thoughtworks, is credited with creating this paradigm in 2019 at an O’Reilly conference (she named it later, when she literally wrote the book on the subject). She refers to Data Mesh as “a decentralized sociotechnical approach — concerned with organizational design and technical architecture.”
Data access drives insights
Data Mesh is also, in a sense, the next phase in the “anyone/anywhere” model that we’ve come to expect from cloud computing and data virtualization.
A business domain’s own applications and access tools are usually designed for its own users and their specific needs. And in an ideal situation, its data is local, so latency is minimal. But if members of one business unit seek data from another, they are limited by their own frameworks.
If they do gain access to that centralized data lake, its remote location (and size, most of which isn’t the business unit’s own data) typically increases latency. With a Data Mesh, it is easier to have systems interact, share their on-site data, and make the results available to a diverse group of business users.
These may be completely independent teams, say HR and R&D, or cross-functional teams with the same goals and often the same data (e.g., QA working with product management, or sales working with marketing). This new effortless transparency promises new levels of productivity.
Empowering data democratization
While widespread implementation and adoption aren’t going to happen overnight, many organizations are embracing Data Mesh architecture to democratize and scale their data.
However, this move puts responsibility on data teams to become truly autonomous. They will need to ingest and clean data themselves, curate the data based on downstream use cases, expose it to downstream users, and manage access control. At the same time, the more they invest in these fully-owned steps, the better results they can expect. This means we are entering a new era of mutually beneficial sharing, as each domain helps the others simply by transforming its data and offering it to the wider community.
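The steps a domain team takes on (clean, curate, expose, control access) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DataProduct class, the clean, curate, and publish helpers, the customer_contacts product, and the role names are all hypothetical, not part of any particular Data Mesh platform.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class DataProduct:
    """Hypothetical descriptor for a domain-owned data product."""

    name: str
    owner_domain: str
    description: str
    rows: List[Dict[str, object]]
    allowed_roles: Set[str] = field(default_factory=set)

    def read(self, role: str) -> List[Dict[str, object]]:
        # Access control: only roles approved by the owning domain may read.
        if role not in self.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {self.name!r}")
        # Hand out copies so consumers cannot mutate the domain's data.
        return [dict(row) for row in self.rows]


def clean(raw_rows: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Drop records without a customer_id and normalize email formatting."""
    cleaned = []
    for row in raw_rows:
        if row.get("customer_id") is None:
            continue
        email = str(row.get("email", "")).strip().lower()
        cleaned.append({**row, "email": email})
    return cleaned


def curate(clean_rows: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only the fields downstream users need; internal fields stay private."""
    return [{"customer_id": r["customer_id"], "email": r["email"]} for r in clean_rows]


def publish(raw_rows: Iterable[Dict[str, object]], allowed_roles: Iterable[str]) -> DataProduct:
    """Run the domain's own pipeline and expose the result as a data product."""
    return DataProduct(
        name="customer_contacts",
        owner_domain="sales",
        description="Cleaned customer contact details, one row per customer record.",
        rows=curate(clean(raw_rows)),
        allowed_roles=set(allowed_roles),
    )


if __name__ == "__main__":
    raw = [
        {"customer_id": 1, "email": " Ada@Example.com ", "internal_note": "vip"},
        {"customer_id": None, "email": "orphan@example.com", "internal_note": ""},
    ]
    product = publish(raw, {"marketing_analyst"})
    print(product.read("marketing_analyst"))
```

Keeping the description and the permitted roles on the product itself mirrors the idea that the owning domain, not a central team, decides what is offered and to whom.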
I’ll close with a parting thought from Zhamak, as she describes the overarching value of distributed Data Mesh architecture, with domain-owned data under a centralized access system. “Over the last decades, the technologies that have exceeded in their operational scale have one thing in common: they have minimized the need for coordination and synchronization.” As Data Mesh gains popularity, I’m looking forward to seeing how technologies like SQL and Starburst can assist organizations in their journey toward a decentralized data strategy.