Generative AI, large language models (LLMs), and an ever-increasing emphasis on the value of data have enterprises worldwide scrambling to optimize their data architectures and assets in anticipation of never-before-seen demand. Many organizations have focused solely on pulling all their data assets into one place, with little consideration of the support required to make this data accessible across their organizations in the future. In a way, this creates a culture of data selfishness, often overlooking the needs of teams and stakeholders who will develop new data access requirements in the near future.
Another trend we see as an aftereffect of centralized data hoarding is the tendency toward what we call “nuclear rewrites,” where data teams simply start from scratch rather than attempt to migrate to a new platform. As new and more efficient data tools are invented, there is a strong desire to adopt the latest technologies, often leading to frequent “re-architecture” exercises. While the desire to stay current is admirable, this constant re-engineering can come with a hefty price tag.
Introduced by Zhamak Dehghani in 2019, the data mesh framework provides a more flexible approach to data management. Grouping data into independent, focused data domains removes the need to centrally manage data for an entire organization, easing the burden on a single centralized team. Distributed ownership of different data domains allows multiple, smaller teams to serve data to stakeholders across the organization in a more flexible and accountable way.
At Annalect, we realized a data mesh approach would allow us to implement data governance at scale, within a short period of time, and without the need to re-architect everything from the ground up. After two years of living with a data mesh, we’ve made a number of valuable observations.
AWS Data Mesh example: Putting theory into practice
Zhamak Dehghani’s original paper on data mesh was brilliant. However, implementing the data mesh framework in practice can be challenging, to say the least, due to its deliberately non-prescriptive nature. At Annalect, we quickly learned that diligent planning is the key to a successful data mesh strategy.
Identify data owners and plan your mesh “on paper”
The classification and categorization of datasets took weeks and was a collaboration between product and tech teams across Annalect. Yet the exercise was time very well spent. “Paper is cheap,” and cataloging and classifying data sources is easy compared to re-engineering from scratch.
Image 1 — Annalect Data Mesh Vision
For our data mesh vision, as shown in the image above, we architected a framework where our data products have access to the multiple decentralized data nodes needed to produce output reports. Business functions maintain control over how shared data is accessed, who accesses it, and in what formats. We decided to further group data into domains by granularity:
- Events (time stamped)
- Audiences (data with a person-id identifier)
- Creative assets (creative file ID)
While the main objective was to locate all datasets across the organization, the data mesh framework allowed us to do so without forcing every data node into one architecture pattern. Instead, every node kept its existing architecture and storage, so long as it kept the type and granularity of data within its domain.
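To make the planning exercise concrete, here is a minimal sketch of what such a “paper” inventory might look like in code. The domains mirror the granularity groups above; the node names, owners, and S3 locations are purely illustrative, not Annalect’s actual assets.

```python
from dataclasses import dataclass
from enum import Enum

class Domain(Enum):
    """Hypothetical granularity-based domains from the planning exercise."""
    EVENTS = "events"              # time-stamped records
    AUDIENCES = "audiences"        # keyed by a person-id identifier
    CREATIVE_ASSETS = "creative"   # keyed by creative file ID

@dataclass
class DataNode:
    """One decentralized data node; storage and ownership stay local."""
    name: str
    domain: Domain
    owner_team: str
    storage_location: str  # for example, an S3 URI; the node keeps its own architecture

# A paper inventory: every node is listed, nothing is physically moved.
inventory = [
    DataNode("campaign_impressions", Domain.EVENTS, "ad-ops", "s3://example-events/impressions/"),
    DataNode("crm_segments", Domain.AUDIENCES, "crm-team", "s3://example-audiences/segments/"),
    DataNode("display_banners", Domain.CREATIVE_ASSETS, "studio", "s3://example-creative/banners/"),
]

for node in inventory:
    print(f"{node.domain.value:>9} | {node.owner_team:<8} | {node.name}")
```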
Building a data catalog
Once each domain and its owner were established, the next step was to give domain owners a way to share what they had by publishing it into a data catalog. This made the data available to data consumers while conforming to central governance standards.
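As a hedged illustration of what publishing into a catalog can look like on AWS, the sketch below registers one node’s schema and location in the AWS Glue Data Catalog with boto3. The database, table, columns, and bucket are hypothetical; the point is that only metadata is centralized while the data stays in the owning node.

```python
import boto3

# Hypothetical example: a domain owner publishes one dataset's schema and
# location to a central AWS Glue Data Catalog database. All names are made up.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_table(
    DatabaseName="events_domain",  # one catalog database per domain
    TableInput={
        "Name": "campaign_impressions",
        "Description": "Time-stamped impression events, owned by the ad-ops team",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet", "owner": "ad-ops"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "event_ts", "Type": "timestamp"},
                {"Name": "campaign_id", "Type": "string"},
                {"Name": "person_id", "Type": "string"},
                {"Name": "impressions", "Type": "bigint"},
            ],
            # Data stays in the owning node's bucket; only metadata is shared.
            "Location": "s3://example-events/impressions/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```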
To use a hotel metaphor: we no longer require every room to paint its walls the same color or to have the same furnishings. The data mesh hotel manages all of its assets while allowing a diverse style in each room, without creating a tendency toward multi-year macro renovations. As long as a room is located in the hotel, uses the same key management and HVAC, and is part of the standard booking directory, it can have its own unique style, serving guests a bespoke experience.
This distributed approach allowed Annalect to renovate its data nodes selectively, keeping the local node’s architecture intact based on what made the most sense for each specific data type.
Image 2 — The Distributed Data Mesh Approach
For example, for small structured datasets we prefer to store data in Amazon Simple Storage Service (Amazon S3) buckets with Amazon Athena as the compute layer. For larger datasets, we use Apache Spark on Amazon EMR in combination with Amazon Redshift. In the end, for each data node, we had to decide how to design the architecture across the following areas (a query sketch for the S3-plus-Athena pattern follows the list):
- Data ingestion
- Data transformation
- Data storage
- Data serving
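As referenced above, here is a minimal sketch of the S3-plus-Athena serving pattern for small structured datasets, using boto3 to run a query and poll for its result. The database, table, and results bucket are assumptions for illustration only.

```python
import time
import boto3

# Hedged sketch: serving a small, S3-backed dataset through Amazon Athena.
athena = boto3.client("athena", region_name="us-east-1")

query = """
    SELECT campaign_id, SUM(impressions) AS total_impressions
    FROM events_domain.campaign_impressions
    WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'
    GROUP BY campaign_id
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "events_domain"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then read back the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```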
The result is a set of data products that are easy to explore, build, govern, access, and share.
New tools for data access
The main benefit of the data mesh approach is democratized, yet governed, data access for various data consumers. To implement a cross-team, cross-organization consumption pattern, we needed to leverage new tools that provide smooth, efficient data consumption across various data nodes with minimal data movement.
In the end, data consumers rarely care whether there is a data lake or data lakehouse under the hood, or which data pipeline architecture is used. What they do care about is being able to consume that data quickly, without waiting for access, and through the delivery mechanism of their choosing (dashboards, APIs, analytics workbenches).
Image 3 — Visualization of Distributed Compute functionality
Distributed compute
Tools like Trino, Starburst, Amazon Redshift Spectrum, and Amazon Athena separate storage from compute. This creates a single point of entry with data warehouse-like capabilities, without building an actual warehouse. The separation of storage and compute lets us scale each component independently, using storage services like Amazon S3 (a lower-cost storage option) while scaling compute resources as needed.
The significance of distributed compute for easier access control and provisioning across multiple datasets cannot be overstated. In the past, if you needed to join 10 different datasets in 10 different databases, you would need individual access to each of them, plus a compute facility to join them. With distributed compute, you can connect the datasets within a matter of minutes and execute a single SQL query across all of them. Distributed compute is the practical query layer that turns the theory of the data mesh framework into practice.
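The sketch below illustrates this idea with the open source Trino Python client: a single SQL statement joins an events table in an S3-backed Hive catalog with an audiences table in a PostgreSQL catalog. The coordinator host, catalogs, schemas, and tables are assumptions, not a description of Annalect’s environment.

```python
import trino

# Illustrative only: one SQL statement joins datasets that live in different
# systems (an S3/Hive node and a PostgreSQL node) through a Trino coordinator.
conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
)
cursor = conn.cursor()

cursor.execute("""
    SELECT a.segment_name,
           SUM(e.impressions) AS impressions
    FROM hive.events_domain.campaign_impressions AS e
    JOIN postgresql.audiences_domain.segments    AS a
      ON e.person_id = a.person_id
    GROUP BY a.segment_name
    ORDER BY impressions DESC
""")

for segment, impressions in cursor.fetchall():
    print(segment, impressions)
```

The same pattern scales to the ten-database scenario described above: each additional source becomes another catalog referenced in the query, with access governed centrally at the coordinator.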
Image 4 – Visualization of AWS Clean Rooms inside data mesh
Data clean rooms
Data clean rooms allow us to securely query data from partners and clients, without moving the underlying data from their environment to ours. With AWS Clean Rooms, we are now able to join data across multiple entities while allowing each partner full control over how their data is shared, and how it can be queried.
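A minimal sketch of running such a query with boto3, assuming a Clean Rooms collaboration and membership already exist, might look like the following. The membership ID, table names, and results bucket are placeholders; the query runs inside the collaboration, and only the approved aggregate output lands in the results location.

```python
import boto3

# Hedged sketch: submit a SQL query to an existing AWS Clean Rooms collaboration.
# Each partner's underlying data never leaves their own account.
cleanrooms = boto3.client("cleanrooms", region_name="us-east-1")

response = cleanrooms.start_protected_query(
    type="SQL",
    membershipIdentifier="11111111-2222-3333-4444-555555555555",  # placeholder
    sqlParameters={
        "queryString": """
            SELECT p.product_category,
                   COUNT(DISTINCT c.person_id) AS reached_people
            FROM campaign_exposures AS c
            JOIN partner_catalog AS p
              ON c.creative_id = p.creative_id
            GROUP BY p.product_category
        """
    },
    resultConfiguration={
        "outputConfiguration": {
            "s3": {
                "resultFormat": "CSV",
                "bucket": "example-cleanrooms-results",
                "keyPrefix": "joint-reach/",
            }
        }
    },
)
print(response["protectedQuery"]["id"], response["protectedQuery"]["status"])
```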
Overcoming data selfishness and collaborating externally
When Annalect decided to make the move to a data mesh approach, the main objective was to liberate ourselves from nuclear rewrites and achieve data sanity within our organization. However, one very unexpected benefit of the transition was the ability to collaborate with external partners.
Marketing is a continuously evolving field. Being diligent about consumers’ privacy boundaries, and working with clients and partners securely and dynamically, is becoming an essential concern for businesses. In the past, companies would send data directly to each other to get a job done; in the marketing and media world, that meant activation and performance optimization.
Image 5 – AWS Clean Rooms provides a medium for secure collaboration.
With new technologies like data clean rooms and distributed compute platforms, agencies can work with client and partner data without having to copy the data from one place to another. They never see the underlying sensitive data in the first place.
At Annalect, we frequently need to combine performance marketing data from campaigns we run with additional metadata from our partners. Under the old model, we would need to get a copy of our partner’s data, or send them ours, to get the enriched version.
In the new world of collaborative data meshes, leveraging clean rooms and distributed compute, we can join the data without disclosing full datasets or sending data to each other. This gives all parties involved the ability to collaborate easily, without unnecessary data copies and transfers.
Looking to the future, where there is a growing need to open data assets to various LLMs, the data mesh framework future-proofs data assets and simplifies the process of providing data to new AI agents and models. Instead of restructuring to accommodate new LLM applications, one can employ distributed compute technologies to integrate data from multiple domains and make it accessible to LLMs, treating them as just another stakeholder.
Furthermore, if some data needs to be vectorized, converting structured data into embeddings can be conveniently incorporated as an additional step in preparing the data product. As we rapidly advance through new and inventive ways of using retrieval-augmented generation (RAG), data mesh is a testament to the principle that data sharing and utilization are fundamental prerequisites for any data-driven environment.
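As a minimal sketch of that additional step, assuming Amazon Bedrock’s Titan text-embedding model is available in the account (the model ID and response fields are assumptions), a structured row from a data product could be serialized and vectorized like this before being loaded into a vector store for RAG:

```python
import json
import boto3

# Hedged sketch: vectorize one structured record as part of data product preparation.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    """Turn one row of structured data, rendered as text, into an embedding."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",  # assumed model ID
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

# A structured record from a data product, serialized before vectorization.
row = {"campaign_id": "C-123", "segment": "sports fans", "impressions": 54210}
vector = embed(json.dumps(row))
print(len(vector))  # embedding dimensionality, ready to load into a vector store
```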
Conclusion
The data mesh framework is a revolutionary concept that is transforming how companies manage and utilize data assets across an organization. Data mesh enables agile, distributed architectures and establishes clear data ownership. It also makes data widely available to various stakeholders for different types of consumption, both internally and externally.
Hopefully, you are now inspired to consider the benefits of cross-collaboration through modern data management techniques.
Originally published on AWS.