Discoverable. Understandable. Trustworthy. These are just a few of the key ideas behind a Data Mesh. Go through all of them and you'll quickly find a common element that's key to achieving each one: metadata.

Today, metadata has become big data. Every component of the modern data stack, and every user interaction on it, generates metadata. It's not hyperbolic to say that the potential of metadata is immense; in fact, I think it holds the key to unlocking true value in your Data Mesh. However, metadata, just like the Data Mesh, has some growing up to do. As any practitioner knows, the explosion in the size and scale of metadata has made it difficult for data teams to actually use it. All too often, it ends up unused in yet another siloed tool.
I’m a lifelong data practitioner who’s spent a decade running data teams and building great data cultures. I’ve had tons of failures and some successes, like building India’s national data platform. I’ve dealt with a ton of chaos along the way, which is what prompted my team to build tools to make ourselves more agile. Four attempts later, we built an active metadata platform that truly transformed the way we work.
I had the opportunity to present at Datanova 2022: The Data Mesh Summit, Starburst's second annual conference. I shared my thoughts on modern metadata, where it's falling short in our data stacks, and how we can use it to power the Data Mesh dream. I've captured the key takeaways below.
What’s not working
The way metadata is used today is just plain wrong. We compile all our metadata from a bunch of separate, siloed tools, then put it in its own siloed data catalog or governance tool. We expect data practitioners to switch over to that catalog whenever they need context, but that just doesn't work. Users want context where they are, when they need it. When someone is in the middle of their daily workflow, like looking at a dashboard in a BI tool, they need to know whether they can trust that dashboard right then and there, not after switching to their data catalog, logging in, searching for the relevant data assets, and finding out when they were last updated.
The other issue with our current approach to metadata is that it treats every data person the same. We're all used to incredibly personalized experiences in our everyday lives; just think of Netflix's curation or Amazon's recommendations. Data platforms today have tons of data about their users, like what searches the CTO runs on Monday mornings or which assets data engineers use most in their pipelines. Why aren't we using this data to personalize data experiences for different people across different domains?

Lastly, today's metadata mostly lives within a top-down governance model. Most data governance tools were built for centralized governance structures, with "committees" setting rules and policies. This is the opposite of the bottom-up, democratized world of a Data Mesh.
The future of metadata
Take a moment and think about the experiences in your daily life. You use Segment to do what used to be a massive manual effort: pulling customer data from dozens of websites, apps, and tools. Then Zapier steps in to make this information actionable; if something happens in one tool, it triggers a series of actions across other tools. After you log out of these tools and go home, you switch on Netflix, which has curated the latest shows (and even created some new ones) based on exactly what you want to watch.
Why can’t we make this possible in our data platforms?
A truly intelligent data management system would put end users and their needs first. It would use metadata to naturally federate and govern our data systems, powering automated, personalized experiences built around today's diverse data users. It would be designed around questions like, "What does the experience for a data analyst look like? What about a data analyst in marketing or sales? How is that different from the experience for a data engineer or business user?" Then we can drive this personalization and knowledge back into the tools that data people already use every day, like Jira, Slack, and Microsoft Teams.
For example, imagine that you're in your Looker dashboard and can see all the context from the other tools in your data ecosystem. You can immediately learn who the owners and experts are, whether you can trust the dashboard, and whether the pipeline powering it is up to date. This is when data becomes "natively accessible", one of the most important goals of the Data Mesh.
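To make that concrete, here's a minimal sketch of the kind of context lookup an embedded BI-tool plugin could run. Everything in it (the `AssetContext` class, the `trust_summary` helper, and the field names) is hypothetical; a real implementation would pull these values from your catalog or metadata platform's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical snapshot of the context a BI-tool plugin could surface next
# to a dashboard. The class and field names are illustrative, not a real API.
@dataclass
class AssetContext:
    owner: str
    experts: list[str]
    last_pipeline_run: datetime
    certified: bool  # e.g. verified by the owning domain team

def trust_summary(ctx: AssetContext, freshness_sla: timedelta) -> str:
    """Answer "can I trust this dashboard?" inline, where the user already is."""
    stale = datetime.now(timezone.utc) - ctx.last_pipeline_run > freshness_sla
    status = "OK" if ctx.certified and not stale else "CHECK WITH OWNER"
    return (
        f"[{status}] owner: {ctx.owner} | experts: {', '.join(ctx.experts)} | "
        f"last updated: {ctx.last_pipeline_run:%Y-%m-%d %H:%M} UTC"
    )

# What an embedded sidebar might render for one dashboard
ctx = AssetContext(
    owner="jane@example.com",
    experts=["raj@example.com", "mei@example.com"],
    last_pipeline_run=datetime.now(timezone.utc) - timedelta(hours=3),
    certified=True,
)
print(trust_summary(ctx, freshness_sla=timedelta(hours=24)))
```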
Metadata and the Data Mesh
One of the key concepts behind a Data Mesh is federated computational governance: a system that uses feedback loops and bottom-up input from across the organization to naturally federate and govern data products. Metadata is what makes this possible.
An automated workflow for finding and governing data product health.
(Image from my talk at Datanova 2022: The Data Mesh Summit.)
For example, consider the automated metadata workflow above. Using metadata about which assets people actually use, we can compute a health score for each data product based on how much it's used and how recently it's been updated. We can then sort products by health score and match each one to its data product owner. For great products, the owner can get a Slack message: "Congratulations! Your data products are doing great." Low-quality or outdated products can be automatically deprecated or removed from the discovery experience, with a ticket to improve the product opened in Jira for the relevant owner. A minimal sketch of this workflow is below.
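Here's what that could look like in practice, as a minimal Python sketch. The scoring formula, the thresholds, and the `notify_slack` and `open_jira_ticket` helpers are all assumptions; in a real system they would wrap your Slack and Jira integrations and be tuned to your own usage metadata.

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    name: str
    owner: str
    weekly_queries: int      # usage metadata, e.g. parsed from query logs
    days_since_update: int   # freshness metadata from pipeline runs

def health_score(p: DataProduct) -> float:
    """Toy score: reward usage, penalize staleness. Real scoring would be
    tuned to the organization's own metadata."""
    usage = min(p.weekly_queries / 100, 1.0)              # cap at 100 queries/week
    freshness = max(1.0 - p.days_since_update / 30, 0.0)  # stale after ~30 days
    return round(0.6 * usage + 0.4 * freshness, 2)

def notify_slack(user: str, message: str) -> None:
    print(f"slack -> {user}: {message}")          # stand-in for a Slack API call

def open_jira_ticket(user: str, summary: str) -> None:
    print(f"jira -> assignee {user}: {summary}")  # stand-in for a Jira API call

def run_health_workflow(products: list[DataProduct]) -> None:
    for p in sorted(products, key=health_score, reverse=True):
        score = health_score(p)
        if score >= 0.8:
            notify_slack(p.owner, f"Congratulations! '{p.name}' is doing great ({score}).")
        elif score < 0.3:
            # Deprecate from discovery and route the fix to the owner.
            notify_slack(p.owner, f"'{p.name}' was hidden from discovery (score {score}).")
            open_jira_ticket(p.owner, f"Improve or retire data product '{p.name}'")

run_health_workflow([
    DataProduct("orders_daily", "jane@example.com", weekly_queries=240, days_since_update=1),
    DataProduct("legacy_churn", "raj@example.com", weekly_queries=2, days_since_update=90),
])
```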
This is what I see as the future of the Data Mesh: a true action layer that uses metadata to power the ideas at its core. By bringing together metadata from lots of different places, we can automate and orchestrate fundamental Data Mesh concepts like democratization, discoverability, trust, security, and accessibility.
Metadata in practice
By now, I’m sure you’re asking, “What does it mean to truly implement active metadata practices in a Data Mesh?” Metadata allows you to shift from siloed context to embedded context (domains), generalized experiences to personalized experiences (data products), minimum automation to truly autonomous (self-service infrastructure), and top-down governance to democratized governance (federated computation governance).
But what’s the first step to take this from theory to practice? Create data product shipping standards. I’ve already touched on the characteristics of “data as a product,” a key concept behind the Data Mesh. Each of these can be codified into shipping standards at the organization or Mesh level.
The characteristics of “Data as a Product,” a key concept behind the Data Mesh.
(Image from my talk at Datanova 2022: The Data Mesh Summit.)
For example, for "Understandable," I've seen a lot of success using the 5W1H framework (who, what, when, where, why, and how) to define what makes a data product understandable. As an organization, you can choose which of the elements in the framework matter most and focus on those, as in the sketch below.
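As an illustration, here's a hypothetical 5W1H shipping gate in Python. The required fields, their names, and the blocking behavior are all assumptions about how an organization might codify its own standards, not a prescribed schema.

```python
# Hypothetical 5W1H shipping gate: a product's metadata must answer the
# questions the organization has decided matter most. Field names are
# illustrative, not a standard schema.
REQUIRED_5W1H = {
    "who":   "owner",                # who owns and supports this product?
    "what":  "description",          # what does it contain?
    "when":  "update_schedule",      # when is it refreshed?
    "where": "source_systems",       # where does the data come from?
    "why":   "business_purpose",     # why does it exist?
    "how":   "access_instructions",  # how do consumers use it?
}

def understandability_gaps(metadata: dict) -> list[str]:
    """Return the 5W1H questions this product's metadata fails to answer."""
    return [q for q, field in REQUIRED_5W1H.items() if not metadata.get(field)]

product = {
    "owner": "jane@example.com",
    "description": "Daily order facts for the retail domain",
    "update_schedule": "daily at 06:00 UTC",
    "source_systems": ["postgres.orders", "stripe.payments"],
    # missing: business_purpose, access_instructions
}

gaps = understandability_gaps(product)
if gaps:
    print(f"Blocked from shipping; unanswered: {', '.join(gaps)}")
```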
The next step is fully automating these processes. This makes infrastructure truly self-service, a crucial part of the Data Mesh paradigm. For example, to make data products understandable, you can bring in context from across your data stack: parsing SQL logs can automatically rank the popularity of each data product down to the column level, and context from data pipelines can generate column descriptions based on a data product's source. A sketch of the SQL-log idea follows below.
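Here's a minimal sketch of that column-level popularity ranking. The naive regex stands in for a real SQL parser (in practice you'd reach for a library like sqlglot or sqlparse), and the one-query-per-entry log format is an assumption.

```python
import re
from collections import Counter

# Naive column extraction from query logs. A real implementation would use
# a proper SQL parser; this regex only handles simple "SELECT a, b FROM t".
SELECT_COLS = re.compile(r"select\s+(.*?)\s+from", re.IGNORECASE | re.DOTALL)

def column_popularity(query_log: list[str]) -> Counter:
    counts: Counter = Counter()
    for query in query_log:
        match = SELECT_COLS.search(query)
        if not match or match.group(1).strip() == "*":
            continue  # skip unparseable queries and SELECT *
        for col in match.group(1).split(","):
            counts[col.strip().lower()] += 1
    return counts

# Example log entries (assumed format: one query per entry)
log = [
    "SELECT customer_id, order_total FROM orders_daily",
    "SELECT customer_id FROM orders_daily WHERE region = 'EU'",
    "SELECT * FROM orders_daily",
]

for col, n in column_popularity(log).most_common():
    print(f"{col}: {n} queries")
```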
Automation like this may require new tools, but at the end of the day, the Data Mesh is a cultural and mindset shift. That's why the last important step is to incorporate human-driven standards and rituals into the product shipping process. You're asking your engineers and developers to start thinking about their roles differently, and that's not easy. It's about cultural change, not just tech. Set your data team's values (e.g. reusability), create rituals to help everyone achieve them (e.g. a Documentation Hour), and you'll eventually see a real shift in people's mindsets and productivity.

All in all, it's clear that the future of the Data Mesh looks bright. For more of my thoughts on the subject, check out my full talk from Datanova 2022: The Data Mesh Summit!