In this post, we compare data mesh and data lake to understand how they differ and how you can use a centralized data lake within a distributed data architecture that emphasizes data products.
Centralized vs. decentralized architecture
Data lakes are centralized repositories designed to store vast amounts of data in a scalable and cost-effective manner.
Data mesh decentralizes data ownership, distributing responsibilities to individual domains or business units, fostering a more collaborative and scalable approach.
Data governance as it relates to a data lake vs data mesh
Governance in a data lake can run into problems with both security and data accessibility. For example, access is often over-provisioned, and data ownership may be restricted.
Data mesh integrates federated computational governance, enabling each domain to have autonomy over its data while ensuring overall compliance. Data mesh ensures that each domain or business unit owns and manages its data, promoting self-serve data infrastructure and accessibility.
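As a rough illustration of federated computational governance, the sketch below has each domain describe its own data product while a small set of organization-wide policies runs automatically against every product. The `DataProduct` class, the policy functions, and the sample domains are all hypothetical names invented for this example. They do not come from any particular data mesh platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class DataProduct:
    """A domain-owned data set plus the metadata that policies inspect."""
    name: str
    domain: str
    owner: str
    column_tags: Dict[str, str]  # column name -> classification, e.g. "pii"


# A global policy returns an error message, or None when the product complies.
Policy = Callable[[DataProduct], Optional[str]]


def require_owner(product: DataProduct) -> Optional[str]:
    return None if product.owner else f"{product.name}: no owner assigned"


def require_classification(product: DataProduct) -> Optional[str]:
    untagged = sorted(c for c, tag in product.column_tags.items() if not tag)
    return f"{product.name}: unclassified columns {untagged}" if untagged else None


# Policies are defined once, centrally; domains keep control of everything else.
GLOBAL_POLICIES: List[Policy] = [require_owner, require_classification]


def check_compliance(product: DataProduct) -> List[str]:
    """Run every organization-wide policy against one domain's data product."""
    violations = []
    for policy in GLOBAL_POLICIES:
        message = policy(product)
        if message is not None:
            violations.append(message)
    return violations


# Two hypothetical domains publish their own products.
orders = DataProduct("orders", "sales", "sales-data-team",
                     {"order_id": "public", "email": "pii"})
clicks = DataProduct("clicks", "marketing", "",
                     {"session_id": "public", "ip_address": ""})

orders_violations = check_compliance(orders)
clicks_violations = check_compliance(clicks)
```

The point of the design is the split in responsibility: the domain decides what its product contains, while compliance is checked by code rather than by a central review team.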
Data lake vs Data mesh: Scalability
While data lakes can scale, they demand constant attention and maintenance, and they can run into scalability issues as data volumes increase. A modern data lake improves data quality by adding key data warehousing capabilities such as transactions, schemas, and governance.
Data mesh addresses scalability by distributing data ownership, allowing each domain to independently optimize storage and compute, resulting in a more scalable solution.
Data lake architecture
In response to the challenges of data warehouses, the data lake architecture emerged. Many were thrilled with this new option because it supported data science and machine learning model training workflows, as well as parallelized access to data.
The data lake architecture is similar to a data warehouse in that the data gets extracted from the operational systems and is loaded into a central repository.
However, unlike a data warehouse, a data lake holds a vast amount (terabytes and petabytes) of structured, semi-structured, and unstructured data in its native format until it’s needed. Once the data becomes available in the lake, the architecture gets extended with elaborate transformation pipelines to model the higher value data and store it in lakeshore marts. Essentially, we moved from ETL to ELT processing.
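To make the ETL-to-ELT shift concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for the lake's storage and query engine. The table names (`raw_orders`, `revenue_by_country`) and the sample rows are made up for illustration. A real lake would hold files in object storage rather than rows in SQLite, but the order of operations is the same: load first, transform afterwards inside the store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the records as-is, every field still an untyped string.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, country TEXT, amount TEXT)")
raw_rows = [
    ("1001", "US", "19.99"),
    ("1002", "DE", "5.00"),
    ("1003", "US", "30.01"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: runs later, inside the store, and writes a modeled "mart" table.
conn.execute(
    """
    CREATE TABLE revenue_by_country AS
    SELECT country, ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_orders
    GROUP BY country
    """
)

mart = conn.execute(
    "SELECT country, revenue FROM revenue_by_country ORDER BY country"
).fetchall()
```

Because the raw table is kept, the transformation can be rewritten and rerun later without going back to the operational systems.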
The data lake architecture is often described in the following way:
- Data is extracted from operational databases
- Data is raw and minimally formatted
- Data is accessed through the object storage interface
- Data lakes are designed to handle enterprise-grade analytics
- Data lakes also answer big questions such as: “How is your business doing?” and “What investments and opportunities should you be making?”
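The first three points in the list above can be sketched in a few lines. The example below uses a plain dictionary as a hypothetical stand-in for an object store, lands two files in their native formats, and applies structure only when the data is read, an approach commonly called schema-on-read. The helper names (`put_object`, `list_keys`, `read_orders`) and the object keys are assumptions made for this sketch and are not the API of any real storage service.

```python
import csv
import io
import json
from typing import Dict, List

# Stand-in for an object store: keys map to raw bytes, nothing more.
object_store: Dict[str, bytes] = {}


def put_object(key: str, body: bytes) -> None:
    object_store[key] = body


def list_keys(prefix: str) -> List[str]:
    return sorted(k for k in object_store if k.startswith(prefix))


# Land data in its native format; no schema is enforced on write.
put_object("raw/orders/2024-01-01.json", b'[{"order_id": 1, "amount": 19.99}]')
put_object("raw/orders/2024-01-02.csv", b"order_id,amount\n2,5.00\n")


def read_orders(key: str) -> List[dict]:
    """Apply structure only when the data is read (schema-on-read)."""
    text = object_store[key].decode("utf-8")
    if key.endswith(".json"):
        rows = json.loads(text)
    else:
        rows = list(csv.DictReader(io.StringIO(text)))
    # Normalize both formats to the same typed shape.
    return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])}
            for r in rows]


orders = [row for key in list_keys("raw/orders/") for row in read_orders(key)]
```

Writing is cheap because nothing is validated up front. The cost moves to the readers, who each have to know how to interpret every format in the lake.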
You can see from the visual below that a data lake architecture generates complex, unwieldy data pipelines, resulting in unmanaged, untrustworthy, and inaccessible data sets. Also, as data lakes grow in size and usage, it becomes expensive to scale them and to meet the performance demands of the business. Unfortunately, a centralized team still performs the ELT, so when business users request a change, they again have to wait for that team to respond. As with data warehouses, this approach keeps the value of data largely in the hands of data analysts, which ultimately restricts the business's ability to make informed, data-driven decisions.
Related reading: Data mesh architecture
How Starburst helps with your data lake and data mesh strategy
“Data Mesh is certainly the future for our business, and probably for many others, particularly ones which have a legacy of acquisitions, and the need for merging of different data sets to form a new larger entity. Having the ability to query data where it resides using Starburst is enormously powerful and makes a huge impact on the ability for data to provide answers.” Richard Jarvis, CTO, EMIS Group
“The implementation of Starburst on the data lake allows analysts and data scientists quick and simple access to data that exists in the organization for business value and insights. ETL processes that took many months and at high costs have become extremely fast and accessible to analysts at negligible costs.” — Shlomi Cohen, EVP, Head of Business Data and Analytics, Bank Hapoalim