Data Sharing Culture With Your Data Lakehouse

By Duy Huynh, Solutions Architect, Starburst

Corporate data is without doubt a valuable asset. Yet it's an open secret that data alone isn't inherently valuable, nor does it produce valuable insights on its own. Data and analytics leaders also know that managing data and creating insights are insufficient to accelerate digital business transformation: these key activities must deliver measurable business outcomes. One way to meet this corporate imperative is to embrace data sharing.

Defining data sharing

Data sharing enables real-time access to data from various sources, internal and external, so that data consumers can synthesize insights and collaborate cross-functionally with various analytical teams. Organizations that cultivate a data-sharing culture with the right infrastructure improve data discoverability, use, and reporting. Business leaders and departments across the organization that operate with trusted, data-driven reports are in a better position to meet their corporate goals and objectives.

According to the Sixth Annual Gartner Chief Data Officer Survey, respondents who successfully increased data sharing led data and analytics teams that were 1.7 times more effective at showing demonstrable, verifiable value to data and analytics stakeholders.

What are the benefits of data sharing?

There are numerous benefits to data sharing, particularly in our post-COVID world. In fact, it has become a business necessity for users to access data when they need it. When an organization can share and combine data from different sources, internal systems, customers, and vendors alike, it can increase the overall performance and value of its products and services. Data sharing also enables an unprecedented level of collaboration, which drives data-backed decision making and improves business outcomes.

Data silos impede data sharing

Without data sharing between systems, people, and departments, some teams may be unintentionally kept in the dark about what is going on in other parts of the organization. This is the problem known as data silos: an environment with incomplete and inconsistent data, duplicate data sets, and less collaboration between users, all of which undermines the productivity and efficiency needed to maximize business and social value through shared data.

Barriers to data sharing: security, regulations, vendor lock-in

Many data and analytics pros know that there are more than a few challenges to enabling data sharing. The first is establishing security and privacy protocols that keep critical business information secure without compromising privacy or productivity. In today's global economy, it's impossible to overstate the importance of being able to access data efficiently and securely, regardless of where it resides.

Also, as external data sharing becomes increasingly common, data use agreements and regulatory requirements, including data localization laws, make it difficult for data platform teams to keep up and manage a compliant system. Historically, data was locked in a warehouse, and sharing was only possible from within that warehouse. But imagine what you could do with your data if storage were not a constraint limiting your ability to share it. You'd have access to all data stores and various data types (structured, semi-structured, and unstructured) with the tools and systems of your choice.

A data sharing mindset begins with discernment

To cultivate an environment that fosters data sharing, work with your business leaders across departments to create a data-sharing mindset. Distinguish how your data management strategy applies to your data lakes, databases, CRMs, ERPs, and data warehouses. For instance, how is data ingested, and from which sources?

Yes, data warehouses enabled structured data for business intelligence and reporting, but when businesses discovered the power of unstructured data in the form of photos, spreadsheets, documents, videos, and more, organizations needed a new approach. Since 80-90% of the data available to organizations is unstructured or semi-structured, the limits of the data warehousing model led to the data lake, where data is stored in its raw format.

Data lakes built for querying typically use open file formats produced with ELT/ETL tools. The initial landing zone may contain JSON, CSV, XML, or binary formats from all of your source systems, including large exports of on-premises databases.

Parquet, ORC, and Avro are the traditional formats for query usage: they are compressed and carry their schema with them. By themselves, these file formats have neither the data security features fundamental to data sharing nor built-in reliability. Instead, data engineers and architects decide how best to partition data and organize their lakes to achieve the best performance for specific query use cases.
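
As a minimal sketch of what that partitioning decision looks like in practice, here is a hedged Trino example; the hive.lake catalog and schema and the events table are hypothetical names:

```sql
-- Hypothetical Parquet table, partitioned by date so queries that filter
-- on event_date read only the matching partitions.
CREATE TABLE hive.lake.events (
    user_id    BIGINT,
    event_type VARCHAR,
    payload    VARCHAR,
    event_date DATE
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['event_date']  -- partition columns must come last
);

-- Partition pruning: this scans one partition instead of the whole table.
SELECT event_type, count(*) AS events
FROM hive.lake.events
WHERE event_date = DATE '2024-05-01'
GROUP BY event_type;
```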

Most understand there are always trade-offs depending on the information they want to retrieve, especially at scale. Traditionally, after data was transformed into a clean final format, it had to be moved into a dedicated enterprise warehouse to be served to end users.

The data lakehouse addresses the challenges of data warehouses and data lakes

While the data lake solved a lot of the unstructured data problems and decoupled compute from storage, organizations that aren't careful can end up with data swamps. The lakehouse, meanwhile, is a term coined for next-generation data lakes with enterprise warehouse features.

Open table formats such as Iceberg, Delta Lake, and Hudi enable next-generation capabilities that previously were not available: ACID (atomicity, consistency, isolation, and durability) transactions at high performance, alongside reliability, schema evolution, and time travel, to name a few. These table formats, paired with a computational engine like Trino (the engine behind Starburst), enable fast data access, at scale, at significantly lower cost than traditional data warehouses.
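
As a hedged illustration of those capabilities with Trino's Iceberg connector (the iceberg.clean.orders table name is hypothetical), schema evolution and time travel are plain SQL:

```sql
-- Schema evolution: add a column without rewriting existing data files.
ALTER TABLE iceberg.clean.orders ADD COLUMN discount_pct DOUBLE;

-- Time travel: query the table as it existed at an earlier point in time.
SELECT count(*) AS orders_then
FROM iceberg.clean.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-05-01 00:00:00 UTC';
```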

Business, legal, & engineering are involved in governance

A proper data lake setup with the next-generation table formats described above is fundamental to enabling your users to share data, regardless of who is accessing it. Modern query systems can apply computational governance on top of your data lake.

For data sharing to work properly, it requires a holistic approach within the organization. Business leaders and domain owners are best suited to understanding what the data looks like and what should be shared.

Also, communicating what domains look like to data architects is vital if the data lake model is to be visible to your users and governance systems. Below is one example of what a lake could look like; an enterprise may have many lakes in parallel under different projects or cloud subscriptions, depending on which cloud they have deployed on. Some organizations include a lake just for development or exploration, a lake just for raw data collection, and a lake with transformed and curated data.
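
Because the later sections refer to /raw-transformed and /clean zones, here is a hedged sketch of such a layout; the exact zone names and descriptions are illustrative:

```
/landing           raw JSON, CSV, XML, and binary exports from source systems
/raw-transformed   data converted to open, partitioned formats (Parquet, ORC, Avro)
/clean             curated, query-ready data sets exposed to analysts and BI tools
```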

Having meaningful structure in your data lake makes mapping governance and security to your user groups a little simpler. There also needs to be a comprehensive data security framework that can be overlaid on top of this system. Tools like Apache Ranger (on which Starburst has built additional features such as dynamic data masking and row-level filters) enable centralized security using an interface or REST APIs.
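
Ranger enforces these policies centrally rather than in your SQL, but as a hedged illustration of what dynamic data masking and row-level filtering effectively do, here is an equivalent view over a hypothetical customers table:

```sql
-- Illustration only: a masking policy and row filter behave like this view,
-- except that Ranger applies them centrally and per user or group.
CREATE VIEW lake.clean.customers_masked AS
SELECT
    customer_id,
    -- dynamic data masking: expose only the last four digits of the phone number
    concat('***-***-', substr(phone, length(phone) - 3)) AS phone,
    region
FROM lake.clean.customers
-- row-level filter: hide EU records from users without EU entitlements
WHERE region <> 'EU';
```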

Connected governance improves data sharing

Gartner refers to the ability to federate governance as connected governance. Whether you use role-based access control or attribute-based access control, this gives your teams granular control over which users can access which data. Sharing data between disparate systems has always been difficult without tools that can federate data and ensure proper access controls are enforced.
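
As a minimal sketch of role-based access control in SQL (Trino supports these statements for connectors with SQL-standard access control; the role, table, and user names are hypothetical):

```sql
-- Group entitlements under a role instead of granting per user.
CREATE ROLE analyst;

-- Analysts may read the curated zone but not modify it.
GRANT SELECT ON clean.orders TO ROLE analyst;

-- Add a user to the role; revoking the role later removes access in one step.
GRANT analyst TO USER alice;
```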

When the right governance is applied (accountability, lineage, and auditing), users are ready to explore their data on the lakehouse. Finally, ensure the legal teams are connected to the infrastructure teams building governance to ensure compliance.

Empower your users with self-serve queries

With governance in place, a self-serve infrastructure empowers data analysts to create their own insights. With Trino (or Starburst) configured against your data lake, all of your data can be accessed with one tool. Your lakehouse can drive your everyday data needs, and you can enrich lake data with data from other systems, making data access simple. Your internal teams will find it easier than ever to work with, explore, and share data insights from an analytical perspective. Thinking back to the data lake model above, /raw-transformed and /clean will contain the queryable files for your system.
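
Enriching lake data with data from another system is a single federated query in Trino; the iceberg and postgres catalog names and the tables here are hypothetical:

```sql
-- One query spans two catalogs: curated lake data is joined with a live
-- operational table in PostgreSQL, with no exports or copies needed first.
SELECT o.order_id, o.order_total, c.segment
FROM iceberg.clean.orders AS o
JOIN postgres.public.customers AS c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01';
```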

Users can now do some amazing things to share their insights: they can explore the data using ANSI SQL, create Data Products, and even use Fault Tolerant Execution to create their own datasets and filesets without the need to fully code on the lake.

If we look back at our lake model, analysts can conduct their own self-serve batch workloads. First, they query the data sets they need. Then, they can create data products and make them discoverable to the rest of the organization using SQL. These data products can be listed in a data cataloging tool like Amundsen or DataHub, or viewed within the BI tools themselves. Developers can use the Starburst REST API to create their own custom internal data-sharing services.
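
A hedged sketch of that workflow; the schema and table names are hypothetical, and Fault Tolerant Execution is a cluster setting (retry-policy=TASK in the Trino configuration) rather than part of the query text:

```sql
-- Self-serve batch workload: publish a curated data product back to the lake.
-- With fault-tolerant execution enabled on the cluster, a long-running CTAS
-- like this one survives individual task failures instead of restarting.
CREATE TABLE iceberg.clean.daily_revenue AS
SELECT order_date, region, sum(order_total) AS revenue
FROM iceberg.raw_transformed.orders
GROUP BY order_date, region;
```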

Data products need data owners

Lastly, tie your data products and datasets to product owners. A product owner is similar to a data steward: someone responsible for the trustworthiness of the data who can quickly ensure the datasets are curated properly for query usage in your organization. Users in your organization can then message the right person to clear up any non-self-describing data or questions about how a dataset should be used.

Let users rate, comment on, and find frequently queried data products and data sets. Data product owners can then better improve the datasets tied to their domains. With this intelligence into how data is being used within your organization, cross-department initiatives and key metrics can be better met and served by your analysts.

Data sharing culture done right enables greater profit margins

A data-sharing environment creates new revenue streams, improves data-backed decisions, and generates better cross-enterprise collaboration and a more unified team, all of which tie back to your business KPIs and initiatives.