As more data is produced than ever before, organizations strive to capture every possible customer insight. In a perfect world, gathering all this data would automatically lead to smarter decision making. In reality, making sense of this flood of source data to extract business insights has never been more challenging.
Ask any data engineer about their experience building a new data pipeline and they will probably share a tale of enduring a laundry list of prerequisite phone calls with multiple teams just to figure out who owns the source data in question. It’s no secret that companies struggle to organize the vast amounts of data they collect across different systems, let alone identify which data actually adds value.
It’s all too typical that a business request for new insights, such as a new dashboard or a new calculation, is expected to be a small change. In reality, the data owners themselves are rarely confident that the data even exists, let alone where it’s stored. Data practitioners are left to close the gap: translating qualitative expectations into quantitative analysis and playing detective to track down the data owners who can grant them access.
All this effort consumes precious time and erodes valuable trust, as the end user’s frustration and doubt grow with every slip in the timeline. Once access is finally granted, some unfortunate individual will spend days understanding the data before they can even start writing queries to generate the requested insights. In the end, writing and operationalizing the desired SQL queries is the easiest part of the entire process.
In a data-native organization, where data is generated, stored, and cleaned at an ever-increasing rate, it is essential that these data silos are broken down and that knowledge is easily and readily attainable for all. Data engineers, data producers, and data consumers must collaborate to understand the entire data lifecycle. Increasing this transparency and empowering data consumers benefits all parties by eliminating unattainable asks and improving how requests are prioritized.
Learning about these challenges from our customers, Starburst is developing an experience where data engineers, producers, and consumers can discover, understand, and collaborate on data within Starburst Galaxy. Instead of data context being limited to a select few, organizations can now use metrics to enrich the context surrounding each data source and ultimately make data discovery easier.
As part of the first release in this journey, Starburst is adding new capabilities:
- Catalog explorer helps users find and understand data: the who, what, when, where, and why behind it. Users can easily search their data, see where it lives, and trace where it came from. Metadata is automatically populated from query history, providing valuable insight into how data is used.
- Schema discovery takes this a step further by enabling the discovery of not just existing datasets across sources and clouds but also net-new datasets, whatever form they’re in. For example, run schema discovery against a location in your data lake to generate a table from raw JSON source files without having to guess each column’s data type.
- Granular Access Control enables data administrators to see and understand who has access to what data and how it’s being used, in the context of the data itself. Administrators can monitor and change permissions through policy as code, ensuring security and reducing risk within a continuous integration/continuous deployment (CI/CD) pipeline.
With these capabilities, organizations can democratize data usage while still maintaining effective guardrails. Data engineers gain increased data literacy, and data consumers can discover data alongside metadata automatically populated from query metrics, leading to a deeper understanding of the data available. Cross-functional teams can elevate the conversation around data to understand why and how queries are being written across the organization.
Theory to Practice
In today’s modern organization, data is generated and cleaned across different teams using a variety of tools. dbt has become one of the most popular solutions for building data models because it applies software engineering principles to data analytics. Using a dbt project to test these new capabilities in Starburst Galaxy, you can see for yourself how each one can make an impact on daily data discovery.
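As a sketch of what such a project contains (the model, source, and column names below are hypothetical), a dbt model is simply a SELECT statement with a bit of Jinja configuration, which the dbt-trino adapter materializes in Starburst Galaxy:

```sql
-- models/staging/stg_orders.sql (hypothetical model; names are illustrative)
-- Materialize the model as a physical table rather than a view.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    cast(order_ts as timestamp) as ordered_at,
    total_amount
from {{ source('raw', 'orders') }}  -- assumed raw source defined in sources.yml
where order_id is not null
```

Running `dbt run` compiles the Jinja and executes the resulting SQL against your configured catalog and schema.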
Schema Discovery
Data lakes are difficult to navigate if you aren’t familiar with another team’s organizational structure. To quickly confirm the object storage location in question before building your model, use schema discovery to generate the SQL needed to create a table from the desired location. It’s also a useful way to identify columns and data types, instead of scouring internal documentation that may not reflect the latest version of the table.
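For example, pointing schema discovery at a JSON dataset in object storage might yield a Trino-style `CREATE TABLE` statement along these lines (the catalog, location, and inferred columns here are purely illustrative):

```sql
-- Illustrative output only: catalog, schema, location, and columns are hypothetical.
CREATE TABLE lake.raw.clickstream (
    event_id   varchar,
    user_id    bigint,
    event_ts   timestamp(3),
    properties map(varchar, varchar)
)
WITH (
    format = 'JSON',
    external_location = 's3://example-bucket/clickstream/'
);
```

Instead of hand-writing this DDL and guessing each type, you review and run the generated statement.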
Catalog Explorer
When you run your dbt model in your project, the results populate in Starburst Galaxy. But once a new model is created, how can the data producer tell who is using it? This information is critical for making relevant data available to data consumers across the organization.
With Catalog explorer, you can easily see the new models in Starburst Galaxy and understand whether anyone in the organization is using them. As models run, information becomes available about who ran the model, which query was executed, and when it was initiated, all captured in the context of the schema and tables. Raw and curated datasets appear in one place, giving a complete picture of the data and helping data consumers drive insights without sacrificing discoverability.
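To get a feel for the raw signals behind this kind of metadata, note that Trino clusters expose recent query activity through the `system.runtime.queries` table; a query such as the following lists who ran what and when (column availability may vary by deployment, and Catalog explorer surfaces this context automatically rather than requiring you to query it by hand):

```sql
-- Recent query activity: who ran each query, its text, start time, and state.
-- Availability and columns may vary by Trino/Galaxy deployment.
SELECT user, query, created, state
FROM system.runtime.queries
ORDER BY created DESC
LIMIT 20;
```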
Granular Access Control
Administrators can view who has access to the new model and grant or change access based on the model’s maturity and whether it is ready to be shared across the organization. They can also audit any permission changes in the context of the data to verify quality and privileges.
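Under a policy-as-code workflow, the desired permissions can be captured as version-controlled SQL and applied by a CI/CD pipeline. A minimal sketch using standard SQL grants (the role, catalog, and table names are hypothetical):

```sql
-- Open the matured model to analysts, and withdraw the broader
-- staging role's read access now that the model is curated.
GRANT SELECT ON lake.curated.stg_orders TO ROLE analyst;
REVOKE SELECT ON lake.curated.stg_orders FROM ROLE data_staging;
```

Because these statements live in source control, every permission change leaves a reviewable history alongside the model code itself.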
If you are interested in trying these features for yourself, sign up for Starburst Galaxy. Stay tuned as we continue to improve the data discovery process. Knowledge is power!
Get Started with Starburst Galaxy today
Sign up for Starburst Galaxy and get up to 500 credits free.
What are some next steps you can take?
Below are a few ways you can continue your journey to accelerate data access at your company:
1. Automate the Icehouse: Our fully-managed open lakehouse platform
2. Follow us on YouTube, LinkedIn, and X (Twitter).