In most enterprises, data is scattered across organizational and technological boundaries. This fragmented infrastructure, combined with data’s expanding volume and accelerating velocity, makes traditional data analysis techniques less effective.
Data discovery offers tremendous value in the age of big data by unlocking the enterprise’s various data sources to generate transformational business insights.
We will explain data discovery’s place in the broader data analysis context and describe traditional approaches as well as Starburst’s modern, streamlined take on data discovery.
What is the difference between data discovery and data analysis?
Data discovery is a subset of data analysis that addresses the modern challenges of big data and data sprawl. Here’s how the two differ:
By definition: Data discovery vs. data analysis
Data discovery involves the analysis of large data sets gathered from disparate sources across the enterprise. Specifically, it addresses the need for consolidation and preparation to make data accessible to data scientists, data engineers, and general business users.
By contrast, data analysis spans a much broader range of use cases, from the simple spreadsheets created at the workgroup level to department-level dashboards to advanced predictive analytics.
Purpose: Data discovery vs. data analysis
The primary objective of data discovery is to fulfill the promise of big data in data-driven business cultures. By exploring large, heterogeneous datasets for previously unrecognized patterns, business users can gain insights into enterprise performance that lead to better decision-making.
On the other hand, data analysis has as many purposes as use cases, from refining staff schedules at a retail store to predicting online consumers’ shopping preferences.
Activities: Data discovery vs. data analysis
Data discovery requires collaboration between data engineers, business intelligence analysts, and decision-makers to plan a new project. This collaboration helps prioritize the technical implementation, refine the data set, and optimize the project’s output.
Data analysis can be as simple as a manager working on a spreadsheet or as complex as a team of data scientists developing artificial intelligence systems. Since most business users lack the technical skills of data scientists and engineers, most of a company’s data analysis activities skew towards the simpler end of the scale.
Outcomes: Data discovery vs. data analysis
With data arriving faster in greater quantities from many different sources, enterprises struggle to generate data insights from big data. The outcome of data discovery should be actionable insights that drive significant improvements in business processes.
The outcome of most day-to-day analysis is support for routine business activities and the creation of steady, incremental improvements in business processes.
Step-by-step guide: What are the steps of data discovery?
Every company develops a unique data discovery process reflecting its culture, organization, and data infrastructure. Generally, data discovery includes the following seven steps.
1. Define project objectives and scope
The first step in any data discovery project is to clearly define what the project’s business users want to achieve, the kind of data they need to reach that goal, and the resources available to make it happen.
However, few business users have the technical expertise or understanding of the company’s information architecture to conduct data discovery on their own. A collaborative engagement between the data team and the project’s business stakeholders is essential.
2. Identify relevant data sets
With a clear objective, the data team searches the enterprise for relevant data. This process may not be straightforward. Large, long-established companies with a history of acquisitions invariably have complex information infrastructures. Even fast-growing companies accrete data storage solutions as they evolve from the startup stage.
As a result, companies have data sources scattered across business units, geographies, on-premises systems, and cloud platforms.
This storage infrastructure will use a heterogeneous mix of databases, warehouses, lakes, and other technologies that store various types of data, from structured to unstructured.
Depending on how well the company governs its infrastructure, data teams can spend considerable time hunting down data sources, understanding their structures, and determining how the data relates to the project.
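For relational sources, one quick way to start this inventory is to query the standard information_schema views, which most relational databases expose. Here is a minimal sketch, assuming a SQLAlchemy connection to a hypothetical legacy PostgreSQL source:

```python
import sqlalchemy as sa

# Hypothetical connection string; substitute the source under investigation.
engine = sa.create_engine("postgresql://user:pass@legacy-erp:5432/erp")

with engine.connect() as conn:
    # information_schema is part of the SQL standard, so this inventory
    # query works largely unchanged across most relational databases.
    tables = conn.execute(sa.text(
        "SELECT table_schema, table_name FROM information_schema.tables "
        "WHERE table_type = 'BASE TABLE'"
    ))
    for schema, table in tables:
        print(f"{schema}.{table}")
```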
3. Plan resource needs
Next, the data team must plan how to bring all that data together within the project’s time and budget constraints. This plan will address questions such as:
- Can we use existing pipelines?
- How many new pipelines must we develop?
- What compute resources do we need to prepare the incoming data?
- What storage resources do we need for the consolidated data set?
- What systems will our users need to analyze the data?
- How will we apply and enforce the company’s security and data governance policies?
With a plan in place, the data team is ready to pull everything together.
4. Cleanse and prepare the data
When ingesting data from multiple sources with different data types, schemas, and formats, inconsistency is a fact of life. Data teams must cleanse the data and make it accessible to the project’s business users.
Data preparation requires extract, transform, and load (ETL) or extract, load, and transform (ELT) pipelines that copy data from the source to the project’s data warehouse. Since existing pipelines are rarely fit for purpose, the project’s data engineers need time to develop and test new pipelines.
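The ETL pattern itself is simple, even when production pipelines are not. Below is a minimal pandas-based sketch of an ETL job; the connection strings, table names, and cleansing rules are hypothetical placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connections; substitute your own source and project warehouse.
source = create_engine("postgresql://user:pass@crm-db:5432/crm")
warehouse = create_engine("postgresql://user:pass@warehouse:5432/analytics")

# Extract: pull the raw records from the operational source.
orders = pd.read_sql("SELECT * FROM orders", source)

# Transform: normalize inconsistent types and formats before loading.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")
orders = orders.dropna(subset=["order_date", "amount"]).drop_duplicates("order_id")

# Load: write the cleansed data set into the project warehouse.
orders.to_sql("orders_clean", warehouse, if_exists="replace", index=False)
```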
5. Explore the data
This stage of data discovery has two objectives: reveal initial insights and understand what you don’t understand.
Data exploration relies on charts, graphs, and other data visualization tools that are accessible to business users without a lot of analytical expertise. Visual analysis helps users spot trends, patterns, and outliers within the data that inspire more nuanced analysis.
Data that directly answers a project’s questions is always valuable, but data that raises new questions can lead to more significant data insights.
While the previous three steps fell solely in the data engineers’ technical domain, data exploration is an interactive process involving the project’s data team and business users.
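A few lines of pandas and matplotlib are often enough to begin this kind of visual exploration. The sketch below assumes the hypothetical cleansed data set from step 4:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cleansed data set produced in step 4.
orders = pd.read_csv("orders_clean.csv", parse_dates=["order_date"])

# Trend: monthly revenue often surfaces seasonality at a glance.
monthly = orders.groupby(orders["order_date"].dt.to_period("M"))["amount"].sum()
monthly.plot(kind="line", title="Monthly order revenue")

# Outliers: a box plot flags unusual values worth a closer look.
orders.boxplot(column="amount", by="region")
plt.show()
```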
6. Apply advanced analytics
The technical limitations of working with big data typically force the exploration stage to use only a fraction of the overall data set. The learnings from that stage then guide the data team as it applies advanced data discovery tools to the full data set.
Often called smart data discovery, this step applies automation, advanced analytics, machine learning algorithms, and other techniques to process massive data sets to meet the project’s objectives.
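As one illustration of the idea, an unsupervised model such as scikit-learn’s IsolationForest can scan an entire data set for anomalies without hand-written rules; the feature names and file path here are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical full data set; in practice this comes from the project warehouse.
orders = pd.read_parquet("orders_clean.parquet")
features = orders[["amount", "items", "discount"]]

# Unsupervised anomaly detection: the model learns what "normal" records
# look like and scores every row, with no hand-written rules required.
model = IsolationForest(contamination=0.01, random_state=42)
orders["anomaly"] = model.fit_predict(features)  # -1 marks outliers

# Surface the flagged records for business users to review.
print(orders[orders["anomaly"] == -1].head())
```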
7. Document and iterate
Data discovery is not a one-time event, even if an individual project is. It is a formal process that improves through directed practice. The data team should document each project, the issues they resolved, and how they performed. Continuous iteration improves data discovery practices and strengthens the company’s data-driven culture.
Streamline data discovery at the source, with Starburst
Data discovery’s reliance on the data team limits how widely companies can adopt it. Business users can’t get the answers they need without data engineers’ help to plan, ingest, prepare, and analyze data. But data engineers are in short supply, and data teams can’t help everyone.
Starburst streamlines data discovery by democratizing access to a company’s disparate data sources. Rather than feeding project-specific data warehouses through customized ETL pipelines, Starburst accesses data sources directly.
Starburst’s single point of access, combined with schema and catalog discovery tools, lets users access data with the SQL tools they already know. As a result, business intelligence analysts can perform data discovery without the data team’s help.
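For example, with the open-source trino Python client, a single connection and one SQL statement can join tables that live in entirely different systems; the host, catalog, and table names below are hypothetical:

```python
import trino

# One connection to the Starburst/Trino endpoint reaches every configured
# catalog; host, catalogs, and table names here are hypothetical.
conn = trino.dbapi.connect(host="starburst.example.com", port=8080, user="analyst")
cur = conn.cursor()

# A single SQL statement joins data from two different systems:
# a Hive-backed data lake and an operational PostgreSQL database.
cur.execute("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM hive.sales.orders AS o
    JOIN postgresql.crm.customers AS c ON o.customer_id = c.id
    GROUP BY c.region
""")
for row in cur.fetchall():
    print(row)
```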
Starburst data discovery for data engineers
As business users switch to the self-service model Starburst enables, data engineers have more time to support complex discovery projects that could generate transformational insights.
Starburst’s unified view of the company’s data sources also streamlines these advanced discovery projects. Data engineers can run interactive and batch queries in real time instead of developing, testing, and iterating on multiple data pipelines.
Starburst data discovery for data scientists
Data scientists can use Starburst’s massively parallel query engine to explore data interactively. Rather than waiting for engineers to tweak ETL pipelines, scientists can run multiple scenarios to explore the company’s data surface area.
Once scientists have thoroughly tested their models, they can bring in the data team to start ETL development for machine learning and artificial intelligence applications.