Starburst Data Products on AWS Accelerate Time to Insight Using Core Software Engineering Principles

  • Vishal Singh, Head of Data Products, Starburst
  • Antony Prasad Thevaraj, Senior Partner Solutions Architect, AWS

Organizations are generating increasingly large volumes of data, often dispersed across different locations and storage systems. Even with extensive data lakes in place, enterprises still rely on traditional warehouses and storage systems for important data.

To extract valuable insights that drive informed business decisions, data scientists and business intelligence analysts require access to all of these datasets. The impact of increased data access is significant.

Starburst offers real-time query access to data from multiple sources, challenging the outdated single-source-of-truth model. Starburst Galaxy caters to the modern enterprise’s reality, providing fast and efficient access to distributed datasets without the need for extract, transform, load (ETL) processes.

Starburst, an AWS Specialization Partner and AWS Marketplace Seller with the Data and Analytics Competency, is seeing a growing number of global organizations use its data products with Amazon Web Services (AWS) analytics services and data lakes on AWS.

Among Starburst’s suite of solutions, Starburst Data Products accelerates time to insight, improves visibility, and enhances efficiency within organizations adopting or considering a data lake analytics platform. This post delves into the appeal of data products, their deployment, and the various benefits experienced by different organizations today.

Disparate data sources: batch data, streaming data, files

Taking a broad view of the data landscape in large enterprises, it becomes apparent that data resides in diverse systems, scattered across multiple regions and geographies, each with unique data governance requirements.

The traditional approach of centralizing data into a single source of truth, even within a modern data lake, can hinder timely insights and create frustrating bottlenecks.

Figure 1 – Starburst Galaxy architecture.

This approach doesn’t align with the principles of a data-driven organization. Data lake architecture has emerged as a more attractive alternative, as it acknowledges the reality of data gravity and shifts the responsibility of curating, maintaining, and preparing data for enterprise-wide use from centralized data infrastructure teams to subject-matter experts (SMEs) who possess the best understanding of the data.

These curated datasets, known as data products, encourage SMEs to apply product thinking, ensuring reliability, easy consumption, and easy sharing. A data product can range from a simple list of transactions to a complex group of datasets.

While adopting a data lake analytics platform requires a shift in architectural thinking, one of its core principles is to simplify the lives of data engineers, producers, and consumers. This post shares ways to implement that architecture with minimal effort using Starburst Data Products, a solution for creating, maintaining, and sharing data products within an organization.

Starburst customers currently utilizing Amazon Simple Storage Service (Amazon S3) or data lakes on AWS can leverage Starburst products at no additional cost.

Benefits of Starburst & AWS Data Products

The following features, among others, simplify the creation of data products and their internal sharing across teams and business units.

  • Simplicity: Starburst Data Products expose APIs that make it easy to build CI/CD pipelines for creating and consuming data products. Data product definitions can also be managed in a Git repository as JSON.
  • Federated access: Starburst delivers federated access to multiple data sources through 50+ high-performance connectors, which eliminates the need for data products to originate from a single source. The Stargate feature enables connections to Starburst deployments in different clouds or regions, allowing queries across more data sources while remaining compliant.
  • Domain governance: Starburst ensures fine-grained access control, including row-level filtering and column-level masking. This enables different users within an organization to have distinct views of the same data product based on their roles, ensuring consistent governance from the source to the data product level.
  • Visibility: Starburst Data Products offer decentralized management with centralized visibility, providing simple, direct, and searchable access for data consumers.
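
The row-level filtering and column-level masking described under domain governance can be pictured with a small, engine-independent sketch. This is not Starburst's API or policy syntax: the `Policy` class, the role names, and the sample `ORDERS` rows are all hypothetical, and plain Python stands in for the query engine.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-role policy: visible rows and masked columns."""
    row_filter: Callable[[Row], bool]
    masked_columns: frozenset[str]


def apply_policy(rows: list[Row], policy: Policy) -> list[Row]:
    """Return a role-specific view of the rows without modifying the source."""
    visible = [row for row in rows if policy.row_filter(row)]
    return [
        {col: ("***" if col in policy.masked_columns else val)
         for col, val in row.items()}
        for row in visible
    ]


# Illustrative data only; not taken from any real dataset.
ORDERS: list[Row] = [
    {"order_id": 1, "region": "EU", "email": "a@example.com", "amount": 120},
    {"order_id": 2, "region": "US", "email": "b@example.com", "amount": 75},
]

# Hypothetical roles: an EU analyst sees only EU rows with emails masked,
# while an admin sees everything unmasked.
POLICIES = {
    "eu_analyst": Policy(lambda r: r["region"] == "EU", frozenset({"email"})),
    "admin": Policy(lambda r: True, frozenset()),
}


def view_for(role: str) -> list[Row]:
    """Resolve the same data product differently depending on the caller's role."""
    return apply_policy(ORDERS, POLICIES[role])
```

Calling `view_for("eu_analyst")` and `view_for("admin")` returns two different views of the same underlying rows, which is the behavior the bullet describes. In a real deployment the policy would be enforced by the platform's access control layer rather than by application code.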

Each team can consume the data according to its access level. Data engineers, in particular, can combine datasets from various sources and use Starburst’s CI/CD pipeline to publish the same data product to different staging or production environments.
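
As a rough illustration of keeping a data product definition in Git as JSON and publishing it to several environments, a CI step could render one environment-specific document per target from a shared base. The field names (`name`, `owner`, `catalog`, `schema`, `views`), the environment names, and the `render` helper are assumptions made for this sketch and do not reflect Starburst's actual schema. The publish call itself is left out because it depends on the real API.

```python
import json

# Hypothetical required fields for a data product definition.
REQUIRED_FIELDS = {"name", "owner", "catalog", "schema", "views"}

# Shared part of the definition, as it might be stored in a Git repository.
BASE_DEFINITION = {
    "name": "customer_orders",
    "owner": "sales-domain-team",
    "summary": "Orders joined with customer attributes",
    "views": [
        {
            "name": "orders_enriched",
            "query": (
                "SELECT o.order_id, o.amount, c.segment "
                "FROM orders o JOIN customers c "
                "ON o.customer_id = c.customer_id"
            ),
        }
    ],
}

# Hypothetical per-environment overrides applied by the pipeline.
ENVIRONMENTS = {
    "staging": {"catalog": "lake_staging", "schema": "sales"},
    "production": {"catalog": "lake_prod", "schema": "sales"},
}


def render(environment: str) -> str:
    """Merge the base definition with one environment's overrides and validate it."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    definition = {**BASE_DEFINITION, **ENVIRONMENTS[environment]}
    missing = REQUIRED_FIELDS - definition.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Stable key order keeps Git diffs small and reviewable.
    return json.dumps(definition, indent=2, sort_keys=True)
```

A pipeline could call `render("staging")` on every merge and `render("production")` on release, then hand the resulting JSON to whatever publish step the platform provides. Failing fast on missing fields keeps incomplete definitions out of either environment.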


Figure 2 – Starburst Data Products.

By giving data producers and SMEs more responsibility over the data they understand best, Starburst Data Products enable consumers to extract insights from distributed datasets with ease. Data products come with searchable details and metadata, facilitating data consumers’ access to basic usage metrics, business context, bookmarks, sample queries, and more.

The interface for creating and discovering data products remains consistent for both producers and consumers, streamlining workflows and reducing pressure on data engineers without requiring consumers to learn new skills or programming languages.

The long-term value of data products lies in their dynamic nature. Because they are defined at the query level, any change to a data product is reflected across all tools as soon as it is published.

The curated datasets that make up existing data products can be combined with other data products or with raw data. As a result, data consumers keep reusing existing data products or finding new ways to use them, rather than rebuilding the same datasets.
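
One way to picture a dataset that is defined at the query level and then reused is as a view layered on another view. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the query engine; it is not Starburst, and the table names, view names, and rows are invented for illustration. It demonstrates only that a query-level definition picks up new source data without being rebuilt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, segment TEXT);
    INSERT INTO orders VALUES (1, 10, 120.0), (2, 11, 75.0);
    INSERT INTO customers VALUES (10, 'enterprise'), (11, 'smb');

    -- First curated dataset: defined as a query, not as a copy of the data.
    CREATE VIEW orders_enriched AS
    SELECT o.order_id, o.amount, c.segment
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id;

    -- Second curated dataset: reuses the first instead of rebuilding it.
    CREATE VIEW revenue_by_segment AS
    SELECT segment, SUM(amount) AS revenue
    FROM orders_enriched
    GROUP BY segment;
    """
)


def revenue() -> dict[str, float]:
    """Query the derived dataset; the views are resolved at query time."""
    return dict(conn.execute("SELECT segment, revenue FROM revenue_by_segment"))


before = revenue()
# New source data arrives; neither view is refreshed or recreated.
conn.execute("INSERT INTO orders VALUES (3, 10, 30.0)")
after = revenue()
```

Here `before` holds the totals for the two original orders, and `after` includes the new order in the enterprise segment, even though no dataset was copied or rebuilt in between.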

AWS infrastructure elasticity

By utilizing Starburst’s Stargate connector, datasets can be easily moved from one region to another, closer to the center of data gravity. The cached dataset can then be consumed using AWS infrastructure and elasticity, effectively avoiding high compute costs. We recommend watching this webinar and demo, which discuss scenarios involving connections to AWS clusters and datasets in other clouds and regions.

The ease of creating data products from distributed enterprise data is a significant advantage. This feature is available at no additional cost within Starburst, making it essential for Starburst customers with datasets in AWS to explore and leverage this exciting capability today.

Take a moment to explore Starburst on AWS through AWS Marketplace.

Originally posted on aws.amazon.com
