Offload your cloud data warehouse workloads

Strategy
  • Tom Nats

    Director of Customer Solutions

    Starburst


Snowflake is a terrific cloud data warehouse (CDW). It was the first to provide an easy-to-use, autoscaling, high-performance analytical platform in the cloud. Companies flocked from their on-prem warehouse appliances to the cloud, and they haven’t looked back since.

As companies have grown more comfortable serving their applications and analytics out of the cloud, challenges have emerged around the cost and flexibility of the cloud data warehouse model. When all of your data sits in a single cloud warehouse, you are at the mercy of that vendor, unable to take advantage of new technologies or control your costs.

Starburst Galaxy is an analytical platform that can:

  1. Extract and process data on your data lake
  2. Provide the fastest, highest-concurrency query engine on your data lake
  3. Federate data from many different sources in real time using one of the many available connectors

All of this is built on an open data lake architecture. Your data isn’t held hostage: it lives in your account, and you can pick and choose the engine that suits your needs.

Landing, processing, and serving up this data from a single storage location provides numerous benefits, such as:

  • Total ownership of your data
  • A smaller security “surface area”: every additional copy of your data is one more thing to secure
  • Freedom to choose the engine that fits your use cases and needs
  • No vendor or storage lock-in

In this blog post, I will discuss the two ways Starburst Galaxy can augment or offload your CDW. 

First, we’ll cover how Starburst Galaxy augments your current CDW.

Augmenting your cloud data warehouse (federation)

Companies often stage their data in cloud object storage before copying it to a cloud data warehouse. Not all of this data typically makes it into the CDW. Additionally, there are usually other data sources that contain data that users would like to join with the data in their CDW. 

Starburst Galaxy offers a wide range of connectors to relational and non-relational data sources such as PostgreSQL, MongoDB, and Elasticsearch. There are also connectors to cloud data warehouses such as Snowflake, Redshift, BigQuery, and Synapse.
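Each configured connector surfaces as a catalog that you can query with standard SQL. As a quick sketch (the catalog, schema, and table names here are hypothetical):

-- List the catalogs configured in your cluster
SHOW CATALOGS;

-- Query any source directly through its catalog
SELECT * FROM mongodb.app.events LIMIT 10;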

Data is joined in real time across these systems using our cost-based query optimizer, so you can use standard SQL across different systems.

Example (catalog, schema, and table names are illustrative):

SELECT
    c.customer_region,
    p.product_type,
    SUM(s.total_sales) AS total_sales
FROM snowflake.crm.customers c
JOIN s3.sales.transactions s ON c.customer_id = s.customer_id
JOIN sqlserver.dbo.products p ON p.product_id = s.product_id
GROUP BY
    c.customer_region,
    p.product_type;

This allows data from object storage, relational, and non-relational sources to be joined with the data in your CDW. Performing real-time analysis across data stores, without needing to copy everything into the CDW, saves time and money and delivers faster insights.

Offloading workloads from your expensive cloud data warehouse

Two of the most common things we hear from companies that went “all-in” on a CDW for their organization are:

  1. The cost has risen to out-of-control levels and we must do something about it
  2. We feel like our data is locked in and we are unable to take advantage of the many existing and new technologies

With Starburst Galaxy, you can land your data in any cloud object store, process it through the traditional layers, and serve that data up to a variety of end users with a variety of use cases. The best part: the data is stored in open formats such as Parquet and ORC, and it lives in YOUR cloud account, giving you the ultimate flexibility to use any engine you want for analytics on your data.
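As a minimal sketch of what offloading can look like, the statement below copies a table out of a CDW catalog into an open-format table on your own object storage. It assumes an object storage catalog named lake (configured for Iceberg tables) and a Snowflake catalog named snowflake; all names are hypothetical:

-- "lake" is an object storage catalog in your account; "snowflake" is the CDW catalog
CREATE TABLE lake.sales.orders
WITH (format = 'PARQUET')
AS
SELECT * FROM snowflake.sales.orders;

From that point on, the files sit in your account rather than inside the warehouse, so any compatible engine can read the same table.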

One of the biggest points of hesitation we continue to hear from companies that love the idea of an open data lake is: “how do I create tables on top of my files in cloud storage?” It’s a fair question, and the concern is mostly a leftover from the Hadoop days, when Hive tables were usually large, monolithic structures and people were taught that “joins are bad.”

Let’s take the industry-standard TPC-H benchmark as an example. Its table diagram is just a traditional database ERD. On the data lake, you create these tables and insert, update, merge, and even delete data in them just like in a normal database.
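Here is a minimal sketch of what that looks like against an object storage catalog. The catalog and schema names ("lake.tpch") and the trimmed-down column list are illustrative, loosely modeled on the TPC-H orders table:

-- Illustrative DDL and DML; names and columns are hypothetical
CREATE TABLE lake.tpch.orders (
    o_orderkey   BIGINT,
    o_custkey    BIGINT,
    o_orderdate  DATE,
    o_totalprice DOUBLE
);

INSERT INTO lake.tpch.orders
VALUES (1, 100, DATE '1995-03-10', 172799.49);

UPDATE lake.tpch.orders
SET o_totalprice = 173000.00
WHERE o_orderkey = 1;

DELETE FROM lake.tpch.orders
WHERE o_orderdate < DATE '1993-01-01';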

Additionally, joining tables is once again fully recommended, just like in a regular database:

SELECT
    l_orderkey,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate,
    o_shippriority
FROM
    customer, orders, lineitem
WHERE
    c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < DATE '1995-03-15'
    AND l_shipdate > DATE '1995-03-15'
GROUP BY
    l_orderkey, o_orderdate, o_shippriority
ORDER BY
    revenue DESC, o_orderdate
LIMIT 10;

The best part: Starburst Galaxy is built on the open source Trino engine, originally developed at Facebook, which handles thousands of concurrent users across any BI tool and is used at some of the top companies in the world.

Here is a handy feature matrix showing that building your open data lake on your own cloud storage with Starburst Galaxy provides not only the same benefits as a CDW, but is also completely open, so you can plug in other engines if and when needed (see this blog for more information on a multi-engine data lake):

Feature                               Open Data Lake    CDW
Creating SQL tables                   Yes               Yes
Updating data                         Yes               Yes
Deleting data                         Yes               Yes
High performance queries and joins    Yes               Yes
Data sharing                          Yes               Yes
Multi-engine                          Yes               No

Now is the time to turn your data swamp into an open, well-structured, high-performing data lake that can serve ALL of your analytical use cases out of a single storage platform in your account. Sounds too good to be true, doesn’t it?

If you have any questions, please feel free to reach out to us. We have also launched Starburst Academy, with many free courses including Data Foundations, a self-paced, hands-on learning course that covers data lakes extensively.


Try Starburst Galaxy

Sign up, receive $500 in credits, and don’t forget to use the free cluster to run through our awesome tutorials.

Start today