Last Updated: 2024-01-22
Starburst Galaxy allows you to easily query a variety of data sources. It can also be used to join data from multiple data sources through a single point of access.
It is especially useful when querying data lakes or joining data lakes to data warehouses as a way of enhancing efficiency and reducing cost by shifting data from traditional data warehouses towards inexpensive cloud object storage.
This tutorial will get you started using Starburst Galaxy and help demonstrate the federation of data between a data lake and data warehouse using sample data.
After creating a free Starburst Galaxy account, you can jump right in and start using Starburst Galaxy.
Once you've completed this tutorial, you will be able to:
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
It's time to set up your Starburst Galaxy account. This is designed to be quick and easy - just a couple of minutes.
Once you're set up, you can query sample data or add your own data sources, including data lakes, or data warehouses.
Excited? Let's get started.
Starburst Galaxy is accessed through a web browser, although it can also be used in other ways, including a command line interface. This tutorial will focus on using the web UI, as this is the way that most users interact with the system.
To get started, click the URL below to begin signing up.
Starburst Galaxy accounts are registered to a particular name and email address. To set up your account, you will need to provide the following information.
A confirmation code will be sent to your email address. You will need to copy it from your email into your Starburst Galaxy browser window.
Now it's time to paste the confirmation code into Starburst Galaxy.
Starburst Galaxy accounts each include a domain name. Typically this is either the name of the user, or the department or business function controlling the domain. In either case, it should be meaningful to you and something that you will not forget.
Once selected, the domain will become part of the URL used when you sign in to Starburst Galaxy.
Now it's time to create a password. This will help protect access to your account. You should never share this password with anyone.
If you need to share access to your account with others, there are more secure ways to do this from within Starburst Galaxy. Future tutorials will demonstrate this process.
Starburst Galaxy is used by many different types of people in different ways. Knowing the ways that you'll be using it will help us meet your needs better.
Before proceeding, tell us a bit about how you are going to use Starburst Galaxy.
Starburst Galaxy's main task is to make running queries easy and intuitive, no matter where your data resides. To help accomplish this, it hits the ground running by providing a simple dataset and a few sample queries to get you started.
You're going to begin by using these sample datasets to execute your first queries.
The Query editor is where you will write and execute SQL.
customer
table from burstbank
sample dataThe SQL query that you need is already in the Query editor. This helps you hit the ground running and test your cluster quickly.
SELECT custkey, last_name, country FROM sample.burstbank.customer LIMIT 10;
custkey
, last_name
, and country
columns in the customer
table located in the burstbank
schema and sample
catalog. Now it's time to view the results of your query. Starburst Galaxy displays the results of your query at the bottom of the page.
Now you're going to run another query using sample data. The second query included with Starburst Galaxy joins data within a single catalog of sample data.
c.custkey
, c .last_name
, c.country
, and a.cc_status
from the sample.burstbank.customer
and sample.burstbank.account
tables. The results of the second query are displayed below the query editor. This query joins data from tables within the same catalog, but Starburst Galaxy makes it easy to join tables from different data sources using data federation.
The next part of this tutorial will delve into that federation use case, unlocking the power of the open data stack.
Now it's time to shift gears. To showcase the true power of Starburst Galaxy, you're going to use data federation to join two different data sources.
Many organizations store data in multiple locations for a variety of reasons. It's not uncommon for data to be scattered across data lakes, data warehouses, and databases, sometimes both in the cloud and on-prem.
For this scenario, imagine that you work as an imaginary bank that uses Starburst Galaxy, Burst Bank. This bank has data in many different locations, and it's going to be your job to join two of them.
The scenario below outlines the scenario in more detail.
Now it's time to use Starburst Galaxy to assist Burst Bank by joining all their datasets together to produce the report they need.
In this exercise, the data warehouse and data lake already exist and the necessary connection information is provided for you in this tutorial.
To do this, you will:
Burst Bank holds payment history information in their data lake. In this section you will use the Starburst Galaxy Data Lakehouse connector to access this data.
Adding a new data source to Starburst Galaxy requires creating a new catalog for that data source.
Data lake object storage is the most popular option for Starburst Galaxy data sources. For this tutorial, we're going to be using Amazon S3.
Each data source has its own credentials that limit access. You're going to add the credentials needed to access the data lake via Starburst Galaxy. This will create a catalog for that datasource which will keep track of those credentials in the future.
lakehouse_burst_bank
.AKIAYUW62MUVVQ2OP34U
qu7NRVypctO7/86OmBLgHJa64ij3k/mVuuZD2y1U
Now it's time to select the metastore configuration. Starburst Galaxy lets you choose from several options for metastore. You can mix and match metastores and data sources in whatever way makes sense for you. For this tutorial, we're going to select AWS Glue, using the US East North Virginia region.
query-plan-labs-data-external
.burst_bank
. Table formats control the way that data is stored when it enters the data lake. Today you're going to be using a modern, open table format known as Iceberg. Starburst Galaxy is also able to read other table formats like Hive and Delta Lake, but Iceberg is by far the best table format for most users.
Starburst Galaxy allows you to set permissions for every datasource. In this tutorial, we're going to set our access to the data lake to read-only and leave all other default settings in place.
Starburst Galaxy allows you to add the catalog to a cluster at this point. For this tutorial, you're going to skip this stage and continue adding data sources then add them all to a cluster later.
Burst Bank has customer information and customer account information stored in their data warehouse. In this section, you will use the PostgreSQL connector to access this data warehouse.
Connecting a data warehouse follows much the same process as connecting a data lake, both involve creating new catalogs to access the datasource.
Starburst Galaxy allows you to connect to a PostgreSQL data warehouse just like any other data source. PostgreSQL is listed in the Additional data sources section.
Starburst Galaxy allows you to access PostgreSQL data sources on all three major cloud providers: AWS, Google Cloud Platform, and Azure. For this tutorial, you're going to use AWS.
postgresql_burst_bank
.Now it's time to input the credentials that will allow Starburst Galaxy to connect to the PostgreSQL data warehouse.
external-query-plan-postgresql.cq4rq9fclcvt.us-east-1.rds.amazonaws.com
5432
.query_plan
.readonlyuser2
.$i>Uv~T,b*&qQXvN3j-"5GxSIN5v"uP.$ya7
Just like your data lake, for this tutorial you will only need read-only access. Starburst Galaxy lets you configure this as part of the setup.
You don't need to add the PostgreSQL data warehouse to a cluster yet. You'll connect both the data lake and data warehouse to the same cluster in the next section of this tutorial.
All of Burst Bank's data sources are in AWS in the US East-1 (North Virginia) region. With Starburst Galaxy, the data sources and Starburst Galaxy cluster should be in the same cloud region to reduce cloud costs.
In this step, you will create a new cluster in US East-1 (North Virginia).
Begin by creating a new cluster
Now it's time to name the cluster. This name should be something useful and indicative of its purpose and location.
aws-us-east-1-free
.Now you need to add all of the catalogs that you want associated with the new cluster.
In the case of the tutorial, you'll want to select all catalogs.
When you create a new cluster, you need to define certain key details.
The cluster will now begin starting automatically. This takes a few minutes and you can watch the process update live.
Now it's time to return to the query editor. You can do this quickly using a shortcut.
It's always a good idea to confirm that new clusters have been created successfully, and that the correct number of catalogs has been added.
aws-us-east-1-free
has been added. Note that it has already been selected and is running. With Starburst Galaxy configured, you are now ready to produce the report for Burst Bank's risk department. Remember, they need a list of customers that have more than 3 delinquent payments in their history and a FICO score of less than 500.
In this section, you will run a federated SQL statement across a data warehouse and a data lake to create the critical report the risk department needs.
Run the following SQL command to get the required information. Let's break it down to help understand what it's doing.
The query returns several fields from different data sources. On the customer side, you're looking for:
On the auto sales side, you're also returning some data, including:
Some of this data is in your lakehouse_burst_bank
data lake and some of it is in your postgresql_burst_bank
data warehouse. Returning results from two datasets requires a join, linking the customer ID and auto loan ID.
Finally, the delinquency of the account requires some conditional, boolean logic. You need to test whether the conditions for delinquency are true, and then if they are true, return the result if the delinquency is greater than 3.
These results are then aggregated using the GROUP BY
command.
SELECT
c.first_name,
c.last_name,
c.fico,
a.auto_loan_id,
a.auto_loan_open_date,
count(al.delinquent_payment) as num_delinquent_payments
FROM
postgresql_burst_bank.burst_bank_with_stats.customer c
JOIN postgresql_burst_bank.burst_bank_with_stats.account
a on c.custkey = a.custkey
JOIN lakehouse_burst_bank.burst_bank_with_stats.auto_loan_payment
al on al.auto_loan_id = a.auto_loan_id
WHERE
al.delinquent_payment = 'Y'
and c.fico < 500
GROUP BY
a.auto_loan_id,
c.first_name,
c.last_name,
a.auto_loan_open_date,
c.fico
HAVING
count(al.delinquent_payment) > 3;
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've completed this tutorial, you should have a better understanding of just how easy it is to use Starburst Galaxy to connect different sources of data. Do you have data sources that you would like to connect to Starburst Galaxy? Check out our other tutorial offerings to see how easy it is to connect!
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!