Last Updated: 2024-08-14

Background

PyStarburst is a library that brings Python DataFrames to Starburst. It is designed to make building complex data pipelines easy and to allow Python developers to easily migrate their existing PySpark and Snowpark workloads to Starburst.

Scope of tutorial

This tutorial will walk you through setting up PyStarburst on your local machine and running some basic PyStarburst code against a Starburst Galaxy cluster.

Learning objectives

Once you've completed this tutorial, you will be able to:

Install PyStarburst on your local machine
Use PyStarburst to run basic queries against a Starburst Galaxy cluster

Prerequisites

You must have a Starburst Galaxy account set up before you can complete this tutorial. You can find instructions for creating a free account here.

Background

As the name PyStarburst suggests, you need Python installed. You'll also need pip, the Python package manager. These instructions will use brew for installation.

Step 1: Install Python

Open a terminal window.
Run the following command to install the latest version of Python:

brew install python

Run the following command to check that the installation was successful:

python3 –version

Step 2: Set up `pip`

Run the following commands, one at a time to install pip and check the installation:

python3 -m ensurepip
python3 -m pip install –upgrade pip
pip –version

Step 3: Install PyStarburst

Now that you've got pip installed, you can use it to install PyStarburst.

Run the following command to install PyStarburst:

pip install pystarburst

Background

Starburst Galaxy provides a connection file for easy connection to your preferred cluster. All you need to do is download the file from your Starburst Galaxy account and edit the credentials to make it unique to you.

Step 1: Navigate to Partner connect in Starburst Galaxy

Sign in to your Starburst Galaxy account.
Expand the left-hand navigation menu, and select Partner connect.

Step 2: Download PyStarburst connection file

Scroll down to the Drivers & Clients section of the Partner connect page.
Select the PyStarburst tile.
Select the cluster you want to connect to. Make sure that it contains the tpch catalog, as we will be using that catalog throughout this tutorial.
Click the Download connection file button. A file called main.py will be downloaded.

Step 3: Edit the PyStarburst connection file

The file you just downloaded has your user and host information prefilled. You just have to supply the password associated with your Starburst Galaxy account. You'll also add some code to import required libraries and run a simple command to show the tpch.tiny.region table.

Open the main.py file you downloaded.
On line 10, replace with your Starburst Galaxy account password.
Add the following lines after line 2. This will ensure the appropriate functions are imported.

from pystarburst import functions as f
from pystarburst.functions import col

Finally, replace the last line in the file with the following:

session.table("tpch.tiny.region").show()

Save the changes to the file.

Step 4: Test the connection file

It's time to test the boilerplate code in the connection file. You can execute this code right in your terminal window.

Run the following command in the terminal window to execute the code:

python3 main.py

Your results should look similar to the image below.

Step 5: Review Query insights in Starburst Galaxy

You just used the API to run a SELECT statement on the tpch.tiny.region table. In fact, if you take a look at Query insights in your Starburst Galaxy account, you'll see that the query was run.

In Starburst Galaxy, use the left-hand navigation menu to select Query>>Query insights.
You should see the query that was just run via the API at the top of the list.

Background

Thus far you've connected to your Starburst Galaxy cluster and used the PyStarburst API to run a simple SELECT * on sample tpch data. In this section, you'll continue exploring the API library and become familiar with some of the commands available to you. Please note that all of the code you write will go in the main.py file.

Step 1: Select a full table

You already saw how to use the show() function to select a table in the last section of this tutorial. This time you're going to also use the table() function to first grab the tpch.tiny.customer table before using the show() function to display it. Note: if no argument is supplied to the show() function, it defaults to listing ten rows.

Add the following two lines to the end of the main.py file, then save.

custDF = session.table("tpch.tiny.customer")
custDF.show()

Run the code again using the following command:

python3 main.py

Your results should look similar to the image below. The output isn't very pretty because the text gets wrapped in the CLI. We'll solve for this in a moment by limiting the columns that we fetch.

Step 2: Use projection

As promised, we can use the select() method to limit the number of columns returned from the DataFrame. There is a compensatory drop() function that would be better if we wanted to keep most of the columns and only remove a few.

Add the following two lines of code to your Python program, save, and rerun the program.

projectedDF = custDF.select(custDF.name, custDF.acctbal, custDF.nationkey)
projectedDF.show()

The results, similar to the image below, should look much more manageable.

Step 3: Filter the rows

You can use the filter() function to filter the rows returned. In our case, we will use it to return the customer records with the highest account balance values.

Add the following two lines of code to your Python program, save, and rerun the program. Notice that we've added an argument of 100 to the show() function.

filteredDF = projectedDF.filter(projectedDF.acctbal > 9900.0)
filteredDF.show(100)

Your results should look similar to those shown below. Notice that, even though we requested 100 records, there were only 7 that met the criteria.

Step 4: Select a second table

Later, we are going to join the customer records to the nation table to get the name of the country, not just a key value for it. In this step, we are chaining methods together instead of assigning each output to a distinct variable as we have done up until now.

Add the following two lines of code to your Python program, save, and rerun the program.

nationDF = session.table("tpch.tiny.nation") \
                  .drop("regionkey", "comment") \
                  .rename("name", "nation_name") \
                  .rename("nationkey", "n_nationkey")
nationDF.show()

Your results should look similar to those shown below. Note that the rename() function simply changes the column name.

Step 5: Join two tables

Now we can use the join() function to join two DataFrames using their nationkey values.

Add the following two lines of code to your Python program, save, and rerun the program.

joinedDF = filteredDF.join(nationDF, filteredDF.nationkey == nationDF.n_nationkey)
joinedDF.show()

Your results should look similar to those shown below. Notice that the join() function did not remove unwanted columns. We have all columns from both DataFrames.

Step 6: Project joined result

We can use the drop() function again to clean up the unwanted columns in our join.

Add the following two lines of code to your Python program, save, and rerun the program.

projectedJoinDF = joinedDF.drop("nationkey").drop("n_nationkey")
projectedJoinDF.show()

Your results should look similar to those shown below.

Step 7: Sort the result

Let's use the sort() method to sort the projectedJoinDF by account balance.

Add the following two lines of code to your Python program, save, and rerun the program.

orderedDF = projectedJoinDF.sort(col("acctbal"), ascending=False)
orderedDF.show()

Your results should look similar to those shown below.

Background

We've created multiple DataFrame objects in this tutorial for educational purposes. However, in practice most DataFrame API programmers chain many methods together. For example, the following code would produce the same result as the steps in the previous section.

nationDF = session.table("tpch.tiny.nation") \
            .drop("regionkey", "comment") \
            .rename("name", "nation_name") \
            .rename("nationkey", "n_nationkey")
 
apiSQL = session.table("tpch.tiny.customer") \
            .select("name", "acctbal", "nationkey") \
            .filter(col("acctbal") > 9900.0) \
            .join(nationDF, col("nationkey") == nationDF.n_nationkey) \
            .drop("nationkey").drop("n_nationkey") \
            .sort(col("acctbal"), ascending=False)
apiSQL.show()

There is a lot more going on with the PyStarburst implementation including the lazy execution model that the DataFrame API is known for. In a nutshell, this simply means that the program waits until it absolutely needs to run some code on the Starburst Galaxy engine.

If we replace all of the code we have written so far with the code above, only a single SQL statement will be sent to Starburst Galaxy — and again, you can find it in the Query insights page.

Step 1: Execute the edited code

Copy and paste the code above into the main.py file, replacing the previous code that was written in the last section of this tutorial. Do not delete the initial import and connection code.
Save the file updates and run the Python script as you've been doing.
Your results should look the same as those in Step 7 of the previous section.
Check the Query insights to see that only one query was generated.

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've learned a bit about how to use PyStarburst, we encourage you to explore the API documentation and write your own code.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!