Last Updated: 2024-08-14

Background

PyStarburst is a library that brings Python DataFrames to Starburst. It is designed to make building complex data pipelines easy and to allow Python developers to easily migrate their existing PySpark and Snowpark workloads to Starburst.

Scope of tutorial

This tutorial will walk you through setting up PyStarburst on your local machine and running some basic PyStarburst code against a Starburst Galaxy cluster.

Learning objectives

Once you've completed this tutorial, you will be able to:

Prerequisites

Background

As the name PyStarburst suggests, you need Python installed. You'll also need pip, the Python package manager. These instructions will use brew for installation.

Step 1: Install Python

brew install python
python3 –version

Step 2: Set up pip

python3 -m ensurepip
python3 -m pip install –upgrade pip
pip –version

Step 3: Install PyStarburst

Now that you've got pip installed, you can use it to install PyStarburst.

pip install pystarburst

Background

Starburst Galaxy provides a connection file for easy connection to your preferred cluster. All you need to do is download the file from your Starburst Galaxy account and edit the credentials to make it unique to you.

Step 1: Navigate to Partner connect in Starburst Galaxy

Step 2: Download PyStarburst connection file

Step 3: Edit the PyStarburst connection file

The file you just downloaded has your user and host information prefilled. You just have to supply the password associated with your Starburst Galaxy account. You'll also add some code to import required libraries and run a simple command to show the tpch.tiny.region table.

from pystarburst import functions as f
from pystarburst.functions import col
session.table("tpch.tiny.region").show()

Step 4: Test the connection file

It's time to test the boilerplate code in the connection file. You can execute this code right in your terminal window.

python3 main.py

Step 5: Review Query insights in Starburst Galaxy

You just used the API to run a SELECT statement on the tpch.tiny.region table. In fact, if you take a look at Query insights in your Starburst Galaxy account, you'll see that the query was run.

Background

Thus far you've connected to your Starburst Galaxy cluster and used the PyStarburst API to run a simple SELECT * on sample tpch data. In this section, you'll continue exploring the API library and become familiar with some of the commands available to you. Please note that all of the code you write will go in the main.py file.

Step 1: Select a full table

You already saw how to use the show() function to select a table in the last section of this tutorial. This time you're going to also use the table() function to first grab the tpch.tiny.customer table before using the show() function to display it. Note: if no argument is supplied to the show() function, it defaults to listing ten rows.

custDF = session.table("tpch.tiny.customer")
custDF.show()
python3 main.py

Step 2: Use projection

As promised, we can use the select() method to limit the number of columns returned from the DataFrame. There is a compensatory drop() function that would be better if we wanted to keep most of the columns and only remove a few.

projectedDF = custDF.select(custDF.name, custDF.acctbal, custDF.nationkey)
projectedDF.show()

Step 3: Filter the rows

You can use the filter() function to filter the rows returned. In our case, we will use it to return the customer records with the highest account balance values.

filteredDF = projectedDF.filter(projectedDF.acctbal > 9900.0)
filteredDF.show(100)

Step 4: Select a second table

Later, we are going to join the customer records to the nation table to get the name of the country, not just a key value for it. In this step, we are chaining methods together instead of assigning each output to a distinct variable as we have done up until now.

nationDF = session.table("tpch.tiny.nation") \
                  .drop("regionkey", "comment") \
                  .rename("name", "nation_name") \
                  .rename("nationkey", "n_nationkey")
nationDF.show()

Step 5: Join two tables

Now we can use the join() function to join two DataFrames using their nationkey values.

joinedDF = filteredDF.join(nationDF, filteredDF.nationkey == nationDF.n_nationkey)
joinedDF.show()

Step 6: Project joined result

We can use the drop() function again to clean up the unwanted columns in our join.

projectedJoinDF = joinedDF.drop("nationkey").drop("n_nationkey")
projectedJoinDF.show()

Step 7: Sort the result

Let's use the sort() method to sort the projectedJoinDF by account balance.

orderedDF = projectedJoinDF.sort(col("acctbal"), ascending=False)
orderedDF.show()

Background

We've created multiple DataFrame objects in this tutorial for educational purposes. However, in practice most DataFrame API programmers chain many methods together. For example, the following code would produce the same result as the steps in the previous section.

nationDF = session.table("tpch.tiny.nation") \
            .drop("regionkey", "comment") \
            .rename("name", "nation_name") \
            .rename("nationkey", "n_nationkey")
 
apiSQL = session.table("tpch.tiny.customer") \
            .select("name", "acctbal", "nationkey") \
            .filter(col("acctbal") > 9900.0) \
            .join(nationDF, col("nationkey") == nationDF.n_nationkey) \
            .drop("nationkey").drop("n_nationkey") \
            .sort(col("acctbal"), ascending=False)
apiSQL.show()

There is a lot more going on with the PyStarburst implementation including the lazy execution model that the DataFrame API is known for. In a nutshell, this simply means that the program waits until it absolutely needs to run some code on the Starburst Galaxy engine.

If we replace all of the code we have written so far with the code above, only a single SQL statement will be sent to Starburst Galaxy — and again, you can find it in the Query insights page.

Step 1: Execute the edited code

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've learned a bit about how to use PyStarburst, we encourage you to explore the API documentation and write your own code.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.