1. Tutorial overview
Last Updated: 2024-08-14
Background
PyStarburst is a library that brings Python DataFrames to Starburst. It is designed to make building complex data pipelines easy and to allow Python developers to easily migrate their existing PySpark and Snowpark workloads to Starburst.
Scope of tutorial
This tutorial will walk you through setting up PyStarburst on your local machine and running some basic PyStarburst code against a Starburst Galaxy cluster.
Learning objectives
Once you've completed this tutorial, you will be able to:
- Install PyStarburst on your local machine
- Use PyStarburst to run basic queries against a Starburst Galaxy cluster
Prerequisites
- You must have a Starburst Galaxy account set up before you can complete this tutorial. You can find instructions for creating a free account here.
2. Set up environment
Background
As the name PyStarburst suggests, you need Python installed. You'll also need pip
, the Python package manager. These instructions will use brew
for installation.
Step 1: Install Python
- Open a terminal window.
- Run the following command to install the latest version of Python:
brew install python
- Run the following command to check that the installation was successful:
python3 –version
Step 2: Set up pip
- Run the following commands, one at a time to install
pip
and check the installation:
python3 -m ensurepip
python3 -m pip install –upgrade pip
pip –version
Step 3: Install PyStarburst
Now that you've got pip
installed, you can use it to install PyStarburst.
- Run the following command to install PyStarburst:
pip install pystarburst
3. Connect to your Starburst Galaxy cluster
Background
Starburst Galaxy provides a connection file for easy connection to your preferred cluster. All you need to do is download the file from your Starburst Galaxy account and edit the credentials to make it unique to you.
Step 1: Navigate to Partner connect in Starburst Galaxy
- Sign in to your Starburst Galaxy account.
- Expand the left-hand navigation menu, and select Partner connect.
Step 2: Download PyStarburst connection file
- Scroll down to the Drivers & Clients section of the Partner connect page.
- Select the PyStarburst tile.
- Select the cluster you want to connect to. Make sure that it contains the
tpch
catalog, as we will be using that catalog throughout this tutorial. - Click the Download connection file button. A file called
main.py
will be downloaded.
Step 3: Edit the PyStarburst connection file
The file you just downloaded has your user and host information prefilled. You just have to supply the password associated with your Starburst Galaxy account. You'll also add some code to import required libraries and run a simple command to show the tpch.tiny.region
table.
- Open the
main.py
file you downloaded. - On line 10, replace
with your Starburst Galaxy account password. - Add the following lines after line 2. This will ensure the appropriate functions are imported.
from pystarburst import functions as f
from pystarburst.functions import col
- Finally, replace the last line in the file with the following:
session.table("tpch.tiny.region").show()
- Save the changes to the file.
Step 4: Test the connection file
It's time to test the boilerplate code in the connection file. You can execute this code right in your terminal window.
- Run the following command in the terminal window to execute the code:
python3 main.py
- Your results should look similar to the image below.
Step 5: Review Query insights in Starburst Galaxy
You just used the API to run a SELECT
statement on the tpch.tiny.region
table. In fact, if you take a look at Query insights in your Starburst Galaxy account, you'll see that the query was run.
- In Starburst Galaxy, use the left-hand navigation menu to select Query>>Query insights.
- You should see the query that was just run via the API at the top of the list.
4. Explore the PyStarburst API
Background
Thus far you've connected to your Starburst Galaxy cluster and used the PyStarburst API to run a simple SELECT * on sample tpch
data. In this section, you'll continue exploring the API library and become familiar with some of the commands available to you. Please note that all of the code you write will go in the main.py
file.
Step 1: Select a full table
You already saw how to use the show()
function to select a table in the last section of this tutorial. This time you're going to also use the table()
function to first grab the tpch.tiny.customer
table before using the show()
function to display it. Note: if no argument is supplied to the show()
function, it defaults to listing ten rows.
- Add the following two lines to the end of the
main.py
file, then save.
custDF = session.table("tpch.tiny.customer")
custDF.show()
- Run the code again using the following command:
python3 main.py
- Your results should look similar to the image below. The output isn't very pretty because the text gets wrapped in the CLI. We'll solve for this in a moment by limiting the columns that we fetch.
Step 2: Use projection
As promised, we can use the select()
method to limit the number of columns returned from the DataFrame. There is a compensatory drop()
function that would be better if we wanted to keep most of the columns and only remove a few.
- Add the following two lines of code to your Python program, save, and rerun the program.
projectedDF = custDF.select(custDF.name, custDF.acctbal, custDF.nationkey)
projectedDF.show()
- The results, similar to the image below, should look much more manageable.
Step 3: Filter the rows
You can use the filter()
function to filter the rows returned. In our case, we will use it to return the customer records with the highest account balance values.
- Add the following two lines of code to your Python program, save, and rerun the program. Notice that we've added an argument of 100 to the
show()
function.
filteredDF = projectedDF.filter(projectedDF.acctbal > 9900.0)
filteredDF.show(100)
- Your results should look similar to those shown below. Notice that, even though we requested 100 records, there were only 7 that met the criteria.
Step 4: Select a second table
Later, we are going to join the customer
records to the nation
table to get the name of the country, not just a key value for it. In this step, we are chaining methods together instead of assigning each output to a distinct variable as we have done up until now.
- Add the following two lines of code to your Python program, save, and rerun the program.
nationDF = session.table("tpch.tiny.nation") \
.drop("regionkey", "comment") \
.rename("name", "nation_name") \
.rename("nationkey", "n_nationkey")
nationDF.show()
- Your results should look similar to those shown below. Note that the
rename()
function simply changes the column name.
Step 5: Join two tables
Now we can use the join()
function to join two DataFrames using their nationkey
values.
- Add the following two lines of code to your Python program, save, and rerun the program.
joinedDF = filteredDF.join(nationDF, filteredDF.nationkey == nationDF.n_nationkey)
joinedDF.show()
- Your results should look similar to those shown below. Notice that the
join()
function did not remove unwanted columns. We have all columns from both DataFrames.
Step 6: Project joined result
We can use the drop()
function again to clean up the unwanted columns in our join.
- Add the following two lines of code to your Python program, save, and rerun the program.
projectedJoinDF = joinedDF.drop("nationkey").drop("n_nationkey")
projectedJoinDF.show()
- Your results should look similar to those shown below.
Step 7: Sort the result
Let's use the sort()
method to sort the projectedJoinDF
by account balance.
- Add the following two lines of code to your Python program, save, and rerun the program.
orderedDF = projectedJoinDF.sort(col("acctbal"), ascending=False)
orderedDF.show()
- Your results should look similar to those shown below.
5. Putting it all together
Background
We've created multiple DataFrame objects in this tutorial for educational purposes. However, in practice most DataFrame API programmers chain many methods together. For example, the following code would produce the same result as the steps in the previous section.
nationDF = session.table("tpch.tiny.nation") \
.drop("regionkey", "comment") \
.rename("name", "nation_name") \
.rename("nationkey", "n_nationkey")
apiSQL = session.table("tpch.tiny.customer") \
.select("name", "acctbal", "nationkey") \
.filter(col("acctbal") > 9900.0) \
.join(nationDF, col("nationkey") == nationDF.n_nationkey) \
.drop("nationkey").drop("n_nationkey") \
.sort(col("acctbal"), ascending=False)
apiSQL.show()
There is a lot more going on with the PyStarburst implementation including the lazy execution model that the DataFrame API is known for. In a nutshell, this simply means that the program waits until it absolutely needs to run some code on the Starburst Galaxy engine.
If we replace all of the code we have written so far with the code above, only a single SQL statement will be sent to Starburst Galaxy — and again, you can find it in the Query insights page.
Step 1: Execute the edited code
- Copy and paste the code above into the
main.py
file, replacing the previous code that was written in the last section of this tutorial. Do not delete the initial import and connection code. - Save the file updates and run the Python script as you've been doing.
- Your results should look the same as those in Step 7 of the previous section.
- Check the Query insights to see that only one query was generated.
6. Tutorial wrap-up
Tutorial complete
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've learned a bit about how to use PyStarburst, we encourage you to explore the API documentation and write your own code.
Continuous learning
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Next steps
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Tutorials available
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!