1. Tutorial overview
Last Updated: 2024-08-15
Background
PyStarburst is a library that brings Python DataFrames to Starburst. If you're a data analyst or data scientist, this means that you can choose to use SQL or Python to analyze your data, all while taking advantage of the Trino distributed query engine.
Scope of tutorial
This tutorial will guide you through setting up a schema and tables in Starburst Galaxy. It will then show you how to use PyStarburst to answer some questions about the data. The equivalent SQL commands will be provided for reference.
Learning objectives
Once you've completed this tutorial, you will be able to:
- Use SQL to create a schema and tables in Starburst Galaxy
- Use PyStarburst to analyze data in Starburst Galaxy
Prerequisites
- You must have a Starburst Galaxy account set up before you can complete this tutorial. You can find instructions for creating a free account here.
- You must complete the tutorial Configure a Starburst Galaxy data lake catalog and schema to prepare your environment for this tutorial.
- You must have PyStarburst installed on your computer before beginning this tutorial. If you need help setting this up, we have a tutorial to guide you through the process.
2. Prepare environment
Background
In one of the prerequisites to this tutorial, you created a catalog called tmp_cat
which you connected to an S3 bucket owned by Starburst. The S3 bucket that you connected to also contains some flight data, which you will use in this tutorial. Before you can analyze the data, you'll need to complete a few tasks to prepare your environment.
First, you'll have to add a location privilege to the accountadmin role in Starburst Galaxy to ensure that you can write to the folder within the S3 bucket that contains the flight data. After that, you'll use the Query editor to execute some SQL to create the necessary schema and tables.
Step 1: Sign in to Starburst Galaxy and confirm role
- Log in to your Starburst Galaxy account.
Your current role is listed in the top right-hand corner of the screen.
- Check your role, to ensure that it is set to accountadmin.
- If it is set to anything else, use the drop-down menu to select the correct role.
Step 2: Add location privilege to accountadmin
You must add a location privilege for the accountadmin role to be able to write to the S3 bucket location. If you don't, you'll get an error when you try to create your schema.
- Expand the left-hand navigation menu, and select Access>>Roles and privileges.
- Select the accountadmin role.
- Select the Privileges tab.
- Click the Add privilege button.
- Select the Location tab.
- Copy and paste the following into the Storage location name box:
s3://starburst-tutorials/*
- Click the Save privileges button.
Step 3: Create schema
Now that you've added the location privilege, you're ready to create a schema for the aviation data.
- From the left-hand navigation menu, select Query>>Query editor.
- Run the following SQL in the Query editor. Note that we are specifying the location of the schema:
CREATE SCHEMA IF NOT EXISTS "tmp_cat"."aviation" WITH (location = 's3://starburst-tutorials/projects/aviation/')
Step 4: Create raw_flight
table
The data you'll be working with comprises four csv files: flights.csv
, airports.csv
, carriers.csv
, and plane-data.csv
. You're going to make one table for each file, beginning with flights.csv
. Review the following ERD to understand the tables and their logical relationships.
- Run the following SQL in the Query editor to create the
raw_flight
table.
CREATE TABLE tmp_cat.aviation.raw_flight (
month smallInt,
day_of_month smallInt,
day_of_week smallInt,
dep_time smallInt,
arr_time smallInt,
unique_carrier varchar(15),
flight_number smallInt,
tail_number varchar(15),
elapsed_time smallInt,
air_time smallInt,
arr_delay smallInt,
dep_delay smallInt,
origination varchar(15),
destination varchar(15),
distance smallInt,
taxi_in smallInt,
taxi_out smallInt,
cancelled varchar(15),
cancellation_code varchar(15),
diverted varchar(15)
) WITH (
external_location = 's3://starburst-tutorials/aviation/flights',
format = 'TEXTFILE',
textfile_field_separator = ','
);
Step 5: Create raw_carrier
table
Next up is the raw_carrier
table. The SQL for this one is much shorter.
- Run the following SQL in the Query editor to create the
raw_carrier
table.
CREATE TABLE tmp_cat.aviation.raw_carrier (
code varchar(15),
description varchar(150)
) WITH (
external_location = 's3://starburst-tutorials/aviation/carriers',
format = 'TEXTFILE',
textfile_field_separator = ','
);
Step 6: Create raw_airport
table
- Run the following SQL in the Query editor to create the
raw_airport
table.
CREATE TABLE tmp_cat.aviation.raw_airport (
code varchar(15),
description varchar(150),
city varchar(150),
state varchar(150),
country varchar(150),
lat decimal(10,8),
lng decimal(11,8)
) WITH (
external_location = 's3://starburst-tutorials/aviation/airports',
format = 'TEXTFILE',
textfile_field_separator = ','
);
Step 7: Create raw_plane
table
Just one more to go!
- Run the following SQL in the Query editor to create the
raw_plane
table.
CREATE TABLE tmp_cat.aviation.raw_plane (
tail_number varchar(15),
usage varchar(150),
manufacturer varchar(150),
issue_date varchar(150),
model varchar(150),
status varchar(150),
aircraft_type varchar(150),
engine_type varchar(150),
year_built smallint
) WITH (
external_location = 's3://starburst-tutorials/aviation/plane-data',
format = 'TEXTFILE',
textfile_field_separator = ','
);
3. Analyze the aviation data
Background
For this section of the tutorial, we'll be writing all of our code in a file named aviation.py
. We'll then use a terminal window to execute the code with the command python3 aviation.py
.
We will be exploring the aviation data and asking seven analytical questions. For each PyStarburst solution, we will also provide the equivalent SQL.
Step 1: Create aviation.py
and add imports
Your aviation.py
file needs to include code to import the required libraries and also connect to your Starburst Galaxy cluster. We've put that together for you – you simply have to add your Starburst Galaxy cluster details.
- Create a new file called
aviation.py
. - Copy and paste the following code into the file. Fill in your own host name, username, and password. Save the file when you are finished.
import trino
from pystarburst import Session
from pystarburst import functions as f
from pystarburst.functions import col, lag, round, row_number
from pystarburst.window import Window
db_parameters = {
"host": "<your host>",
"port": 443,
"http_scheme": "https",
"auth": trino.auth.BasicAuthentication("<your galaxy username>", "<your password>")
}
session = Session.builder.configs(db_parameters).create()
Question 1: How many rows are in the flight
table?
It's time to answer the first question. This is a simple solution whether you're using SQL or PyStarburst. In the case of PyStarburst, you just retrieve the raw_flight
table as a DataFrame and then call the count()
function.
- Add the following lines of code to the
aviation.py
file. Save the changes when complete.
allFs = session.table("tmp_cat.aviation.raw_flight")
print(allFs.count())
- Use the following command to execute the code:
python3 aviation.py
- Your solution should be 2056494.
- Here is the equivalent SQL:
SELECT count(*)
FROM tmp_cat.aviation.raw_flight;
Question 2: In which country are the majority of airports located?
This one is also relatively straightforward. After selecting the raw_airport
table, we'll use a group_by()
function and perform a count()
function on the aggregated rows. Finally, we'll order the results by the number of rows for each country and show a single result.
- Add the following lines of code to the
aviation.py
file. Save the changes when complete.
# get the whole table, aggregate & sort
mostAs = session \
.table("tmp_cat.aviation.raw_airport") \
.group_by("country").count() \
.sort("count", ascending=False)
mostAs.show(1)
- Use the following command to execute the code:
python3 aviation.py
- Your solution should be USA, with a count of 3363.
- Here is the equivalent SQL:
SELECT country, count() AS num_airports
FROM tmp_cat.aviation.raw_airport
GROUP BY country
ORDER BY num_airports DESC;
Question 3: What are the top 5 airline codes with the largest number of flights?
This one is very similar to the previous question.
- Add the following lines of code to the
aviation.py
file. Save the changes when complete.
# get the whole table, aggregate & sort
mostFs = session \
.table("tmp_cat.aviation.raw_flight") \
.group_by("unique_carrier").count() \
.rename("unique_carrier", "carr") \
.sort("count", ascending=False)
mostFs.show(5)
- Use the following command to execute the code:
python3 aviation.py
- Your solution should look similar to the following:
- Here is the equivalent SQL:
SELECT unique_carrier, count() as num_flights
FROM tmp_cat.aviation.raw_flight
GROUP BY unique_carrier
ORDER BY num_flights DESC
LIMIT 5;
Question 4: What are the names of the top 5 airlines with the largest number of flights?
This is the same as the previous question, except we're looking for the name of the airline rather than its code. We'll create a DataFrame for the raw_carrier
table to join on. Then, we'll take the code from Question 3 and chain a few more methods on it, namely the join()
.
- Add the following lines of code to the
aviation.py
file. Save the changes when complete.
# get all of the carriers
allCs = session.table("tmp_cat.aviation.raw_carrier")
# repurpose mostFs from above (or chain on it)
# to join the 2 DFs and sort the results that
# have already been grouped
top5CarrNm = mostFs \
.join(allCs, mostFs.carr == allCs.code) \
.drop("code") \
.sort("count", ascending=False)
top5CarrNm.show(5, 30)
- Use the following command to execute the code:
python3 aviation.py
- Your results should look similar to the following:
- Here is the equivalent SQL:
SELECT c.description, count() as num_flights
FROM tmp_cat.aviation.raw_flight f
JOIN tmp_cat.aviation.raw_carrier c
ON (f.unique_carrier = c.code)
GROUP BY c.description
ORDER BY num_flights DESC
LIMIT 5;
Question 5: What are the most common airplane models for flights over 1500 miles?
We'll be utilizing another join here.
- Add the following lines of code to the
aviation.py
file. Save the changes when complete.
# trimFs are flights projected & filtered
trimFs = session.table("tmp_cat.aviation.raw_flight") \
.rename("tail_number", "tNbr") \
.select("tNbr", "distance") \
.filter(col("distance") > 1500)
# trimPs are planes table projected & filtered
trimPs = session.table("tmp_cat.aviation.raw_plane") \
.select("tail_number", "model") \
.filter("model is not null")
# join, group & sort
q5Answer = trimFs \
.join(trimPs, trimFs.tNbr == trimPs.tail_number) \
.drop("tail_number") \
.group_by("model").count() \
.sort("count", ascending=False)
q5Answer.show()
- Use the following command to execute the code:
python3 aviation.py
- Your results should look similar to the following:
- Here is the equivalent SQL:
SELECT p.model, count() as num_flights
FROM tmp_cat.aviation.raw_flight f
JOIN tmp_cat.aviation.raw_plane p
ON (f.tail_number = p.tail_number)
WHERE f.distance > 1500
AND p.model IS NOT NULL
GROUP BY p.model
ORDER BY num_flights desc
LIMIT 10;
Question 6: What is the month over month percentage change in the number of flights departing from each airport?
Let's begin with the SQL solution for this question. This solution leverages Common Table Expressions (CTE) which can be conceptualized as temporary tables. We'll follow this general approach in the Python solution where the code will be explained a bit more.
SQL solution:
WITH agg_flights AS (
SELECT origination, month,
COUNT(*) AS num_flights
FROM tmp_cat.aviation.raw_flight
GROUP BY 1,2
),
change_flights AS (
SELECT origination, month, num_flights,
LAG(num_flights, 1)
OVER(PARTITION BY origination
ORDER BY month ASC)
AS num_flights_before
FROM agg_flights
)
SELECT origination, month, num_flights, num_flights_before,
ROUND((1.0 * (num_flights - num_flights_before)) /
(1.0 * (num_flights_before)), 2)
AS perc_change
FROM change_flights;
PyStarburst solution:
- We'll review the Python solution in three parts. The first emulates the creation of the
agg_flights
CTE above:
# temp DF holds counts for each originating airport
# by month
aggFlights = session.table("tmp_cat.aviation.raw_flight") \
.select("origination", "month") \
.rename("origination", "orig") \
.group_by("orig", "month").count() \
.rename("count", "num_fs")
- Next, we'll write a Window definition to create a new column showing the number of flights from the prior record in the sorted list of all flights for each specific originating airport:
# define a window specification
w1 = Window.partition_by("orig").order_by("month")
# add col to grab the prior row's nbr flights
changeFlights = aggFlights \
.withColumn("num_fs_b4", \
lag("num_fs",1).over(w1))
- Finally, we'll determine the percentage change in the number of flights from the prior month:
# add col for the percentage change
q6Answer = changeFlights \
.withColumn("perc_chg", \
round((1.0 * (col("num_fs") - col("num_fs_b4")) / \
(1.0 * col("num_fs_b4"))), 1))
q6Answer.show()
- Copy and paste all three parts of the Python code shown above into your file and execute it. Your results should look similar to the following:
Question 7: What are the top 3 routes departing from each airport?
This is another CTE solution where we will begin with the SQL solution and use that as a guide for the Python solution.
SQL solution:
WITH popular_routes AS (
SELECT origination, destination,
COUNT(*) AS num_flights
FROM raw_flight
GROUP BY 1, 2
),
ranked_routes AS (
SELECT origination, destination,
ROW_NUMBER()
OVER(PARTITION BY origination
ORDER BY num_flights DESC)
AS rank
FROM popular_routes
)
SELECT origination, destination, rank
FROM ranked_routes
WHERE rank <= 3
ORDER BY origination, rank;
PyStarburst solution:
- We'll again review the Python solution in three parts. The first emulates the creation of the
popular_routes
CTE above:
# determine counts from orig>dest pairs
popularRoutes = session \
.table("tmp_cat.aviation.raw_flight") \
.rename("origination", "orig") \
.rename("destination", "dest") \
.group_by("orig", "dest").count() \
.rename("count", "num_fs")
- Next, we'll create a Window definition that will help create a ranking value for all flights for an originating airport, sorted by the number of flights for each combination:
# define a window specification
w2 = Window.partition_by("orig") \
.order_by(col("num_fs").desc())
# add col to put the curr row's ranking in
rankedRoutes = popularRoutes \
.withColumn("rank", \
row_number().over(w2))
- Finally, we'll remove any ranking greater than 3 and sort to show the top values for each originating airport:
# just show up to 3 for each orig airport
q7Answer = rankedRoutes \
.filter(col("rank") <= 3) \
.sort("orig", "rank")
q7Answer.show(17);
- Copy and paste all three parts of the Python code shown above into your file and execute it. Your results should look similar to the .following:
4. Tutorial wrap-up
Tutorial complete
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've learned a bit about how to use PyStarburst for data analysis, we encourage you to explore the API documentation and write your own code.
Continuous learning
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Next steps
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Tutorials available
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!