Use PyStarburst for data analysis

32 mins remaining

1. Tutorial overview

Last Updated: 2024-08-15

Background

PyStarburst is a library that brings Python DataFrames to Starburst. If you're a data analyst or data scientist, this means that you can choose to use SQL or Python to analyze your data, all while taking advantage of the Trino distributed query engine.

Scope of tutorial

This tutorial will guide you through setting up a schema and tables in Starburst Galaxy. It will then show you how to use PyStarburst to answer some questions about the data. The equivalent SQL commands will be provided for reference.

Learning objectives

Once you've completed this tutorial, you will be able to:

  • Use SQL to create a schema and tables in Starburst Galaxy
  • Use PyStarburst to analyze data in Starburst Galaxy

Prerequisites

  • You must have a Starburst Galaxy account set up before you can complete this tutorial. You can find instructions for creating a free account here.
  • You must complete the tutorial Configure a Starburst Galaxy data lake catalog and schema to prepare your environment for this tutorial.
  • You must have PyStarburst installed on your computer before beginning this tutorial. If you need help setting this up, we have a tutorial to guide you through the process.

2. Prepare environment

Background

In one of the prerequisites to this tutorial, you created a catalog called tmp_cat which you connected to an S3 bucket owned by Starburst. The S3 bucket that you connected to also contains some flight data, which you will use in this tutorial. Before you can analyze the data, you'll need to complete a few tasks to prepare your environment.

First, you'll have to add a location privilege to the accountadmin role in Starburst Galaxy to ensure that you can write to the folder within the S3 bucket that contains the flight data. After that, you'll use the Query editor to execute some SQL to create the necessary schema and tables.

Step 1: Sign in to Starburst Galaxy and confirm role

  • Log in to your Starburst Galaxy account.

Your current role is listed in the top right-hand corner of the screen.

  • Check your role, to ensure that it is set to accountadmin.
  • If it is set to anything else, use the drop-down menu to select the correct role.

Step 2: Add location privilege to accountadmin

You must add a location privilege for the accountadmin role to be able to write to the S3 bucket location. If you don't, you'll get an error when you try to create your schema.

  • Expand the left-hand navigation menu, and select Access>>Roles and privileges.
  • Select the accountadmin role.
  • Select the Privileges tab.
  • Click the Add privilege button.
  • Select the Location tab.
  • Copy and paste the following into the Storage location name box: s3://starburst-tutorials/*
  • Click the Save privileges button.

Step 3: Create schema

Now that you've added the location privilege, you're ready to create a schema for the aviation data.

  • From the left-hand navigation menu, select Query>>Query editor.
  • Run the following SQL in the Query editor. Note that we are specifying the location of the schema:
CREATE SCHEMA IF NOT EXISTS "tmp_cat"."aviation" WITH (location = 's3://starburst-tutorials/projects/aviation/')

Step 4: Create raw_flight table

The data you'll be working with comprises four csv files: flights.csv, airports.csv, carriers.csv, and plane-data.csv. You're going to make one table for each file, beginning with flights.csv. Review the following ERD to understand the tables and their logical relationships.

  • Run the following SQL in the Query editor to create the raw_flight table.
CREATE TABLE tmp_cat.aviation.raw_flight (
    month smallInt,
    day_of_month smallInt,
    day_of_week smallInt,
    dep_time smallInt,
    arr_time smallInt,
    unique_carrier varchar(15),
    flight_number smallInt,
    tail_number varchar(15),
    elapsed_time smallInt,
    air_time smallInt,
    arr_delay smallInt,
    dep_delay smallInt,
    origination varchar(15),
    destination varchar(15),
    distance smallInt,
    taxi_in smallInt,
    taxi_out smallInt,
    cancelled varchar(15),
    cancellation_code varchar(15),
    diverted varchar(15)
) WITH (
    external_location = 's3://starburst-tutorials/aviation/flights',
    format = 'TEXTFILE',
    textfile_field_separator = ','   
);

Step 5: Create raw_carrier table

Next up is the raw_carrier table. The SQL for this one is much shorter.

  • Run the following SQL in the Query editor to create the raw_carrier table.
CREATE TABLE tmp_cat.aviation.raw_carrier (
    code varchar(15),
    description varchar(150)
) WITH (
    external_location = 's3://starburst-tutorials/aviation/carriers',
    format = 'TEXTFILE',
    textfile_field_separator = ','   
);

Step 6: Create raw_airport table

  • Run the following SQL in the Query editor to create the raw_airport table.
CREATE TABLE tmp_cat.aviation.raw_airport (
    code varchar(15),
    description varchar(150),
    city varchar(150),
    state varchar(150),
    country varchar(150),
    lat decimal(10,8),
    lng decimal(11,8)
) WITH (
    external_location = 's3://starburst-tutorials/aviation/airports',
    format = 'TEXTFILE',
    textfile_field_separator = ','   
);

Step 7: Create raw_plane table

Just one more to go!

  • Run the following SQL in the Query editor to create the raw_plane table.
CREATE TABLE tmp_cat.aviation.raw_plane (
    tail_number varchar(15),
    usage varchar(150),
    manufacturer varchar(150),
    issue_date varchar(150),
    model varchar(150),
    status varchar(150),
    aircraft_type varchar(150),
    engine_type varchar(150),
    year_built smallint
) WITH (
    external_location = 's3://starburst-tutorials/aviation/plane-data',
    format = 'TEXTFILE',
    textfile_field_separator = ','   
);

3. Analyze the aviation data

Background

For this section of the tutorial, we'll be writing all of our code in a file named aviation.py. We'll then use a terminal window to execute the code with the command python3 aviation.py.

We will be exploring the aviation data and asking seven analytical questions. For each PyStarburst solution, we will also provide the equivalent SQL.

Step 1: Create aviation.py and add imports

Your aviation.py file needs to include code to import the required libraries and also connect to your Starburst Galaxy cluster. We've put that together for you – you simply have to add your Starburst Galaxy cluster details.

  • Create a new file called aviation.py.
  • Copy and paste the following code into the file. Fill in your own host name, username, and password. Save the file when you are finished.
import trino
from pystarburst import Session
from pystarburst import functions as f
from pystarburst.functions import col, lag, round, row_number
from pystarburst.window import Window
 
db_parameters = {
    "host": "<your host>",
    "port": 443,
    "http_scheme": "https",
    "auth": trino.auth.BasicAuthentication("<your galaxy username>", "<your password>")
}
session = Session.builder.configs(db_parameters).create()

Question 1: How many rows are in the flight table?

It's time to answer the first question. This is a simple solution whether you're using SQL or PyStarburst. In the case of PyStarburst, you just retrieve the raw_flight table as a DataFrame and then call the count() function.

  • Add the following lines of code to the aviation.py file. Save the changes when complete.
allFs = session.table("tmp_cat.aviation.raw_flight")
print(allFs.count())
  • Use the following command to execute the code:
python3 aviation.py
  • Your solution should be 2056494.
  • Here is the equivalent SQL:
SELECT count(*) 
  FROM tmp_cat.aviation.raw_flight;

Question 2: In which country are the majority of airports located?

This one is also relatively straightforward. After selecting the raw_airport table, we'll use a group_by() function and perform a count() function on the aggregated rows. Finally, we'll order the results by the number of rows for each country and show a single result.

  • Add the following lines of code to the aviation.py file. Save the changes when complete.
# get the whole table, aggregate & sort
mostAs = session \
    .table("tmp_cat.aviation.raw_airport") \
    .group_by("country").count() \
    .sort("count", ascending=False)
mostAs.show(1)
  • Use the following command to execute the code:
python3 aviation.py
  • Your solution should be USA, with a count of 3363.
  • Here is the equivalent SQL:
SELECT country, count() AS num_airports
  FROM tmp_cat.aviation.raw_airport
 GROUP BY country
 ORDER BY num_airports DESC;

Question 3: What are the top 5 airline codes with the largest number of flights?

This one is very similar to the previous question.

  • Add the following lines of code to the aviation.py file. Save the changes when complete.
# get the whole table, aggregate & sort
mostFs = session \
    .table("tmp_cat.aviation.raw_flight") \
    .group_by("unique_carrier").count() \
    .rename("unique_carrier", "carr") \
    .sort("count", ascending=False)
mostFs.show(5)
  • Use the following command to execute the code:
python3 aviation.py
  • Your solution should look similar to the following:

  • Here is the equivalent SQL:
SELECT unique_carrier, count() as num_flights
  FROM tmp_cat.aviation.raw_flight
 GROUP BY unique_carrier 
 ORDER BY num_flights DESC
 LIMIT 5;

Question 4: What are the names of the top 5 airlines with the largest number of flights?

This is the same as the previous question, except we're looking for the name of the airline rather than its code. We'll create a DataFrame for the raw_carrier table to join on. Then, we'll take the code from Question 3 and chain a few more methods on it, namely the join().

  • Add the following lines of code to the aviation.py file. Save the changes when complete.
# get all of the carriers
allCs = session.table("tmp_cat.aviation.raw_carrier")
 
# repurpose mostFs from above (or chain on it) 
#   to join the 2 DFs and sort the results that
#   have already been grouped
top5CarrNm = mostFs \
    .join(allCs, mostFs.carr == allCs.code) \
    .drop("code") \
    .sort("count", ascending=False)
top5CarrNm.show(5, 30)
  • Use the following command to execute the code:
python3 aviation.py
  • Your results should look similar to the following:

  • Here is the equivalent SQL:
SELECT c.description, count() as num_flights
  FROM tmp_cat.aviation.raw_flight  f 
  JOIN tmp_cat.aviation.raw_carrier c
    ON (f.unique_carrier = c.code)
 GROUP BY c.description 
 ORDER BY num_flights DESC
 LIMIT 5;

Question 5: What are the most common airplane models for flights over 1500 miles?

We'll be utilizing another join here.

  • Add the following lines of code to the aviation.py file. Save the changes when complete.
# trimFs are flights projected & filtered
trimFs = session.table("tmp_cat.aviation.raw_flight") \
    .rename("tail_number", "tNbr") \
    .select("tNbr", "distance") \
    .filter(col("distance") > 1500) 
 
# trimPs are planes table projected & filtered
trimPs = session.table("tmp_cat.aviation.raw_plane") \
    .select("tail_number", "model") \
    .filter("model is not null")
 
# join, group & sort
q5Answer = trimFs \
    .join(trimPs, trimFs.tNbr == trimPs.tail_number) \
    .drop("tail_number") \
    .group_by("model").count() \
    .sort("count", ascending=False) 
q5Answer.show()
  • Use the following command to execute the code:
python3 aviation.py
  • Your results should look similar to the following:

  • Here is the equivalent SQL:
SELECT p.model, count() as num_flights
  FROM tmp_cat.aviation.raw_flight f 
  JOIN tmp_cat.aviation.raw_plane  p
    ON (f.tail_number = p.tail_number)
 WHERE f.distance > 1500
   AND p.model IS NOT NULL
 GROUP BY p.model
 ORDER BY num_flights desc
 LIMIT 10;

Question 6: What is the month over month percentage change in the number of flights departing from each airport?

Let's begin with the SQL solution for this question. This solution leverages Common Table Expressions (CTE) which can be conceptualized as temporary tables. We'll follow this general approach in the Python solution where the code will be explained a bit more.

SQL solution:

WITH agg_flights AS (
SELECT origination, month, 
       COUNT(*) AS num_flights
  FROM tmp_cat.aviation.raw_flight
 GROUP BY 1,2
),
  
change_flights AS (
SELECT origination, month, num_flights,
       LAG(num_flights, 1)
         OVER(PARTITION BY origination
                ORDER BY month ASC) 
           AS num_flights_before
  FROM agg_flights
)
  
SELECT origination, month, num_flights, num_flights_before,
       ROUND((1.0 * (num_flights - num_flights_before)) / 
             (1.0 * (num_flights_before)), 2)
          AS perc_change
  FROM change_flights;

PyStarburst solution:

  • We'll review the Python solution in three parts. The first emulates the creation of the agg_flights CTE above:
# temp DF holds counts for each originating airport 
#   by month
aggFlights = session.table("tmp_cat.aviation.raw_flight") \
    .select("origination", "month") \
    .rename("origination", "orig") \
    .group_by("orig", "month").count() \
    .rename("count", "num_fs")
  • Next, we'll write a Window definition to create a new column showing the number of flights from the prior record in the sorted list of all flights for each specific originating airport:
# define a window specification
w1 = Window.partition_by("orig").order_by("month")
 
# add col to grab the prior row's nbr flights
changeFlights = aggFlights \
    .withColumn("num_fs_b4", \
        lag("num_fs",1).over(w1))
  • Finally, we'll determine the percentage change in the number of flights from the prior month:
# add col for the percentage change
q6Answer = changeFlights \
    .withColumn("perc_chg", \
        round((1.0 * (col("num_fs") - col("num_fs_b4")) / \
              (1.0 * col("num_fs_b4"))), 1))
q6Answer.show()
  • Copy and paste all three parts of the Python code shown above into your file and execute it. Your results should look similar to the following:

Question 7: What are the top 3 routes departing from each airport?

This is another CTE solution where we will begin with the SQL solution and use that as a guide for the Python solution.

SQL solution:

WITH popular_routes AS (
SELECT origination, destination,
       COUNT(*) AS num_flights
  FROM raw_flight
 GROUP BY 1, 2
),
  
ranked_routes AS (
SELECT origination, destination,
       ROW_NUMBER() 
         OVER(PARTITION BY origination 
               ORDER BY num_flights DESC) 
           AS rank
  FROM popular_routes
)
  
SELECT origination, destination, rank
  FROM ranked_routes
  WHERE rank <= 3
  ORDER BY origination, rank;

PyStarburst solution:

  • We'll again review the Python solution in three parts. The first emulates the creation of the popular_routes CTE above:
# determine counts from orig>dest pairs
popularRoutes = session \
    .table("tmp_cat.aviation.raw_flight") \
    .rename("origination", "orig") \
    .rename("destination", "dest") \
    .group_by("orig", "dest").count() \
    .rename("count", "num_fs")
  • Next, we'll create a Window definition that will help create a ranking value for all flights for an originating airport, sorted by the number of flights for each combination:
# define a window specification
w2 = Window.partition_by("orig") \
    .order_by(col("num_fs").desc())
 
# add col to put the curr row's ranking in
rankedRoutes = popularRoutes \
    .withColumn("rank", \
        row_number().over(w2))
  • Finally, we'll remove any ranking greater than 3 and sort to show the top values for each originating airport:
# just show up to 3 for each orig airport
q7Answer = rankedRoutes \
    .filter(col("rank") <= 3) \
    .sort("orig", "rank")
q7Answer.show(17);
  • Copy and paste all three parts of the Python code shown above into your file and execute it. Your results should look similar to the .following:

4. Tutorial wrap-up

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've learned a bit about how to use PyStarburst for data analysis, we encourage you to explore the API documentation and write your own code.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!

Cookie Notice

This site uses cookies for performance, analytics, personalization and advertising purposes. For more information about how we use cookies please see our Cookie Policy.

Manage Consent Preferences

Essential/Strictly Necessary Cookies

Required

These cookies are essential in order to enable you to move around the website and use its features, such as accessing secure areas of the website.

Analytical/Performance Cookies

These are analytics cookies that allow us to collect information about how visitors use a website, for instance which pages visitors go to most often, and if they get error messages from web pages.

Functional/Preference Cookies

These cookies allow our website to properly function and in particular will allow you to use its more personal features.

Targeting/Advertising Cookies

These cookies are used by third parties to build a profile of your interests and show you relevant adverts on other sites.