How to build data transformations with Python, Ibis, and Starburst Galaxy

  • Duy Huynh

    Duy Huynh

    Solutions Architect

    Starburst

Share

Data processing and analysis are vital aspects of data science and analytics. With a variety of tools and frameworks available, it becomes essential to choose the right tools for the job. 

The combination of Starburst Galaxy with Ibis provides teams with the ability to build data-intensive applications on an open and composable backend architecture. By connecting Galaxy with Ibis, we can extract data from our cloud data lake (or other cloud data sources), process and analyze the data using Galaxy’s optimized Trino clusters, and then surface that data to our end users with Ibis. 

Ibis is a powerful Python library that simplifies analytical workflows by providing a pandas-like API. This enables data scientists to express complex data manipulations and computations with ease. Its biggest asset is the ability to plug into any analytical backend system. Trino, on the other hand, is a fast, distributed SQL query engine designed for big data. It ensures that your analytics queries are executed efficiently and rapidly. Starburst Galaxy is the fastest and easiest way to get up and running with Trino. 

In this post, we’re going to download and import NYC trip data from January 2023, and learn some Ibis functions/expressions to create new insights and tables in your data lake!

Prerequisites

Before diving in, ensure you that have the following setup in place:

  • Python 3
  • A Starburst Galaxy account
    • A data lake catalog connected to Starburst (AWS S3, ADLS Gen2, or GCS)
  • Necessary libraries:
    • trino
    • Ibis-framework[trino]
    • Jupyter

Installing the libraries:

You can install the `ibis-framework[trino]` and`trino` Python libraries using the following command in your terminal:

pip install trino

pip install ibis-framework[trino]

pip install jupyter

Setting up Starburst Galaxy:

Follow the instructions in docs to create your free Galaxy account. Ensure that you have a working Starburst cluster and can execute SQL queries using the Trino CLI.

You will also need to connect an AWS S3 or similar data lake catalog to your Galaxy cluster. Details on how to set this up can be found here

Prepare your files for querying

    Download NYC Taxi trip Parquet

    In our notebook today, we will be using the NYC Taxi trips public dataset the city kindly provides on the internet. The link is here: NYC Taxi Trip RecordsWe will be downloading the January – Yellow Taxi Trip Records file alongside the zonelookup table. 

    Note: You can use other months, but do so at your own discretion since the results will likely differ from the notebook provided today. 

    Expand 2023 and click the follow link to download the file.

    We need one more file from the NYC taxi trip records page. Scroll down until you see the follow heading and download the taxi zone lookup table.

    You should now have two files:

    1. Yellow_tripdata_2023-01.parquet
    2. taxi_zone_lookup.csv

    Upload the trip datasets to AWS S3

    In your S3 bucket (or equivalent), create a folder called nyctaxi. For this tutorial, we’re going to create two sub-folders in this folder in order for Starburst schema discovery to pick up both files as “Tables” in the Starburst metastore. This allows us to easily automate new file discovery within your data lake. If you’re using Azure Data Lake Gen2 or GCS, you can follow similar steps within your respective cloud storage systems.

    Place the files you downloaded in their respective folders and upload them!

    Use schema discovery in Starburst Galaxy

    Next, we’re going to register the files as tables. To do this, we’re going to need the S3 URI from your root folder nyctax. You can find this by using the “Copy S3 URI” button pictured below:

    Now, we’re going to go into the Starburst Galaxy domain that you created as a prerequisite to this tutorial. To navigate to schema discovery in Starburst Galaxy, following these steps:

    • Click on Catalogs on the left side of your user interface. 

    •  Next, click on the S3 catalog that you created during the set up phase. From the schema discovery tab we are going to hit Run discovery.

    • Copy your S3 URI into the correct field as shown below. Expand advanced settings and click Run full discovery.

    • The results should show two tables found. After that, hit Apply discovery and the tables will automatically be created and ready to be queried both inside and outside Starburst Galaxy!

    Connecting Ibis to Starburst Galaxy

    Once Starburst is set up, you’re ready to launch your jupyter notebook environment where we will be playing with Ibis in interactive mode. 

    Launch jupyter by typing the following command in your terminal:

    jupyter notebook

    Then, create a new notebook with python3 kernel and connect Ibis to Starburst using the following code snippet :

    import ibis

    Replace ‘user’, ‘hostname’, ‘database’, ‘schema’ and ‘your_catalog’ with your Starburst catalog details. 

    con = ibis.trino.connect(
      user="user",
      host="hostname",
      port=443,
      database="sample",
      schema="demo",
    )

    Note: If you’re an advanced user and already have OAuth2 setup in your Galaxy account. You’ll need the following additional lines in order to make the connection work – importing Outh2 authentication from Trino, alongside adding the roles and http_scheme kwargs.

    import ibis
    from trino.auth import OAuth2Authentication

    Replace ‘user’, ‘hostname’, ‘database’, ‘schema’ and ‘your_catalog’ with your Starburst catalog details. Lastly, a roles keyword argument is needed to specify access for the user coming in from Ibis. In our example, we are using accountadmin for simplicity.

    con = ibis.trino.connect(
      user="user",
      host="hostname",
      port=443,
      database="sample",
      schema="demo",
      auth=OAuth2Authentication(),
      roles=accountadmin,
      http_scheme="https"
    )

    Exploring and analyzing data

    Now that you’re set up, watch the video or download the Jupyter notebook to learn how to explore and analyze data with Ibis and Starburst.

    Open the notebook: https://duyhuynh.ca/sbibis/

    Or download the notebook here: https://github.com/dhuynh/sbibis/blob/main/notebook/ibis_starburst_NYCTaxi_final.ipynb

     

    Conclusion

    Ibis and Starburst Galaxy together form a great duo, enabling users to harness the power of Trino’s high-speed querying capabilities with Ibis’s user-friendly pandas-like API. This guide should help you kick-start your journey with Ibis and Starburst, facilitating easier and more efficient data analytics workflows.

    Other helpful resources

    Happy notebooking and exploring the realms of data + analytics with Ibis and Starburst!