How to use Starburst and Airflow to create resilient data pipelines
Share
More deployment options
we executed a single DAG that executed a series of different tasks. This involved using the Starburst Galaxy engine to query data, printing the number of records returned in the query (using python) and finally execute a data quality check via the SQLColumnCheckOperator.A good data pipeline is one that you forget about. It operates quietly, reliably, in the background. However, as any experienced data engineer knows that’s rarely the case – especially if it was a pipeline built with an early version of Trino.
Trino was originally designed for fast, interactive queries, but struggled to support batch and ETL workloads due to memory constraints and the intolerance for individual task failures. This meant that when batch jobs were executed successfully Trino was faster than traditional ETL engines like Spark, but Trino also was more costly to run due to poor reliability.
The introduction of fault-tolerant execution in Trino, and Starburst, removes this limitation – making data pipelines fast and reliable.
How does fault-tolerant execution mode work?
With fault-tolerant execution mode, you get advanced resource-aware task scheduling and granular retries at the task and query level that enable you to confidently create data pipelines.
When queries are run on a FTE-enabled cluster, query retry functionality is made possible by intermediate exchange data that is spooled and can be re-used by multiple different worker nodes in the cluster. Thus, even when a single query requires more memory than is currently available within the cluster, it is still able to succeed.
Used alongside an orchestration tool like Apache Airflow, you are able to streamline the management and monitoring of your data pipelines.
The following blog will walk you through how to build resilient data pipelines with Airflow and Starburst Galaxy (the fastest and easiest way to use Trino) by leveraging Docker.
Step One: Set up Starburst Galaxy and Airflow
Make sure you have the following tools installed on your machine before proceeding.
- Python installed on your machine.
- Docker installed on your machine.
- Git installed on your machine.
- Access to Starburst Galaxy with a cluster running in “Fault Tolerant” mode and the “tpch” catalog attached. Sign up for free here.
Installing Apache Airflow
- Clone the github repository:
git clone https://github.com/YCat33/galaxy_airflow_quickstart.git
2. Change into the git repository directory:
cd galaxy_airflow_quickstart
3. Navigate to your Galaxy domain
4. Locate your connection variables by navigating to the Clusters page in the Galaxy UI
5. Click “Connection info” button on the right to view your connection information
6. For purposes of this demo, we are going to use the connection strings specific to “Python”. Run the below to allow the bash setup script to be executable:
chmod +x setup.sh
7. Run the bash script below by leveraging the connection variables from step 5:
./setup.sh '<host>' '<user>' '<password>'
This script performs the following steps:
- Runs the encode_special_chars script that sets the connection parameters to the necessary format (e.g. replacing “@’ with “%40”).
- Executes the DockerFile to build the image, which involves installing the “apache-airflow-providers-trino” package and setting up the Galaxy connection (These variables are used within the Docker-Compose.yaml file to instantiate a connection to Starburst Galaxy – see line 75 here.)
- Deploys the necessary Docker containers based off the docker-compose file
8. Access the Airflow UI by navigating to “`localhost:8080“` in your browser and login using “airflow” as the username and password.
Step Two: Run a DAG (Directed Acyclic Graph)
- From the Airflow UI homescreen, you should see a single DAG titled “starburst-galaxy-example”.
2. Click the “play” button on the right-hand side of the screen to trigger the DAG
This DAG, which is a collection of individual tasks and their dependencies, will execute the following set of tasks by leveraging three different operators.
- Task 1 (select_star) uses the TrinoOperator to execute a SQL select statement. It counts the number of records in the “tpch.tiny.customer” table and stores the result.
- Task 2 (print_number) is a PythonOperator that calls the print_command method. It retrieves the return value from Task 1 and prints it to the logs.
- Task 3 (data_validation_check) is an SQLColumnCheckOperator that performs a data quality check. It verifies that the distinct values in the “custkey” column of the “tpch.tiny.customer” table are equal to 1500.
- Click the corresponding box and selecting “Log” from the top menu to see the logs for each task
- Check out the logs for the data_validation_check task to verify that our Data Quality check has completed successfully.
Get started with Airflow and Starburst Galaxy
In this example, we executed a single DAG that executed a series of different tasks. This involved using the Starburst Galaxy engine to query data, printing the number of records returned in the query (using python) and finally execute a data quality check via the SQLColumnCheckOperator.
Hopefully you see how easily you can combine fault-tolerant functionality in Starburst Galaxy, with Python scripting and data quality check operators in Airflow to create a complete data solution. I encourage you to try it out for yourself, and see how powerful Airflow and Starburst Galaxy can be for ETL/ELT workloads.
Try Starburst Galaxy today
The analytics platform for your data lake