
A Better Solution For Managing and Maintaining Data Pipelines, Now In Public Preview




Yusuf Cattaneo
Solutions Architect
Starburst
A good data pipeline is one that you forget about. It operates quietly and reliably in the background. However, as any experienced data engineer knows, that’s rarely the case – especially if the pipeline was built with an early version of Trino.
Trino was originally designed for fast, interactive queries, but it struggled to support batch and ETL workloads because of memory constraints and its intolerance of individual task failures. When batch jobs did complete successfully, Trino was faster than traditional ETL engines like Spark, but its poor reliability made it more costly to run overall.
The introduction of fault-tolerant execution in Trino, and Starburst, removes this limitation – making data pipelines fast and reliable.
With fault-tolerant execution mode, you get advanced resource-aware task scheduling and granular retries at the task and query level that enable you to confidently create data pipelines.
When queries run on a fault-tolerant (FTE-enabled) cluster, query retries are made possible by intermediate exchange data that is spooled and can be reused by multiple worker nodes in the cluster. As a result, even a query that requires more memory than is currently available in the cluster can still succeed.
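Starburst Galaxy exposes fault-tolerant execution as a cluster-level option, so there is nothing to configure by hand. For context, on a self-managed Trino cluster the equivalent behavior is enabled with roughly the following properties (the spooling location shown here is a placeholder):

# config.properties: retry individual tasks instead of failing the whole query
retry-policy=TASK

# exchange-manager.properties: spool intermediate exchange data to object storage
exchange-manager.name=filesystem
exchange.base-directories=s3://example-spool-bucket/trino-exchange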
Paired with an orchestration tool like Apache Airflow, this lets you streamline the management and monitoring of your data pipelines.
This blog walks you through how to build resilient data pipelines with Airflow and Starburst Galaxy (the fastest and easiest way to use Trino), using Docker.
Make sure you have the following tools installed on your machine before proceeding: Git and Docker.
Installing Apache Airflow
1. Clone the git repository:
git clone https://github.com/YCat33/galaxy_airflow_quickstart.git
2. Change into the git repository directory:
cd galaxy_airflow_quickstart
3. Navigate to your Galaxy domain
4. Locate your connection variables by navigating to the Clusters page in the Galaxy UI
5. Click the “Connection info” button on the right to view your connection information
6. For the purposes of this demo, we are going to use the connection strings specific to “Python” (a sketch of how these values map onto the Trino Python client appears after these steps). Run the following command to make the bash setup script executable:
chmod +x setup.sh
7. Run the bash setup script below, passing in the connection variables from step 5:
./setup.sh '<host>' '<user>' '<password>'
This script performs the following steps:
8. Access the Airflow UI by navigating to localhost:8080 in your browser and log in using “airflow” as both the username and password.
9. Click the “play” button on the right-hand side of the screen to trigger the DAG
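For reference, the “Python” connection info from step 5 maps onto the Trino Python client roughly as shown below. This is a minimal sketch with placeholder values, not code taken from the quickstart repository:

# Minimal sketch: connect to Starburst Galaxy with the Trino Python client.
# <host>, <user> and <password> are the values from the “Connection info” dialog.
import trino

conn = trino.dbapi.connect(
    host="<host>",
    port=443,
    user="<user>",
    http_scheme="https",
    auth=trino.auth.BasicAuthentication("<user>", "<password>"),
)
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchall())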
This DAG, which is a collection of individual tasks and their dependencies, executes the following tasks using three different operators: querying data in Starburst Galaxy, printing the number of records the query returned (via a Python task), and running a data quality check with the SQLColumnCheckOperator.
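A minimal sketch of what such a DAG can look like is shown below. The connection ID, table, and task names are illustrative assumptions rather than the exact contents of the quickstart repository, and it assumes the Airflow Trino and common SQL providers are installed:

# Illustrative DAG sketch: names and connection IDs are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.common.sql.operators.sql import (
    SQLColumnCheckOperator,
    SQLExecuteQueryOperator,
)

GALAXY_CONN_ID = "trino_default"  # assumed Airflow connection pointing at Galaxy


def print_record_count(ti):
    # Pull the rows the query task pushed to XCom (the operator's default) and count them.
    rows = ti.xcom_pull(task_ids="query_galaxy")
    print(f"Query returned {len(rows)} records")


with DAG(
    dag_id="galaxy_pipeline_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    # Task 1: run a query against Starburst Galaxy.
    query_galaxy = SQLExecuteQueryOperator(
        task_id="query_galaxy",
        conn_id=GALAXY_CONN_ID,
        sql="SELECT * FROM tpch.sf1.nation",  # sample catalog used for illustration
    )

    # Task 2: print the number of records returned by the query.
    count_records = PythonOperator(
        task_id="count_records",
        python_callable=print_record_count,
    )

    # Task 3: run a data quality check against the same table.
    quality_check = SQLColumnCheckOperator(
        task_id="quality_check",
        conn_id=GALAXY_CONN_ID,
        table="tpch.sf1.nation",
        column_mapping={"nationkey": {"null_check": {"equal_to": 0}}},
    )

    query_galaxy >> count_records >> quality_check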
In this example, we executed a single DAG that ran a series of tasks: using the Starburst Galaxy engine to query data, printing the number of records returned by the query (using Python), and finally executing a data quality check via the SQLColumnCheckOperator.
Hopefully you can see how easily you can combine the fault-tolerant functionality in Starburst Galaxy with Python scripting and data quality check operators in Airflow to create a complete data solution. I encourage you to try it out for yourself and see how powerful Airflow and Starburst Galaxy can be for ETL/ELT workloads.