Apache Airflow

By integrating Apache Airflow with Starburst Data, you can leverage the powerful scheduling, monitoring, and task execution capabilities of Airflow while utilizing the distributed SQL query capabilities of Starburst Data for efficient data processing.

Airbnb developed Airflow in 2014 to automate ETL workloads. Now an Apache Software Foundation top-level project, Airflow provides the functionality companies of any size need to make data workflows more maintainable than conventional ETL pipelines.

This guide will introduce Apache Airflow and explain how pairing Airflow with Trino streamlines ingestion and other use cases.

How ETL Fails

Pipeline development and maintenance rank among data engineering’s most expensive and resource-intensive activities. Even worse, the pipeline development process is a significant source of friction that delays time to insight and slows critical business decisions.

Analytics data may be centralized, but data architecture is not

Decision-makers and their analysts rely on a central data team to populate analytics systems like Snowflake with the data they need for their analyses. End users don’t know how to develop complex queries or understand the nuances of disparate data sources. However, the data teams do not own the sources providing that data — that responsibility falls within the business domains.

Inflexible pipeline development processes

When an analyst needs to answer an important business question, they must submit a request to the data team. That request will compete with all the other requests in the data team’s backlog. Once the request emerges from the queue, a data engineer works with the analyst to understand their requirements, such as the kind of data, formats, and quality. The engineer identifies appropriate data sources through exploration and discovery, developing a data model to guide the pipeline’s design. Running the completed pipeline ingests and transforms the data before loading it into the data warehouse, ready for the analyst’s use.

Even in the best-case scenario, this process takes weeks — months for complex workflows. But all too often, things don’t go as planned. For example, these pipelines run in fault-sensitive batch processes. Unanticipated issues with the source data, memory constraints, or network interruptions can trigger fatal errors that shut down the process. The entire process must run again, even if it is 99% complete. That may not happen immediately since other data workloads compete for constrained compute resources.

Should everything work perfectly, there’s a good chance that the analyst will take one look at the final dataset and realize that it wasn’t what they wanted. Now, they have to submit another request, which, due to dependencies and other complicating factors, could be as resource-intensive as the original request.

Impacts of ETL brittleness

The brittle nature of ETL-based data processing is an immense source of frustration for everyone involved. Engineers must bear the burden of constant demands on their time. Analysts chafe at the delays in their iterative data explorations. Most importantly, decision-makers get increasingly irritated when they can’t get the analyses they need to run the business.

Inflexible pipeline development processes are also exceedingly expensive. ETL pipelines are compute- and storage-intensive. Extraction copies data from sources to staging areas where transformation occurs. This pre-ingestion processing typically takes place within the data warehouse’s high-priced storage and compute resources. Costs increase every time the pipeline fails or gets tweaked to deliver “better” data.

Traditional Airflow architecture

An Airflow architecture streamlines pipeline management by automating much of the otherwise manual work of building, running, and monitoring pipelines.

Code-based framework

Airbnb used “configuration as code” as the guiding principle behind Airflow’s development. Python scripts define workflows, allowing data teams to develop pipelines with DevOps practices, including:

  • Shared ownership
  • Unit testing
  • Version control
  • Automation
  • Extensibility
  • Reusability

Extensibility and reusability are particular strengths. Because Airflow is open source and based on a widely known programming language, the community has created a large library of provider packages and plugins. These robust integrations let data teams connect Airflow pipelines with virtually any enterprise data source and destination. Templating lets data teams maintain internal libraries that recycle custom code and further shorten pipeline development cycles.

Airflow’s Directed Acyclic Graphs

Airflow Directed Acyclic Graphs (DAGs) manage workflow orchestration. DAGs are Python-based definition files describing a workflow’s tasks, relationships, and dependencies. Tasks can be predefined templates, called Operators, or event-based triggers, called Sensors. Airflow also lets you package Python code as a custom task.

A task’s definition includes its relationships to upstream and downstream tasks in the Airflow workflow. Tasks do not have to depend solely on the success of their upstream tasks, which allows Airflow DAGs to support branching and conditional behavior.

Although engineers can run DAGs manually, Airflow’s advantages come from treating DAGs as code. DAGs can run on a schedule defined within the DAG itself or be triggered programmatically through Airflow’s API.
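A minimal sketch of such a definition file appears below, assuming a recent Airflow 2.x release; the task names and commands are purely illustrative. Each step is declared as an Operator instance, and the >> operator expresses the upstream/downstream relationships.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # Placeholder for real transformation logic
    print("transforming extracted data")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting source data'")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo 'loading into the warehouse'")

    # Declare the upstream/downstream relationships
    extract >> transform_task >> load
```

Because the DAG is ordinary Python, it can live in version control, be unit tested, and be generated or parameterized programmatically like any other code.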

Airflow user interface

Airflow includes a web interface that lets data teams monitor their data pipelines. The UI can present many views of an Airflow architecture, including:

  • DAG listing with summary statistics
  • Status of DAGs running in the same cluster
  • Datasets and their relationships with DAGs
  • Visualizations and drill-downs of DAG activity

Logging and alerting features let data teams monitor the health of their pipelines, rapidly surfacing errors and letting engineers resolve issues quickly.

Open-source Apache Airflow use cases

The Airflow project’s annual community survey found an overwhelming majority of organizations use Airflow to manage ETL and ELT pipelines, both for analytics and operations. A significant number rely on the framework to support data science and machine learning operations. Increasingly, companies have found Airflow useful for spinning up and spinning down infrastructure, including Kubernetes and Spark clusters. Airflow 2.7 supports this last use case with dedicated setup and teardown task types.
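As a rough illustration of those setup and teardown tasks (assuming Airflow 2.7 or later; the cluster provisioning logic here is a hypothetical placeholder):

```python
from datetime import datetime

from airflow.decorators import dag, task, setup, teardown


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def ephemeral_cluster_pipeline():
    @setup
    def create_cluster():
        # Hypothetical placeholder: provision a short-lived Spark or Kubernetes cluster
        print("provisioning cluster")

    @task
    def run_job():
        # The actual work that needs the cluster
        print("running job on the cluster")

    @teardown
    def delete_cluster():
        # Teardown tasks run even if run_job fails, so the cluster is always cleaned up
        print("tearing down cluster")

    create_cluster() >> run_job() >> delete_cluster()


ephemeral_cluster_pipeline()
```

Because delete_cluster is marked as a teardown, it runs whether or not run_job succeeds, so short-lived infrastructure does not linger after a failed run.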

Traditional Airflow workflows

Airflow is designed for scalability. Engineers can run an instance on their laptops. Organizations with larger-scale use cases can deploy Airflow to their cloud infrastructure running on AWS, Microsoft Azure, or Google Cloud.

As workflows get more complex, Airflow provides several ways to orchestrate a pipeline’s tasks. Rather than running one long, fault-sensitive sequence of queries, you can break the workflow into a series of smaller, independent tasks. That approach only works when you already know, at design time, how to split the work up. Dynamic task mapping removes that constraint by using runtime information to create tasks.
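Here is a rough sketch of dynamic task mapping; the partition list is hardcoded purely for illustration, where a real pipeline might pull it from a metastore query or an object storage listing.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def dynamic_partition_pipeline():
    @task
    def list_partitions() -> list[str]:
        # Illustrative only: in practice this might query a metastore
        # or list objects in cloud storage.
        return ["region=us", "region=eu", "region=apac"]

    @task
    def process_partition(partition: str):
        print(f"processing {partition}")

    # expand() creates one mapped task instance per partition at runtime.
    process_partition.expand(partition=list_partitions())


dynamic_partition_pipeline()
```

Because expand() builds the mapped tasks at runtime, the DAG scales to however many partitions the first task returns, and a failure in one mapped instance can be retried without rerunning the others.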

Modern Airflow architecture: Apache Airflow with Trino

A modern approach to pipeline management pairs Airflow with the distributed SQL query engine Trino. This open-source query engine lets organizations unify their data architectures, giving engineers, data scientists, and end users a holistic view of all enterprise data.

Federation and abstraction

Centralized repositories like data warehouses never lived up to their promise of becoming the single source of truth for enterprise data. Companies generate too much data in too many formats for that to ever be possible. Trino’s connectors provide access to over fifty enterprise data sources, from relational databases to streaming systems to data warehouses.

Acting as an abstraction layer, Trino unifies these disparate sources within a single interface. Rather than aspiring to a single source of truth, companies use Trino to create a single point of access for all data, no matter where it lives.

Self-service accessibility

Trino queries use ANSI-standard SQL. Freed from proprietary or obscure technologies, data consumers with SQL skills can query any source on their own. Non-technical users get similar access through visualization tools like Tableau.

This self-service model unchains data teams from the request queue. Trino lets engineers devote their time to supporting higher-value data projects. Standard SQL also helps engineers develop more efficient pipelines since a single Trino query can access data from multiple sources.

Cost and performance

Trino decouples compute from storage to optimize performance and cost. For example, using Trino to develop ingestion pipelines can eliminate the need to land data in temporary staging areas. In-memory processing, query pushdown, and other features let queries run more efficiently at the source and reduce the amount of data in motion.

Trino pipelines with Airflow

Despite its advantages, Trino is not a complete stand-alone solution for pipeline development. The larger and more complex the workflows these batch processes require, the more effort and resources data teams must spend addressing dependencies, sequencing, and the custom processes needed to monitor pipelines. Meeting these requirements depends on engineers with highly specialized skills.

Airflow’s strengths are a perfect complement to Trino. Airflow workflows handle scheduling, task execution, and monitoring. Trino’s query engine efficiently processes queries across a federated architecture. And all of this can be implemented programmatically thanks to Trino’s API and SQL support.
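A minimal sketch of such a pipeline appears below. It assumes the apache-airflow-providers-trino and apache-airflow-providers-common-sql packages are installed, that an Airflow connection named trino_default points at a Trino cluster, and that the catalog, schema, and table names are stand-ins for your own.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="trino_daily_orders",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Create the target table once; Trino skips the statement if it already exists.
    create_target = SQLExecuteQueryOperator(
        task_id="create_target_table",
        conn_id="trino_default",  # assumed Airflow connection to a Trino cluster
        sql="""
            CREATE TABLE IF NOT EXISTS lake.analytics.daily_orders (
                order_id BIGINT,
                customer_region VARCHAR,
                order_date DATE
            )
        """,
    )

    # Federated insert: joins a PostgreSQL catalog with a MySQL catalog in one query.
    load_orders = SQLExecuteQueryOperator(
        task_id="load_daily_orders",
        conn_id="trino_default",
        sql="""
            INSERT INTO lake.analytics.daily_orders
            SELECT o.order_id, c.region, o.order_date
            FROM postgres.public.orders AS o
            JOIN mysql.crm.customers AS c ON o.customer_id = c.id
            WHERE o.order_date = DATE '{{ ds }}'
        """,
    )

    create_target >> load_orders
```

Because every statement runs through Trino, the INSERT federates the two source catalogs in a single query with no intermediate staging area, while Airflow supplies the scheduling, retries, and monitoring around it.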

Apache Airflow tutorial: DAGs with Python

Starburst offers enterprise-grade and fully managed distributions of Trino that further streamline pipeline management with Airflow. Advanced optimizers improve Trino’s performance and cost-effectiveness. Starburst’s universal discovery, governance, and sharing layer lets data teams manage all data assets through a single pane of glass.

With the introduction of fault-tolerant execution, Starburst’s Trino-based data pipelines no longer suffer from the memory limits that could disrupt batch processes. The query engine can recover from individual task failures and resume without missing a beat. Fault tolerance reduces end-to-end latency and allows large queries to run with fewer compute resources.

Check out this tutorial to see how Starburst pairs with Airflow to create resilient data pipelines.