Fault Tolerance
Fault-tolerant systems are more resilient in the face of errors, outages, or system failures. Hardware fault tolerance uses backup systems, redundant components, and other techniques to prevent downtime from events such as power outages and ensure business continuity.
Fault-tolerant software is particularly important in data management. When batch processing petabyte-scale data sets through extract, transform, and load (ETL) pipelines, one small error can break a processing run and force engineering teams to re-run the pipeline at great expense.
This article will discuss how the massively-parallel processing query engine Trino evolved to support ETL pipeline development with fault-tolerant execution.
Fault tolerance vs. high availability
Fault tolerance contributes to the performance of high-availability systems such as transactional databases, operating systems, and network infrastructure. Fault tolerance techniques like fault detection and mitigation prevent unscheduled downtime and keep these critical systems available.
Is fault tolerance good or bad?
Within the proper context, fault-tolerant designs are worth the investment. They prevent loss of service, or at least deliver a graceful degradation that minimizes business impact. However, these techniques may impose performance penalties that aren't worth the reduced failure risk.
Consider a business intelligence analyst writing queries for an ad hoc analysis. A failed query is inconsequential. The analyst simply corrects the error and re-runs the query. Even with the enormous datasets used to set up a machine learning project, re-running a failed query, while inconvenient, may only delay the project by a day.
However, enterprises build data-driven business cultures based on modern data lakehouse architectures. These systems depend on data pipelines moving vast amounts of data to support business decision-making as well as advanced analytics like the development of machine learning algorithms. Query engine failures take on much greater significance.
Challenges of Trino without fault tolerance
Trino’s open-source, massively parallel query engine allows organizations to federate their data sources and run fast, efficient queries at petabyte scale. At the same time, Trino’s foundation of ANSI-standard SQL makes data accessible to a broader range of users.
Data engineers quickly learned that using Trino to build their ETL data pipelines greatly simplified their work. For instance, using standard SQL rather than proprietary interpretations expands the available talent pool, can replace complex HiveQL implementations, and reduces development lifecycles.
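For illustration, a federated ETL step of this kind can be expressed entirely in ANSI-standard SQL. The catalog, schema, and table names below are hypothetical:

```sql
-- Hypothetical example: join a Hive-backed fact table with a
-- PostgreSQL dimension table and write the result to an Iceberg
-- table, all in one standard SQL statement.
CREATE TABLE iceberg.analytics.daily_sales AS
SELECT
    o.order_date,
    c.region,
    sum(o.amount) AS total_sales
FROM hive.raw.orders AS o
JOIN postgresql.crm.customers AS c
    ON o.customer_id = c.customer_id
GROUP BY o.order_date, c.region;
```

Because each source appears as an ordinary catalog, the same statement that an analyst would write interactively can serve as a pipeline step, with no HiveQL or connector-specific dialect required.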
But those benefits came at a price. Trino was designed for high-performance analytics, not the slow but fault-tolerant computing requirements of ETL batch processing. This mismatch created three challenges:
Memory wall
Trino’s massively parallel processing architecture assigns query tasks to worker nodes, where the tasks get executed in parallel. Since the tasks are interconnected, their data must stay in memory while the query runs. This approach produces fast, low-latency queries.
However, Trino’s memory dependence becomes an issue when datasets outgrow a compute cluster’s resources or when the cluster runs into capacity constraints. Network errors are also problematic: a worker node that loses its connection misses the tasks the coordinator scheduled for it.
Any of these issues could arise when running extremely large ETL batch processes. Without fault tolerance, a single failure in one worker node would stop the batch process in its tracks.
Wasted resources upon query failure
When Trino-based batch processes failed, there was no way to use the completed work. The batch had to be re-run from the beginning, wasting compute resources that could go to other tasks. The wasted time could also have downstream impacts on software systems and decision-makers who depend on the pipeline’s critical data output.
Engineers had two equally unappealing ways to avoid this scenario. They could throw expensive compute resources at the problem to ensure clusters always had sufficient memory. Or they could design the pipeline as a series of smaller workloads, significantly increasing complexity, development time, and risk.
Resource management and concurrency
Resource management and the risk of query failure are even more complicated within a distributed data architecture that democratizes data access. Information resources are no longer under the exclusive control of a central data team, which means queries from multiple systems and users will be running concurrently.
Some of these queries may have to pause or stop should demand spike beyond a cluster’s planned capacity, increasing the risk of disruption to batch processes.
Why fault tolerance is good
In 2022, the Trino team introduced Project Tardigrade, its solution to the query engine’s sensitivity to resource-related faults. The project combines granular retries with resource-aware scheduling. By making Trino more fault-tolerant, Project Tardigrade improves batch processing workloads in three ways:
Improved reliability and availability: Trino-based batch processes are less vulnerable to resource constraints, supporting on-time data delivery to mission-critical applications.
Enhanced user experience: Data consumers and the decision-makers they support no longer experience service interruptions due to pipeline failures.
Improved data integrity: Consistent pipeline execution minimizes the data loss or delays that undermine data integrity.
How a fault-tolerant system improves query performance
Trino’s fault tolerance functionality is based on two elements: a buffer for the query’s work in progress and a resource-aware coordinator. When setting up fault-tolerant execution, engineers create an external bucket in their object storage system, whether Amazon S3, Azure Blob Storage, or Google Cloud Storage. Worker nodes pass data to this external buffer as they complete each task. A coordinator node monitors the cluster’s workers. Should something go wrong, a graceful failover process kicks in: the coordinator retires the failed worker and points a new worker at the buffered data.
Reliability: A worker node is no longer a single point of failure for the entire batch. The engine keeps processing from the point of failure, so batches are completed on time.
Scalability: Rather than staging queries to fit within available memory allocations, load-balancing processes scale queries up or down as needed.
Flexibility: Concurrent queries no longer compete; coordinators can prioritize queries without disrupting other workflows.
Optimization: Trino adapts to changing conditions by dynamically adjusting query execution.
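The mechanism described above can be sketched as a toy model (this is not Trino's actual implementation, just an illustration of the technique): every task persists its output to an external buffer, so when a worker is lost, only the failed task is rescheduled rather than the whole batch.

```python
def run_stage(tasks, buffer, max_attempts=3):
    """Run each task, persisting its output to an external buffer.

    If a task fails (simulating a lost worker), only that task is
    retried; results already in the buffer are never recomputed.
    """
    for task_id, work in tasks.items():
        if task_id in buffer:              # completed in a prior attempt
            continue
        for _ in range(max_attempts):
            try:
                buffer[task_id] = work()   # persist the result externally
                break
            except RuntimeError:
                continue                   # reschedule on a "new worker"
        else:
            raise RuntimeError(f"task {task_id} exhausted retries")
    return sum(buffer.values())


# Usage: one task fails on its first attempt, then succeeds on retry.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("worker lost")
    return 10

buffer = {}
result = run_stage({"t1": lambda: 1, "t2": flaky, "t3": lambda: 2}, buffer)
# result == 13; t2 ran twice, t1 and t3 ran once and were not redone
```

The key design point mirrors the article's description: because intermediate results live outside any single worker, a failure costs one task's work instead of the whole run.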
Fault tolerance techniques and high availability with Starburst
Starburst creates modern open data lakehouse architectures and federates enterprise storage within a single point of access for big data analytics. Founded by Trino’s original developers, Starburst combines Trino’s high-performance SQL query engine with governance, optimization, and accessibility enhancements in a scalable, enterprise-grade solution.
Enterprise connectors: An ecosystem of over fifty connectors links your storage infrastructure into a unified access layer.
Granular access control: Role-based and attribute-based access controls let you limit access at the schema, table, column, and row levels.
Universal search and discovery: Find data objects, explore catalogs, and automatically integrate new tables.
Workload optimization in Starburst
Trino’s fault tolerance upgrades, together with Starburst Warp Speed and query result caching, make up Starburst’s suite of workload optimization features.
Starburst Warp Speed uses smart indexing and caching to accelerate queries. Warp Speed evaluates new queries and automatically creates optimized indexes and caches and attaches them to each worker node. Since workers don’t have to look outside the cluster, they can run much faster.
Query result caching further boosts query performance by temporarily storing results for reference by subsequent queries. Retrieving rather than re-calculating these results saves considerable time and resources.
Implementing Starburst fault-tolerant execution
Fault-tolerant execution is not enabled by default, as most analytics queries do not need it. You can quickly implement the feature in Starburst Galaxy and the Starburst Enterprise Platform.
Starburst Galaxy is our fully managed, cloud-based analytics solution that provides a unified view of your storage infrastructure through an intuitive interface. Creating a fault-tolerant cluster in Starburst Galaxy is a simple four-step process:
- Click the Create cluster button.
- Add catalogs to the new cluster.
- Configure the cluster for your cloud storage provider.
- Select “Fault Tolerant” from the Execution Mode drop-down menu.
The Starburst Enterprise Platform is a self-managed solution for implementing a data lakehouse analytics platform in private clouds or data centers. Starburst uses the term “exchange storage” for the external buffer that enables fault tolerance. An exchange manager service on the coordinator and worker nodes transfers data between those nodes and the exchange storage.
Implementing fault-tolerant execution in the Starburst Enterprise Platform does not share Galaxy’s point-and-click user experience but is nonetheless a straightforward process:
- Configure the exchange storage (for example, an AWS S3 bucket).
- Configure exchange management for the coordinator and worker nodes.
- Configure fault-tolerant execution for the coordinator and worker nodes.
- If using Kubernetes, redeploy your Starburst instance. Otherwise, restart the cluster.
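As a rough sketch, the first three steps map to configuration files like the following. The property names come from Trino's fault-tolerant execution configuration; the bucket name is a placeholder, and the Starburst Enterprise documentation remains the authoritative reference for the exact settings in your deployment:

```properties
# exchange-manager.properties (coordinator and all workers)
# Points the exchange manager at the external buffer; the
# bucket name below is a placeholder.
exchange-manager.name=filesystem
exchange.base-directories=s3://example-exchange-bucket

# config.properties (coordinator and all workers)
# Enables fault-tolerant execution with task-level retries.
retry-policy=TASK
```

With `retry-policy=TASK`, individual failed tasks are retried from the buffered data; a `QUERY` policy instead retries the whole query, which suits clusters running many smaller queries.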