When Martin Traverso, Dain Sundstrom, and David Phillips set out to create Trino more than a decade ago, their goal was to replace Hive with a faster, more powerful interactive query engine. Anyone reading this blog knows they succeeded, and that Trino has become the query engine of choice at industry leaders like Netflix, Comcast, Lyft, and many others. However, Trino was not without its limitations, especially for batch and ETL workloads. Two years ago, frustrated by these limitations, Andrii Rosa, Lukasz Osipiuk, and Zebing Lin set out to re-architect the query engine and, in Martin’s words, give it superpowers.
At Datanova 2023, Martin recounted this effort and the thinking that went into the latest and greatest edition of the world’s fastest distributed query engine. Dain and David joined him and led the Q&A at the end of the session, so for those of you who missed their presentation, we thought we’d recap the highlights quickly here.
Running into the memory wall
Trino was designed for fast, interactive queries, but as more engineers began using it for batch and ETL workloads, the tool revealed a few limitations. One of these was the memory wall. If your dataset grew past a certain size, or you had capacity constraints in your compute cluster, the query might fail. So to get a query to run reliably, you had to either scale up your cluster, which was cost-prohibitive for many customers, or break the query up into smaller pieces.
The second issue was that queries had to run successfully all the way to completion, or else you had to restart the operation – and spend more on compute – to run them again. If a query failed two-thirds of the way through, there was no way to reuse the computation that had been performed up to the failure point. This wasn’t a huge problem with smaller queries, but as you scaled up, you risked wasting both resources and time on unexpected failures.
These challenges, along with a few other limitations, prompted Martin, Dain, and David to explore if there was a better way to architect the query engine.
MPP to FTE
Martin explains how Trino was originally based on a classic massively parallel processing (MPP) architecture, in which a planner splits a query into multiple stages, each of which includes multiple tasks. These tasks are all interconnected and need to be running at all times while the query is in progress, streaming data between one another in memory. That translates into fast, low-latency performance, and it’s part of what made Trino so successful in the early years.
The problem? Because all those tasks are interconnected, all of them must stay up for the duration of the query, and all the intermediate data has to be held in memory. That makes the operation fragile at scale: the memory load is significant, and if any one task fails, the whole query has to be restarted.
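To make the stage/task structure concrete, you can ask Trino to print a query’s distributed plan: each fragment in the output corresponds to a stage, and each stage runs as a set of parallel tasks across the workers. A minimal sketch, assuming the built-in TPC-H connector is mounted as the `tpch` catalog:

```sql
-- EXPLAIN (TYPE DISTRIBUTED) prints the fragments (stages) the planner
-- produces; each fragment executes as many parallel tasks on the workers.
-- Assumes the TPC-H connector is available as the "tpch" catalog.
EXPLAIN (TYPE DISTRIBUTED)
SELECT o.orderpriority, count(*) AS order_count
FROM tpch.tiny.orders o
JOIN tpch.tiny.lineitem l ON o.orderkey = l.orderkey
GROUP BY o.orderpriority;
```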
So as Martin explains it, the team developed a basic wishlist for what they wanted out of their revamped architecture:
- Tolerate individual task failures: Instead of starting over when a node failed, they wanted the engine to continue processing, recover, and re-use computation that had already been performed.
- Reliably run queries of any size: The working dataset had to fit in memory in the initial Trino engine, but they wanted to be able to stage a query and run it in chunks so it could be scaled up or down according to the user’s needs.
- More flexible resource management: Ideally, they wanted the newly architected Trino engine to be flexible enough to allow you to bump high-priority queries to the front of the queue without canceling others.
- Adapt query execution dynamically midstream: Another goal was to equip Trino with the ability to adjust according to different running conditions and optimize its execution strategy midstream to get the best possible result.
What they developed in response to the different challenges and technical goals – not all of which are detailed here – was a novel architecture based on fault-tolerant execution.
The basic idea is built right into the name. You don’t need to restart the query if a single task fails. Intermediate results are spooled to durable storage, so any task can stop and restart without needing to reproduce previous results, which reduces wasted computation. This cuts end-to-end latency, too, since you don’t need to restart from scratch each time. And since you no longer need to hold all the data in memory, you can run large queries with fewer resources.
For a more in-depth explanation and Q&A, we recommend checking out the Datanova session.
Get started with fault-tolerant execution mode
The easiest way to try out Trino’s new fault-tolerant architecture is via Starburst Galaxy. Simply create a free cluster, select “Fault Tolerant” execution mode, and start querying.
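If you run Trino yourself instead, fault-tolerant execution is enabled through cluster configuration. Here’s a minimal sketch based on the properties documented in Trino (the S3 bucket name is a placeholder – intermediate data needs a spooling location that all workers can reach):

```properties
# etc/config.properties
# TASK retries individual tasks; QUERY retries the whole query on failure.
retry-policy=TASK

# etc/exchange-manager.properties
# Spooling location for intermediate (exchange) data, so restarted tasks
# can reuse results already produced. The bucket name is a placeholder.
exchange-manager.name=filesystem
exchange.base-directories=s3://example-spooling-bucket
```

With `retry-policy=TASK`, an exchange manager is required so that intermediate results survive individual task failures.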