Using Trino for Your Data Transformations

Published: August 7, 2024

While you may be used to performing data transformations in Python or Scala, you can achieve the same results with SQL, using Trino as your engine.

Many still think of Trino strictly as an ad-hoc analytics engine, but companies large and small utilize Trino as their ETL engine of choice. This blog will walk you through the history of Trino and how you can use SQL and Trino for your data transformations.

History of Trino

In the early 2000s, the availability of large quantities of data meant that companies had access to previously unimaginable insights, if they could harness the power within that data. But it also meant they had to figure out how to query unprecedented volumes of data. Hadoop was created to solve this problem, giving data teams a way to store and process big data at scale.

SQL is widely accepted as one of the most familiar languages in the data community, so in 2008, Facebook created Hive to query petabytes of data in Hadoop through a SQL-like interface. Data consumers gained the ability to run queries and derive insights quickly. However, Hive was notably slow, and users were limited in the number of queries they could run each day.

Data teams needed a query engine that was distributed, fast, and performant. In 2012, Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang created Trino (formerly Presto) at Facebook as an alternative ANSI SQL-based query engine to Hive. Trino was originally designed for running interactive queries, but over time this use has expanded through the service provider interface (SPI).

By implementing the SPI, Trino can connect to any data source and perform operations on it using the same internal standard operations. Now, an engine originally built for interactive analytics on a petabyte-scale data lake can also query multiple data sources at the same time: Trino can join data from MySQL, Hive, Cassandra, and more using ANSI SQL.
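
For example, a federated join in Trino might look like the following sketch. The catalog, schema, and table names (mysql.crm.customers, hive.sales.orders) are hypothetical placeholders for connectors you would configure yourself:

    -- Join operational data in MySQL against historical data in the lake,
    -- all in one ANSI SQL query (catalog and table names are hypothetical)
    SELECT c.customer_name,
           sum(o.order_total) AS lifetime_spend
    FROM mysql.crm.customers AS c
    JOIN hive.sales.orders AS o
      ON c.customer_id = o.customer_id
    GROUP BY c.customer_name;

Because each catalog maps to a connector built on the SPI, the same query syntax works regardless of where the underlying data lives.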

While Trino may have gotten its start as an interactive query engine, it was quickly put to work on multiple types of workloads, including data transformations and ETL. In 2014, users started scheduling batch/ETL queries with Trino instead of Hive, and only four years later, 50% of ETL workloads and 85% of new workloads were on Trino. Trino rose in popularity among users as the single query engine serving both interactive and batch workloads, returning query times much faster than previously seen on Hive, while still capitalizing on a SQL-based interface.

Not only were the teams at Facebook utilizing Trino, but other companies started to catch on to the hype, including Salesforce and Lyft.

Trino’s evolving architecture 

Trino’s massively parallel processing (MPP) architecture was built for speed. Tasks were interconnected, resulting in all-or-nothing query execution: if any part of your query failed, you were forced to restart it from the beginning, taking up valuable time and resources.

There were two main issues that plagued long-running queries:

  1. Running into the memory wall: if your dataset grew too large, you hit capacity constraints and the query could fail.
  2. MPP architecture: Trino’s original architecture split queries into multiple stages, each with multiple tasks. Because the tasks are all interconnected, they need to be running at the same time and all the intermediate data has to be held in memory. This makes execution fragile and the memory load significant, and if any task fails, the whole query must be restarted.

To solve this problem, Trino introduced fault-tolerant execution mode (FTE). Contributors re-architected Trino to introduce a buffer service that stores intermediate data from tasks in case of failure. This means a query can retry a single task in each stage instead of restarting from the beginning. Fault-tolerant execution also reduces end-to-end latency for scheduled jobs, since failed queries don’t start from scratch, and lets you run larger queries with fewer resources.
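
As a minimal sketch, enabling fault-tolerant execution is a matter of cluster configuration. The property names below follow the Trino documentation; the S3 bucket is a hypothetical placeholder for wherever you choose to spool intermediate data:

    # config.properties (coordinator and workers): retry failed tasks
    # individually rather than restarting the whole query
    retry-policy=TASK

    # exchange-manager.properties: spool intermediate task data to
    # durable storage so it survives task failures
    exchange-manager.name=filesystem
    exchange.base-directories=s3://my-exchange-spooling-bucket

Setting retry-policy=QUERY instead retries the entire query on failure, which suits shorter interactive workloads; TASK is the mode aimed at large batch and ETL jobs.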

History of SQL transformations

At the beginning of our big data struggles, Hive was created because data users wanted a SQL-based interface to access data. It lowered the barrier to entry and allowed as many people as possible to use the data, which is also why many data transformation tools choose SQL as their language, including the popular modern data stack transformation tool, dbt. As evidenced by the over 65,000 members of the dbt community Slack, there is still a great desire to base data transformation tools on SQL. Like the Trino community, dbt users unite around SQL, so much so that it is one of the company’s main pillars. Trino and dbt are often used together to apply a modular approach to building data pipelines in SQL.

By using Trino and dbt together for your data transformations, all SQL-based, you combine the best of both worlds. For example, Yello re-architected its entire data platform to use Trino, dbt, S3, and Apache Iceberg for all of its data transformations on its data lake.
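
To make the pairing concrete, here is a minimal sketch of a dbt model running on Trino via the dbt-trino adapter. The file path, model, and source names are hypothetical, and the SQL inside is ordinary Trino SQL:

    -- models/staging/stg_orders.sql (hypothetical dbt model)
    -- dbt materializes this query as a table in your Trino catalog
    {{ config(materialized='table') }}

    SELECT
        order_id,
        customer_id,
        CAST(order_ts AS date) AS order_date,
        order_total
    FROM {{ source('lake', 'raw_orders') }}  -- assumes a source defined in your dbt project
    WHERE order_total > 0

dbt handles the dependency graph, testing, and documentation, while Trino executes the compiled SQL against the lake.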

Spark versus Trino

A big debate among data engineers is whether to use Spark or Trino for designated workloads, specifically data transformations or long-running queries. With the implementation of fault-tolerant execution in the Trino engine, the old characterization of Trino as strictly a fast, interactive query engine no longer holds. The question now becomes which language you want to base your data transformations in, and how complicated those transformations are expected to be. With SQL functionality being added to Trino regularly, you are likely to find the transformation operation you’re looking for.
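
For instance, a typical batch transformation can be expressed as a single Trino query that reads raw data, cleans and aggregates it, and writes the result back to the lake. This sketch uses hypothetical catalog, schema, and table names:

    -- Aggregate raw orders into a daily revenue table on the lake
    CREATE TABLE hive.analytics.daily_revenue AS
    SELECT
        CAST(order_ts AS date) AS order_date,
        sum(order_total)       AS revenue,
        count(*)               AS order_count
    FROM hive.raw.orders
    WHERE order_status = 'COMPLETED'
    GROUP BY CAST(order_ts AS date);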

Still, how do you decide? Shift the question from which engine to which language you want to use for the job at hand, and consider the following constraints:

  • What is the use case?
  • Who is accessing the data? How are they doing so?
  • Which language do I want to use?

If you are interested in using SQL, then Trino would be a great engine! 

Summary

SQL-based technology is extremely important in today’s data landscape because it is easily accessible to most data engineers. Trino, a SQL-based query engine, was created by Facebook engineers in 2012 to overcome the limitations of Hive in handling large data volumes. Despite early architectural challenges, Trino’s introduction of fault-tolerant execution mode allowed for more efficient handling of query failures, significantly improving job completion rates and resource utilization. Paired with dbt, a SQL transformation tool, Trino can access different sources from a single SQL query while running performant SQL transformations. By utilizing Trino and dbt, you can build efficient and scalable data transformation architectures. In this lab, Monica Miller shows you how to build out a SQL-based data transformation pipeline using Trino on data living in your data lake.
