Trino for Large-Scale ETL @ Lyft

  • Charles Song, Senior Software Engineer, Lyft

  • Ritesh Varyani, Senior Software Engineer, Lyft

Lyft operates one of the largest transportation networks in the world. A business like ours depends on data on so many levels. Data relating to rides and routes, bike and scooter rentals, the 18M+ annual users of our ridesharing service, and much more are stored in several different systems, including AWS S3, MySQL, PostgreSQL, and the surprisingly popular Google Sheets. In fact, our analysts love spreadsheets. We are part of the Data Platform team, and in that role we support all data inquiries for internal teams.

At Lyft, Trino has proven to be a fantastic tool for smaller, ad hoc queries. But in this post, we’d like to explain how and why we have begun using Trino for large-scale ETL.

Trino @ Lyft Environment

Trino supports all of our data analytics user interfaces at Lyft, including Tableau, Airflow, Mode, and others. We run Mozart Gateway as the master gateway for every query; queries then move into a proxy tool that we developed internally and have since open sourced. Running at this scale is very complex, with varied use cases, users, and teams, so we needed one intelligent proxy sitting in front of our Trino environment and orchestrating where each query is routed.
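
As a rough illustration of what such a proxy does, here is a minimal routing sketch in Python. The cluster URLs, sources, and routing rule are hypothetical, and the actual gateway logic at Lyft is considerably more involved.

    # Minimal sketch of workload-aware query routing in front of multiple
    # Trino clusters. Cluster URLs, sources, and the routing rule are
    # hypothetical, for illustration only.
    from dataclasses import dataclass

    @dataclass
    class QueryContext:
        user: str
        source: str   # e.g. "airflow", "mode", "tableau"
        is_etl: bool

    # Hypothetical cluster pools, keyed by workload type.
    CLUSTERS = {
        "adhoc": ["https://trino-adhoc-1.example.com"],
        "etl":   ["https://trino-etl-1.example.com",
                  "https://trino-etl-2.example.com"],
    }

    def route(ctx: QueryContext) -> str:
        """Pick a backend cluster pool based on where the query came from."""
        pool = "etl" if ctx.is_etl or ctx.source == "airflow" else "adhoc"
        # A real proxy would load-balance and health-check; we take the first.
        return CLUSTERS[pool][0]

    print(route(QueryContext(user="analyst", source="mode", is_etl=False)))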

On average, we see:

  • 250,000 queries per day
  • 10PB of daily read data
  • 100TB of daily write data
  • 750 total AWS EC2 instances

In terms of our Trino deployment, we have experienced some difficulty with the noisy neighbor problem, bottlenecks, and the sheer number of new releases (39 in the 12 months prior to November ’22!), among other challenges. Ultimately, though, Trino works really well for Lyft, which is why we keep investing in it and why we started looking to expand its role.

ETL & Trino

Trino was built for interactive queries, but we had been looking to improve our ETL jobs and thought Trino might be a good fit. Our job on the Data Platform team is to make sure everything runs reliably and efficiently at scale, and we felt that with some effort on our part we could improve ETL performance, reliability, and cost efficiency by switching from Spark to Trino.

As we started advancing our Trino ETL initiative at Lyft, we outlined a few priorities:

Cost

We wanted to make sure that as we used the larger clusters needed for ETL, we were doing everything possible to bring costs down. One of the steps we took was to move Trino to AWS Graviton instances, which are designed to optimize price-performance, and the switch yielded roughly 10% in hosting savings.

Performance

Our data engineers wanted better performance and faster development and turnaround times; there was an internal push to deprecate Hive and move to Trino to make our workloads run faster.

Resiliency

Resiliency is very important to us on the Data Platform team, so we built it into the orchestration layer for ETL.

For the Trino section of our analytics architecture, we separated jobs into four basic use cases and set up the system to prevent them from interfering with one another:

  • production ETL pipelines
  • high-priority work
  • ETL backfill
  • ETL testing and DAG development

Every DAG gets its own resource group, we use weighted tiering to uplift high-priority pipelines (see the sketch below), and we follow best practices relating to broadcast joins, query sharding, and scaling writers for ETL.
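
As a rough illustration of this isolation, here is a sketch of a Trino resource-groups definition, written as a Python dict and dumped to the JSON file Trino reads with its file-based resource group manager. The group names, memory limits, concurrency limits, and weights below are hypothetical, not our production values.

    import json

    # Sketch of a resource-groups definition mirroring the four use cases.
    # All names, limits, and weights here are hypothetical.
    resource_groups = {
        "rootGroups": [{
            "name": "etl",
            "softMemoryLimit": "80%",
            "maxQueued": 1000,
            "hardConcurrencyLimit": 100,
            "schedulingPolicy": "weighted",
            "subGroups": [
                # A higher schedulingWeight uplifts high-priority pipelines.
                {"name": "high-priority", "softMemoryLimit": "40%",
                 "maxQueued": 200, "hardConcurrencyLimit": 40,
                 "schedulingWeight": 10},
                {"name": "production", "softMemoryLimit": "40%",
                 "maxQueued": 500, "hardConcurrencyLimit": 40,
                 "schedulingWeight": 5},
                {"name": "backfill", "softMemoryLimit": "20%",
                 "maxQueued": 200, "hardConcurrencyLimit": 10,
                 "schedulingWeight": 1},
                {"name": "dev", "softMemoryLimit": "10%",
                 "maxQueued": 100, "hardConcurrencyLimit": 5,
                 "schedulingWeight": 1},
            ],
        }],
        "selectors": [
            # Map queries to groups by their source tag; a query submitted
            # with source "airflow-high-foo" lands in etl.high-priority.
            {"source": "airflow-high-.*", "group": "etl.high-priority"},
            {"source": "airflow-.*",      "group": "etl.production"},
        ],
    }

    with open("resource-groups.json", "w") as f:
        json.dump(resource_groups, f, indent=2)

With schedulingPolicy set to weighted, Trino admits queued queries from subgroups in proportion to their schedulingWeight, which is the mechanism behind the high-priority uplift; per-DAG isolation would add a subgroup per DAG on top of this.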

So, is this new approach working?

Trino for Large-Scale ETL Results

Since making Trino ETL generally available in July 2022, we're seeing 2.5PB of daily read data, 60TB of daily write data, 480 unique DAGs, and 60,000 queries daily across different ETL clusters. We successfully migrated our ETL from Hive to Trino, although some of this work had to be done manually, and we've seen a huge reduction in ETL DAG runtimes, ranging from 30% to 90%. Although the reduction in query times depends on cluster size, query shaping, and other factors, our best-case production ETL job runtime dropped from 5 hours to 20 minutes.

That is a huge win for our team.

Autoscaling has helped us, especially with ETL use cases, as it allows us to increase cluster capacity whenever we see utilization start to spike, then scale back down when appropriate. We hit a few challenges with the upgrade to Trino 365, but the community was very helpful, and our fixes have since been contributed back upstream. Looking ahead, we'd like to further improve reliability, explore sharding at the orchestration layer to break down massive queries (sketched below), and enable fault-tolerant execution by moving to Project Tardigrade.
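
To make the orchestration-layer sharding idea concrete, here is a minimal sketch that splits one massive backfill into per-day queries. The table names, partition column, and date range are hypothetical.

    from datetime import date, timedelta

    # Hypothetical sharding at the orchestration layer: instead of one massive
    # INSERT covering a whole month, emit one bounded query per day so each
    # shard is smaller, cheaper to retry, and less likely to hit cluster limits.
    def shard_by_day(start: date, end: date) -> list:
        queries = []
        day = start
        while day <= end:
            queries.append(
                "INSERT INTO analytics.rides_daily "
                "SELECT * FROM raw.rides "
                f"WHERE ds = DATE '{day.isoformat()}'"
            )
            day += timedelta(days=1)
        return queries

    # The orchestrator (an Airflow DAG, for example) would submit and retry
    # each shard independently.
    for q in shard_by_day(date(2022, 7, 1), date(2022, 7, 3)):
        print(q)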

There is some work ahead of us, but we are very pleased with our decision to start using Trino for ETL jobs at Lyft.