

The journey from Presto to Trino and Starburst
Today’s data solutions must turn raw data into actionable insights. Whether that data feeds analytics, machine learning (ML), or AI models, the need for it is both massive and growing. For context, we’re not just talking terabyte-scale needs: many companies are grappling with how to efficiently store, query, and analyze petabytes of data at a time.
A key part of solving this thorny problem is the SQL query engine, which uses Massively Parallel Processing (MPP) to query petabytes of data effectively. Developed originally at Facebook in 2012, Presto was one of the first highly successful SQL query engines. The project has evolved drastically over time, paving the way for the engine we now know as Trino.
In this article, we’ll look at how Presto directly led to Trino and how companies like Starburst leverage Trino today to make the vision of ridiculously scalable data a reality.
What is a SQL query engine?
Before we dive into the details, let’s take a step back and look at how the need for a SQL query engine arose in the first place.
Hive: The first big data SQL query engine
As data grew rapidly in the early 2000s, companies like Facebook moved from crunching their data in databases such as Oracle to processing it in Hadoop clusters. As their volumes grew from GB/day to TB/day, companies looked to Apache Hive to create data warehouses. This gave them a SQL-like interface to data stored in the Hadoop Distributed File System (HDFS) and cloud object storage providers.
As Facebook grew, so did its data. The data volume grew so large that it required multiple Hive clusters, and queries became slower, less predictable, and less reliable. Facebook needed a SQL query engine that could deliver fast results at massive scale.
Presto: V1 of the ridiculously scalable SQL query engine
This is where Presto comes in. Presto began at Facebook in 2012 as PrestoDB, an all-purpose SQL query engine for the enterprise. It was built to speed up queries over the roughly 300 petabytes of data the company had been querying with Apache Hive, after engineers started hitting Hive’s performance limits. By the fall of 2013, Presto was open sourced to the community, gaining popularity among top companies like LinkedIn, Netflix, and Teradata.
Six years later, in 2018, the creators of Presto left Facebook as the organization tightened control over the open-source project. The original project continues today under the name Presto, though without any of the original creators attached.
What is Trino?
So, what did Presto’s creators do? In 2019, they forked the project, creating what is now known as Trino. This began an exciting new chapter for parallel processing and the opportunity to continue innovating on “a SQL query engine that runs at ludicrous speed.”
At its base, Trino aimed to provide a faster, more interactive alternative to accessing data in HDFS via Hive. It also split metadata and storage access, enabling it to query an entire data ecosystem via ANSI SQL. From there, it continued to build out a host of other features and architectural improvements, such as support for Docker deployments, security improvements, and a wide variety of client tools and integrations.
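For instance, a single ANSI SQL statement can join data across entirely different systems. A minimal sketch, assuming you have configured a Hive-backed catalog and a PostgreSQL catalog (the catalog, schema, and table names below are hypothetical):

```sql
-- Join order data in a Hive-backed data lake with customer data in PostgreSQL.
-- "hive" and "postgresql" are hypothetical catalog names defined by the operator.
SELECT c.name, SUM(o.total) AS lifetime_value
FROM hive.sales.orders AS o
JOIN postgresql.crm.customers AS c
  ON o.customer_id = c.id
GROUP BY c.name;
```

Trino plans and executes the join itself, so neither source system needs to know about the other.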
Today, it’s managed by the Trino Software Foundation, a vibrant open source community. Since this transition, development on the project has accelerated dramatically compared to its predecessor. Many top companies, such as Netflix, Lyft, Stripe, Salesforce, and LinkedIn, made the leap to Trino. It’s also integrated into a wide variety of data platforms, including Amazon Elastic MapReduce (EMR), Amazon Athena, Starburst, and more.
How Presto and Trino both solve the Hive problem
Presto and Trino are both SQL query engines – an interface layer that provides a unified approach to interactively querying data in an enterprise. Through Massively Parallel Processing (MPP), they provide a single point of access for querying a large number of data sources using a well-known syntax (SQL), decomposing the query into parts, and running it in parallel across multiple nodes in the most efficient way possible.
To execute a query, users send SQL commands to a coordinator, which creates a query plan. The coordinator splits the query into tasks distributed across worker nodes, which execute them in parallel against the underlying data sources, holding intermediate results in memory.
A SQL query engine is not itself a data store. Rather, it connects to numerous underlying data sources, particularly the data lakes and data lakehouses built on Apache Iceberg, Delta Lake, Apache Hudi, or Apache Hive. It prevents data from becoming siloed, offers faster time to insight for making business decisions, and provides fast query performance over petabytes of data.
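Concretely, each data source is exposed to Trino through a catalog defined in a properties file. A minimal sketch for a PostgreSQL source, with placeholder connection details:

```properties
# etc/catalog/postgresql.properties (file name and values are illustrative)
connector.name=postgresql
connection-url=jdbc:postgresql://example.net:5432/mydb
connection-user=analytics
connection-password=secret
```

Once the catalog is registered, its tables are queryable alongside every other catalog using the `catalog.schema.table` naming convention.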
Trino: Presto perfected
Trino literally wouldn’t exist without Presto. So it’s no surprise that, given their shared history, there are a number of similarities between the two systems:
- Both are and continue to remain open-source projects
- Both aim to solve the problem of querying large, distributed data sets at scale with blazing-fast performance
- Both use SQL – the lingua franca of data – for ease of use and compatibility with other tools in your data stack
However, since their divergence, Trino has transcended Presto in several key areas, including:
- Pace of development
- Features
- Easy migration from Hive
Pace of development
Currently, Trino is a more dynamic open-source project, with development occurring at three times the pace of modern Presto. The pace isn’t slowing down, either. 2024 was the project’s most active year ever, with 30 new releases and 5,000+ additional commits.
Features
This active pace of development means Trino has eclipsed Presto in many key feature areas, including:
- Docker-based deployments to support deployment on container orchestration systems such as Kubernetes
- Fault-tolerant execution mode for handling batch and ETL/ELT jobs with high reliability. This reduces the need to pair Trino with Spark, as Trino now provides comparable reliability guarantees natively
- Table functions, which make it easier to write powerful queries or run syntax native to connectors. For example, you can add an anonymous or ephemeral table capability by defining a function that returns a table. Functions can even be polymorphic, returning different row types
- Dynamic filtering to speed up queries that contain JOINs
- Expanded SQL support for keywords like MERGE, as well as dozens of extra functions
- Variable-precision temporal types, with precision down to picoseconds. This is important for time-critical systems such as financial transaction processing
- Support for multiple open-table formats such as Apache Iceberg
- Load balancing via Trino Gateway
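To illustrate two of these features, here is a sketch using Trino’s built-in `sequence` table function and a MERGE statement (the catalog, schema, and table names in the MERGE are hypothetical):

```sql
-- A built-in table function that returns rows 1 through 5 as an ephemeral table.
SELECT * FROM TABLE(sequence(start => 1, stop => 5));

-- MERGE upserts incoming rows into a target table in a single statement.
MERGE INTO warehouse.inventory AS t
USING staging.inventory_updates AS s
  ON t.sku = s.sku
WHEN MATCHED THEN UPDATE SET quantity = s.quantity
WHEN NOT MATCHED THEN INSERT (sku, quantity) VALUES (s.sku, s.quantity);
```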
There’s also now a rich ecosystem built on top of Trino, as well as various add-ons, such as support for OpenTelemetry and Open Policy Agent.
Easy migration from Hive
Trino supports a built-in procedure for migrating tables from the Hive format to the Iceberg format. This simplifies moving key workloads onto a more modern and highly performant architecture. It can also boost query performance by up to 95% compared to Hive.
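As a sketch, the migration is invoked through the Iceberg connector’s `migrate` procedure; the catalog, schema, and table names below are placeholders:

```sql
-- Convert an existing Hive table in place to the Iceberg format.
-- "iceberg" is the name of an Iceberg catalog configured by the operator.
CALL iceberg.system.migrate('sales', 'orders');
```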
Starburst: Easy Trino
You can deploy Trino on top of a number of data architectures. However, deploying Trino and managing it as a highly available service can be a manual and complex endeavor.
Starburst makes Trino easy. It does this by making Trino:
- Easy to deploy and manage. Starburst provides enterprise-grade hosted Trino for turning a wide number of use cases into reality, including BI, data transformations, data-driven apps, and Generative AI (GenAI).
- Easy to run with high performance. Starburst combines the speed and power of Trino with the Apache Iceberg open table format. This Icehouse architecture provides numerous performance benefits over the Hive format while also providing out-of-the-box support for ACID properties. Additionally, Starburst adds improvements to Trino such as Warp Speed, our patented technology that can accelerate query performance up to 7x and reduce compute costs by as much as 40%.
- Easy to use. With the Icehouse, you don’t have to know how your data is stored to access it.
- Easy to integrate. The Icehouse is an open data lakehouse that is multi-vendor, cross-platform and supports centralized and distributed workloads. If you have an existing investment in Hive, for example, you can move as much or as little of your existing workload into Apache Iceberg as you see fit.
Trino + Starburst
Trino continues to lead the SQL query engine market, providing unparalleled speed for workloads of any size. Combined with Starburst, it provides an enterprise-grade solution for any company that needs fast, flexible, and easy access to its data – no matter where it lives.