Trino is more popular than ever, but what is it? Let’s start with a definition. Trino is a massively parallel processing, distributed SQL query engine. It helps users perform data engineering and data analytics tasks on very large data sets. There’s a lot to unpack in that definition. This blog will explore Trino, from definition to use case. You’ll see how this SQL engine is shaking up the big data industry, and disrupting other processing engines along the way.
What is a query engine?
Let’s start with the query engine itself. The premise of a query engine is relatively simple. First, you begin with a data source. This could be either a Relational Database Management System (RDBMS) like PostgreSQL, or a NoSQL database like MongoDB. It could even be another data warehouse or data lake accessed through data federation. Next, in order to run analytics on the data, you need something to run and process those queries. This is true of both ad hoc queries and dashboards using real-time analytics. In traditional relational databases, such as MySQL or PostgreSQL, the query engine is built into the database. This means that you can run SQL queries without needing any additional software.
Trino changes all of this. It opens up the idea of a query engine and makes it available for all kinds of different workloads. In this sense, it is part of the open data stack. In the case of data lakes and data lakehouses, whether they run Hive, Hadoop, Delta Lake, Hudi, or Iceberg, the data storage only stores your data; it doesn’t process it. For many people, running a data lakehouse based on cloud object storage is best. This is true whether using Amazon S3, Azure Blob Storage, or Google Cloud Storage. Whichever you choose, you need something separate to query it. That’s where Trino comes in, and it does its job quickly, efficiently, and with more support and integrations than any other pure query engine out there.
The history of Trino
Facebook originally created Trino in 2012 under the name of Presto. Facebook used it to query very large datasets, specifically legacy Hive data warehouses based on Hadoop HDFS. The goal was to cut data analytics query times down from days to hours, and from hours to minutes. Over time, the project gained adoption by many large tech companies. This includes Netflix, Uber, Airbnb, and LinkedIn.
In 2019, the co-founders of the project left Facebook and created a fork, later renaming it Trino. Since then, it has taken over as the de facto branch of the Presto/Trino project, with significantly more development, faster advancement of features, and more widespread adoption in the data community.
What makes a query engine a query engine
It’s worth diving deeper into exactly what makes a query engine a query engine. Without a doubt, one of the biggest misconceptions about a query engine is thinking of it as a database. Trino, like other query engines, does not store data. Instead, in order to use Trino, you need an underlying data source. Once you have that, Trino connects to the data source, and uses it to run queries. Importantly, it does this using a connector-based architecture. The architecture consists of a core query engine along with the ability to connect that engine to a wide variety of data sources.
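To make the separation between engine and data source concrete, here is a toy model in Python. It is only a sketch of the idea, not Trino’s actual connector interface (Trino’s real plugin interface is written in Java and is far richer). The names `Connector`, `InMemoryConnector`, `Engine`, and the `crm` catalog are all hypothetical and invented for illustration.

```python
from typing import Callable, Dict, Iterable, List, Protocol

Row = Dict[str, object]


class Connector(Protocol):
    """Hypothetical interface: anything that can yield rows for a table name."""

    def scan(self, table: str) -> Iterable[Row]: ...


class InMemoryConnector:
    """Toy data source backed by a dict of table name -> rows."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterable[Row]:
        return iter(self._tables[table])


class Engine:
    """Toy engine: it stores no data itself, only a registry of connectors."""

    def __init__(self) -> None:
        self._catalogs: Dict[str, Connector] = {}

    def register(self, catalog: str, connector: Connector) -> None:
        self._catalogs[catalog] = connector

    def select(self, catalog: str, table: str,
               where: Callable[[Row], bool] = lambda row: True) -> List[Row]:
        # The connector only supplies rows; the engine applies the filter.
        return [row for row in self._catalogs[catalog].scan(table) if where(row)]


engine = Engine()
engine.register("crm", InMemoryConnector({"customers": [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Grace", "country": "US"},
]}))
uk_customers = engine.select("crm", "customers", lambda r: r["country"] == "UK")
print(uk_customers)
```

Swapping in a different data source means registering another object with a `scan` method; the engine code does not change. That is the property the connector-based design is meant to provide.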
Trino works best with data lakes and lakehouses based on Apache Iceberg, Delta Lake, Hudi, or Hive. Because it includes dozens of other connectors, Trino can also be used for query federation. Federated queries use data stored in multiple systems and databases. Trino can connect to and query all of them in unison. This approach uses joins to combine the disparate data with a single SQL query.
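The shape of a federated query can be illustrated without a Trino cluster. The sketch below uses Python’s built-in `sqlite3` module as a stand-in: two attached in-memory databases play the role of two separate systems, and a single SQL statement joins across them. This is an analogy only. SQLite is not Trino, and the names `crm`, `lake`, `customers`, and `orders` are assumptions made up for the example. In Trino, tables are addressed as catalog.schema.table rather than with the two-part names shown here.

```python
import sqlite3

# One connection plays the role of the query engine.
conn = sqlite3.connect(":memory:")

# Two attached in-memory databases stand in for two separate systems.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO lake.orders VALUES (?, ?)",
                 [(1, 20.0), (1, 5.0), (2, 12.5)])

# A single SQL statement joins data that lives in two different stores.
federated_sql = """
    SELECT c.name, SUM(o.amount) AS total
    FROM crm.customers AS c
    JOIN lake.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
"""
totals = conn.execute(federated_sql).fetchall()
print(totals)
```

The person writing the query only needs to know which source each table lives in. Where and how the rows are fetched is left to the engine.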
Understanding Trino architecture
In addition to having a connector-based architecture used to access different data sources, Trino also has a massively parallel processing (MPP), distributed architecture. This design allows it to scale up and down according to need, enabling it to handle large-scale datasets with petabyte or exabyte workloads. And because it can read from various data sources, Trino lets data engineers build complex data pipelines that bring together everything data analysts need for data science projects and dashboards.
Using a single coordinator node and as many worker nodes as you need, a cluster can distribute a Trino query in the most efficient way possible. This might involve a handful of workers, or dozens, or even hundreds, each working in parallel. This approach ensures that no matter how large your dataset is, you can always use it for analytics. Trino also employs a number of optimizations including join reordering, predicate pushdown, and partial aggregations. Using these techniques, Trino intelligently avoids doing unnecessary work. It limits compute costs, and processes your query as fast as possible with very little latency.
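As a rough illustration of parallel work combined with partial aggregation, the Python sketch below computes an average by splitting the input, letting each worker reduce its split to a small (sum, count) pair, and then merging those pairs. Threads stand in for worker nodes here, and the function names are hypothetical. This is a simplified model of the idea under those assumptions, not a description of how Trino schedules or executes stages internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Partial = Tuple[float, int]  # (sum, count) for one split


def partial_aggregate(split: Sequence[float]) -> Partial:
    """Worker step: reduce one split to a small (sum, count) pair."""
    return (sum(split), len(split))


def final_aggregate(partials: Sequence[Partial]) -> float:
    """Final step: merge the partial results into a single average."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else float("nan")


def make_splits(values: Sequence[float], n: int) -> List[Sequence[float]]:
    """Divide the input into n roughly equal splits."""
    return [values[i::n] for i in range(n)]


def distributed_avg(values: Sequence[float], workers: int = 4) -> float:
    splits = make_splits(values, workers)
    # Each split is aggregated independently and in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, splits))
    # Only the small partial results are combined, not the raw rows.
    return final_aggregate(partials)


values = [float(v) for v in range(1, 101)]
average = distributed_avg(values, workers=4)
print(average)
```

The point of the partial step is that each worker hands on two numbers instead of its whole split, so far less data has to move before the final result is produced.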
The video below shows how the Trino architecture works in practice.
The benefits of using Trino for SQL queries
Trino almost exclusively uses ANSI SQL syntax. SQL is the main language used by data scientists, and most data engineers know it well too. This ensures that your queries are interoperable with other data analytics systems. It also makes it easier for clients, visualization tools, and other integrations to be compatible with Trino. The Trino ecosystem is vast.
This means that no matter what your data stack looks like now or in the future, it should be painless to integrate with Trino. This unparalleled combination of federation, integrations, and high performance is what makes Trino shine. Its use of basic SQL syntax, which most data scientists and analysts should already know well, means that you don’t need to learn specific tips or tricks to use Trino. Once it has been configured and deployed, end users should find it easy to begin delivering insights from your data.
Is Trino right for you?
Trino has two core use cases:
- Handling data lakes and lakehouses at scale
- Handling data federation for organizations with data in several different places.
If either of these scenarios applies to you, then you will gain the maximum value as a Trino user. Beyond this, Trino supports users who deploy it at a small scale. In these settings, even though performance isn’t a major concern and data federation isn’t typically necessary, Trino still provides a lot of value as an industry-standard tool that’s easy to use and easy to integrate with other parts of the data ecosystem. As a general rule, any time that performance, scale, and cost are primary concerns, you’ll want to consider Trino.
Here’s how to make that decision properly
Of course, comparisons are important, and no less so with Trino. You can find many benchmarks and opinions insisting that tool X is faster, engine Y is superior, or database Z is better still. Because Trino is a pure query engine, it is relatively easy to test for yourself, and this is the approach we recommend.
To do this, use the following approach. First, connect Trino and any other systems that you’re considering to your data stack. Second, run a typical analytics workload of what you might expect to run on a daily or hourly basis. Ideally, this will involve real queries that you’ve already run, and will use the hardware that you would actually use in real-world scenarios. Review the results, and compare these to your expectations. Repeat the process as many times as you need. As you conduct your experiment, remember that every system, workload, network connection, and data set is different. The only way to truly understand what will work best for you is to roll up your sleeves and try it out for yourself.
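The comparison described above can be scripted with a small timing harness. The Python sketch below is one possible shape for it, using only the standard library. The `benchmark` function, the `fake_engine` runner, and the query names are all hypothetical placeholders. To use it for real, you would pass in a function that submits SQL through whichever client each candidate system provides.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(run_query: Callable[[str], object], queries: Dict[str, str],
              repeats: int = 5) -> Dict[str, float]:
    """Return the median wall-clock seconds for each named query."""
    results: Dict[str, float] = {}
    for name, sql in queries.items():
        timings: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)
            timings.append(time.perf_counter() - start)
        # The median is less sensitive to one slow run than the mean.
        results[name] = statistics.median(timings)
    return results


def fake_engine(sql: str) -> int:
    """Placeholder runner; replace with a call into your real client."""
    return len(sql)


# Hypothetical workload: use queries you actually run day to day.
workload = {"daily_revenue": "SELECT 1", "top_customers": "SELECT 2"}
report = benchmark(fake_engine, workload, repeats=3)
for name, seconds in report.items():
    print(f"{name}: {seconds:.6f}s")
```

Running the same workload dictionary against each system, on the hardware you intend to use, gives you numbers that reflect your own data rather than someone else’s benchmark.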
And of course, while high performance, scalability, and cost are very important, they aren’t everything. As you deliberate, make sure that you consider which features you need and which features you don’t need. Choosing a system that goes 10% faster doesn’t achieve much if, by doing so, you’re missing critical features that limit your ability to access or visualize your data in the way that you need or want.
Trino vs Presto
As discussed earlier, Trino was forked from Presto in 2019. Since then, both projects have remained under development and have diverged considerably. At the time of writing, Trino has seen more commits than Presto, and it now includes more features. Today, Presto’s main selling points are vector acceleration for Hive and Presto on Spark, and it uses a different type of SQL known as prestoSQL. These two improvements primarily impact the Hive-Spark data stack that Facebook uses, and to this end Facebook itself has been a major contributor to the project.
In contrast, Trino has undergone more robust development on its core engine than Presto. This has allowed it to compete with Presto’s performance on Hive even without vector acceleration. Meanwhile, Trino includes additional features that set it apart from Presto. These include SQL MERGE, local filesystem caching, fault-tolerant execution, polymorphic table functions, support for modern Java versions, and a number of new connectors. In light of this, the data community has largely shifted to supporting Trino instead of Presto, and the Trino community is a vibrant and dynamic open source community. As a result, Trino integrations are better maintained and more likely to remain that way in the future.
There’s no mincing words here: Trino is the better choice for virtually all scenarios compared to Presto.
Trino vs Spark
Spark and Trino are two different tools, with two different use cases. Because of this, they are not in direct competition with each other in the way that other query engines might be. Instead, comparisons between Trino and Spark are best assessed by reviewing your workload and choosing the best tool for the job. For example, Spark is best used for ETL/ELT and data transformation workloads. In this arena it still performs better than nearly any other tool available. Although Spark is not the fastest compute engine, it is reliable for ETL/ELT workloads. It also includes fault-tolerance. For this reason, Spark has extremely widespread adoption for handling ETL tasks.
On the other hand, Trino is primarily built for analytics. Unlike Spark, it is designed to access and understand your data as quickly as possible. It performs this type of workload much better and faster than Spark. For this reason, if you have serious analytics workloads, you should consider using Trino instead of Spark.
There are also areas of convergence between Spark and Trino. Trino’s fault-tolerant execution (FTE) mode is comparable to Spark’s fault tolerance, though its adoption in this area is less entrenched because the feature is newer. If you’re currently using Spark and not Trino, we wouldn’t recommend jumping from Spark to Trino with FTE enabled as a replacement for transformation workloads. However, if you are already using Trino, a separate FTE cluster can eliminate Spark from the picture. In this scenario, Trino’s fault tolerance would allow you to use Trino for your entire workload, simplifying your data stack considerably.
Trino vs Starburst
Starburst is the open-core company behind most of Trino’s ongoing development. We offer both on-prem and cloud versions of Trino, called Enterprise and Galaxy, respectively. Why use Starburst if Trino is so powerful? Although Trino is open source, and you can deploy and manage it yourself, doing so is highly manual and complex. Because of this, some organizations lack the internal resources to properly adopt Trino in its open source form, even though they would benefit from the architecture itself. Starburst is designed to solve this problem, making Trino easy and accessible to everyone. Perhaps you’re unsure how best to go about using Trino. Maybe you don’t want to deal with the headache of provisioning and managing your own servers and clusters. In these cases, Starburst can simplify the use of Trino for you.
Starburst has also made a number of proprietary improvements to Trino, both for Starburst Enterprise and Starburst Galaxy. These include Warp Speed, a feature that allows you to index your dataset, achieving query speed improvements of up to 700% and reducing compute costs by up to 40%. Starburst’s version of Trino also includes several additional connectors. In addition, Starburst offers enhanced data telemetry and data governance, and provides more flexibility and versatility with access control and data security.
Why Starburst is the best way to use Trino
Overall, why should you choose Starburst if you’re considering Trino? There are a number of reasons. First, we are the Trino experts, home to the Trino/Presto co-founders and engineers who have been working on the project for over a decade. Second, we make the job of using Trino easy. With Starburst Galaxy, Trino is fully managed in the cloud for you. This allows you to worry less about configuration and tuning, and think more about what you want to do with your data.
Finally, Starburst offers unique features that augment and extend Trino. With autoscaling and auto-shutdown, you don’t need to worry about capacity management. And because Starburst maintains the deployment, there’s no need to worry about updates to Trino: we do all the hard work for you without any downtime.
Ready to explore Starburst Galaxy and Trino together? Sign up for a free trial.