What is meant by “query data where it lives” with Trino?

bitsondatadev · October 12, 2021, 1:47am

TL;DR: Query the data where it lives is a quick way of saying you don’t need to move the data from other databases into one database in order to run queries across all of your data. In other words, Presto can act as a hub to query multiple databases, and perform further processing on the data using standard ANSI SQL.

One use case that I ran into in my last company was we needed to have a standard way to access data from our Elasticsearch cluster and our data lake (Hive/HDFS) and combine those two data sources. The only difference is we used Trino rather than Presto, since Trino is the fork that the creators of Presto now maintain. The examples still applies to both.

Elasticsearch stores data in an Apache Lucene index and is really only accessible through Elasticsearch clients which derive from the Elasticsearch query DSL.

Hive’s data is generally stored in an open file format (ORC, JSON, AVRO, or Parquet), and resides in a distributed filesystem like HDFS or S3 cloud storage solutions. You query it via HiveQL which is kind of like SQL but a special dialect.

We had to write and maintain a lot of code to interface with both of these systems, especially to maintain the models that queried each of these. There were countless issues and bugs that came from maintaining this code and keeping both systems aligned with correctly querying the data from each of these systems. For example, take a look at this Elasticsearch query verses the HiveQL equivalent.

curl -X POST "localhost:9200/my_table/_search?pretty" -H 'Content-Type:application/json' -d'
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user_id" : "kimchy" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      }
    }
  }
}
'

HiveQL equivalent of the Elasticsearch query above.

SELECT * 
FROM my_table 
WHERE user_id = "kimchy" AND NOT (age >= 10 AND age <= 20);

What is truly powerful about Trino is that if I want to get all of this data in each system. Not only can I access both systems using just SQL, but I don’t have to move this data between systems to get the answer to this systems. I could run the following query.

SELECT * 
FROM hive.default.my_table 
WHERE user_id = "kimchy" AND NOT (age >= 10 AND age <= 20)
UNION ALL
SELECT * 
FROM elasticsearch.default.my_table 
WHERE user_id = "kimchy" AND NOT (age >= 10 AND age <= 20);

Now, you can generate the same query across both systems and union the output together and know that Trino is constantly validating correctness from all of the different connectors on every release. This is just one of the use cases Trino solves for along with being able to replace a lot of your ETL processes, etc…