Fueling Trino large-scale geospatial analysis with Starburst Warp Speed

  • Guy Mast

    Guy Mast

    Product Manager

    Starburst

Share

In our last post, we discussed two methods for running geospatial analysis with Trino and the Hive connector and explored a few optimization techniques to improve performance/cost. If you recall, we were able to improve performance by a factor of 1.2-6.8x for our queries but still ended up with ~30s response time when querying a 4.4B rows dataset on a 4 nodes cluster.

In this post, we’ll explore how the Starburst Warp Speed can significantly improve the cost/performance ratio, instantly accelerate and improve the interactivity for Trino geospatial queries, enabling large-scale analysis.

What’s missing for enabling a geospatial analysis at scale

Geospatial analytics on big data is particularly challenging because geospatial operations such as spatial joins, nearest neighbor, or point-in-polygon are inherently highly selective and computationally expensive without using an auxiliary data structure to map the geo-locations to the underlying rows. Many Trino users want to leverage their existing investment in Trino to run spatial analysis at scale — but face challenges in balancing cost/performance with business needs.

A common question on the Trino community forum and Slack channel is whether partitioning or indexing could enable faster performance of geospatial data analysis. Here are a few examples:

Question #1

“I’m looking for a solution to query massive point cloud stored in iceberg with Trino. While doing some tests, I was wondering if there is a way to improve geospatial queries in Trino? something like, building and storing a KDTree, pruning unwanted files by using a geohash or something ?”

Question #2

“Hello, I’m evaluating Trino (very green just reading docs to see if it fits our use case) and have some questions about its spatial capabilities. Does it use spatial indexes for joins between different systems? E.g. When joining points in Hive to polygons elsewhere, will it create spatial indexes in presto itself for the join?”

Question #3

Hi all, I am curious to hear about the community’s experiences with large spatial joins and their performance. We’re evaluating Trino for ad-hoc analytics on our data in AWS (in parquet).

Answer:

In the event that you are dealing with a very large geo data set, partitioning your events into smaller time intervals will allow the event data to be broadcast. You can also partition the geo data into regions, and run a larger number of smaller queries against individual or subsets of regions.

Challenges with partitioning data, by region

Solutions that often come up deal with partitioning data by region, or with suggestions to have any kind of index within Trino.

It’s worth mentioning that partitioning by high cardinality columns such as region might help a bit, but will also surface new challenges such as long-tailed data distribution across partitions and extremely small files per partition, which can significantly degrade query performance.

Dynamic geospatial indexing with Starburst Warp Speed

Starburst Warp Speed automatically accelerates queries with smart indexing and caching. It’s completely plug-and-play, just start running your queries and Warp Speed will create indexes and help you deliver the best possible performance and concurrency for geospatial queries.

Running the same experiment as before (detailed setup, queries, and results at the end of the article), now using the Starburst Warp Speed, we were able to improve performance by a factor of 5X on average!

With Warp Speed, you can run geospatial analysis at scale, while boosting cost/performance, future-proofing SLA, and growing the volume and dimensionality of analyzed data.

Mini benchmark data

Setup

4Xr6gd.4xlarge machines on AWS were used for executing queries on both Starburst’s Hive connector and Starburst Hive connector which is accelerated with the Warp Speed technology.

Results

Use Case Query Hive (s) Warp Speed(s) Warp speed improvement vs. Hive
Use case #1
Distance from points of interest 100m Query #3.1 29 20 1.45
Use case #2
Distance from points of interest 500m Query #3 27.5 4 6.875
Use case #3
lookups in a geofence Query #4 33 5 6.6

Queries

Query #1 (500M)

SELECT 
    count(distinct rider_id)
   
FROM 
    hive.ride_sharing_dataset.trips_data_bing as t,
    hive.ride_sharing_dataset.places as p
WHERE
    great_circle_distance(t.lat,t.lon,p.lat,p.lon) <= 0.5;

Query #1.1 (100M)

SELECT 
    count(distinct rider_id)
   
FROM 
    hive.ride_sharing_dataset.trips_data_bing as t,
    hive.ride_sharing_dataset.places as p
WHERE
    great_circle_distance(t.lat,t.lon,p.lat,p.lon) <= 0.1;

Query #2

SELECT 
    count(distinct driver_id)
   
FROM 
    ride_sharing_dataset.trips_data_bing t
WHERE
    ST_Contains(ST_Polygon(‘POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.072130604487681))’),

    st_point(lon,lat));

Query #2.1 (adding bounding box with lat/lon)

SELECT 
    count(distinct driver_id)
   
FROM 
    ride_sharing_dataset.trips_data_bing t
WHERE
    lon between -122.45635986328125 and -122.45567321777344

    and lat between 37.72130604487681 and 37.72184917678752

    and ST_Contains(ST_Polygon(‘POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752))’),

    st_point(lon,lat));

Query #3

with p as (

SELECT 
    lat,lon,qkey
   
FROM 
    ride_sharing_dataset.places 
    
    cross join unnest (transform(bing_tiles_around(lat,lon,19,0.5),t -> bing_tile_quadkey(t))) as t(qkey))
 
   — use that list to filter matching tiles

Query #3.1

SELECT 
    count(distinct rider_id)
   
FROM 
    ride_sharing_dataset.trips_data_bing as t,
WHERE
    v.qkey = r.qkey

    — match the bounding tile

    and great_circle_distance(t.lat,t.lon,p.lat,p.lon) <= 0.5;

Query #4

with q as (

SELECT 
    qkey
   
FROM 
    (

    values (‘POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752)) ‘) ) as a(p)

    cross join unnest (transform(geometry_to_bing_tiles(ST_Polygon(a.p),19),

    t -> bing_tile_quadkey(t))) as t(qkey) )

SELECT
    count(distinct driver_id)

FROM
    ride_sharing_dataset.trips_data_bing t,
    q

WHERE
    q.qkey = t.qkey

    and ST_Contains(ST_Polygon(‘POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752))’),

    st_point(lon,lat));

Webinar: Setting A New Standard of Data Lake Analytics

Reduce costs and improve query response times

Register now