In our last post, we discussed two methods for running geospatial analysis with Trino and the Hive connector, and explored a few optimization techniques to improve performance and cost. If you recall, we improved query performance by a factor of 1.2-6.8x, but still ended up with ~30s response times when querying a 4.4B-row dataset on a 4-node cluster.
In this post, we'll explore how Starburst Warp Speed can significantly improve the cost/performance ratio and instantly make Trino geospatial queries more interactive, enabling large-scale analysis.
What's missing for geospatial analysis at scale
Geospatial analytics on big data is particularly challenging because geospatial operations such as spatial joins, nearest-neighbor searches, or point-in-polygon tests are highly selective yet computationally expensive unless an auxiliary data structure maps geo-locations to the underlying rows. Many Trino users want to leverage their existing investment in Trino to run spatial analysis at scale, but struggle to balance cost/performance with business needs.
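To make the cost concrete, consider a naive point-in-polygon filter. Here is a minimal sketch (the catalog, schema, and table names are hypothetical, not from our benchmark): with no spatial index or file pruning, Trino has to scan every row and evaluate the geometric predicate on each one.
-- Hypothetical example: without a spatial index, counting the events
-- inside even a tiny polygon requires scanning the entire table and
-- evaluating ST_Contains once per row.
SELECT
count(*)
FROM
hive.demo.events e
WHERE
ST_Contains(ST_Polygon('POLYGON ((-122.46 37.72, -122.45 37.72, -122.45 37.73, -122.46 37.73, -122.46 37.72))'),
st_point(e.lon,e.lat));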
A common question on the Trino community forum and Slack channel is whether partitioning or indexing could make geospatial analysis faster. Here are a few examples:
Question #1
“I’m looking for a solution to query massive point cloud stored in iceberg with Trino. While doing some tests, I was wondering if there is a way to improve geospatial queries in Trino? something like, building and storing a KDTree, pruning unwanted files by using a geohash or something ?”
Question #2
“Hello, I’m evaluating Trino (very green just reading docs to see if it fits our use case) and have some questions about its spatial capabilities. Does it use spatial indexes for joins between different systems? E.g. When joining points in Hive to polygons elsewhere, will it create spatial indexes in presto itself for the join?”
Question #3
Hi all, I am curious to hear about the community’s experiences with large spatial joins and their performance. We’re evaluating Trino for ad-hoc analytics on our data in AWS (in parquet).
Answer:
In the event that you are dealing with a very large geo data set, partitioning your events into smaller time intervals will allow the event data to be broadcast. You can also partition the geo data into regions, and run a larger number of smaller queries against individual or subsets of regions.
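To illustrate that suggestion, here is a minimal sketch (the catalog, schema, and table names are hypothetical, and we assume the events table is partitioned by an event_date column): each query covers a single time slice, keeping the event data small.
-- Hypothetical sketch: limit the spatial join to one day of events so the
-- filtered event data stays small enough to be broadcast.
SELECT
count(distinct e.rider_id)
FROM
hive.demo.events e,
hive.demo.places p
WHERE
e.event_date = DATE '2023-06-01'
and great_circle_distance(e.lat,e.lon,p.lat,p.lon) <= 0.5;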
Challenges with partitioning data by region
Suggested solutions often involve partitioning the data by region, or adding some kind of index within Trino.
It's worth mentioning that partitioning by a high-cardinality column such as region might help a bit, but it also surfaces new challenges, such as a long-tailed data distribution across partitions and extremely small files per partition, which can significantly degrade query performance.
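For illustration, here is roughly what the partitioning approach looks like with the Hive connector. This is a minimal sketch with hypothetical table and region names, not part of our benchmark: the table is partitioned by a region column, and each query restricts itself to one or a few regions.
-- Hypothetical sketch: partition trip data by region in the Hive connector.
CREATE TABLE hive.demo.trips_by_region (
rider_id bigint,
lat double,
lon double,
region varchar -- partition columns must come last
)
WITH (partitioned_by = ARRAY['region']);

-- Each query then targets one region, so all other partitions are pruned:
SELECT
count(distinct rider_id)
FROM
hive.demo.trips_by_region
WHERE
region = 'bay_area'
and great_circle_distance(lat,lon,37.7213,-122.4563) <= 0.5;
Note that the pruning only helps queries that actually filter on the partition column, which is part of why this approach falls short for ad-hoc geospatial analysis.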
Dynamic geospatial indexing with Starburst Warp Speed
Starburst Warp Speed automatically accelerates queries with smart indexing and caching. It's completely plug-and-play: just start running your queries, and Warp Speed will create indexes and help you deliver the best possible performance and concurrency for geospatial queries.
Running the same experiment as before (detailed setup, queries, and results at the end of the article), now using Starburst Warp Speed, we were able to improve performance by a factor of 5x on average!
With Warp Speed, you can run geospatial analysis at scale while boosting cost/performance, future-proofing SLAs, and growing the volume and dimensionality of the data you analyze.
Mini benchmark data
Setup
Four r6gd.4xlarge machines on AWS were used to execute the queries, on both the Starburst Hive connector and the same connector accelerated with Warp Speed technology.
Results
| Use case | Query | Hive (s) | Warp Speed (s) | Warp Speed improvement vs. Hive |
| --- | --- | --- | --- | --- |
| #1: Distance from points of interest (100m) | Query #3.1 | 29 | 20 | 1.45x |
| #2: Distance from points of interest (500m) | Query #3 | 27.5 | 4 | 6.875x |
| #3: Lookups in a geofence | Query #4 | 33 | 5 | 6.6x |
Queries
Query #1 (500m)
SELECT
count(distinct rider_id)
FROM
hive.ride_sharing_dataset.trips_data_bing as t,
hive.ride_sharing_dataset.places as p
WHERE
-- great_circle_distance returns kilometers: 0.5 km = 500 m
great_circle_distance(t.lat,t.lon,p.lat,p.lon) <= 0.5;
Query #1.1 (100m)
SELECT
count(distinct rider_id)
FROM
hive.ride_sharing_dataset.trips_data_bing as t,
hive.ride_sharing_dataset.places as p
WHERE
-- great_circle_distance returns kilometers: 0.1 km = 100 m
great_circle_distance(t.lat,t.lon,p.lat,p.lon) <= 0.1;
Query #2
SELECT
count(distinct driver_id)
FROM
ride_sharing_dataset.trips_data_bing t
WHERE
ST_Contains(ST_Polygon('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.72130604487681))'),
st_point(lon,lat));
Query #2.1 (adding a bounding box on lat/lon, so cheap range predicates narrow the data before the expensive ST_Contains check)
SELECT
count(distinct driver_id)
FROM
ride_sharing_dataset.trips_data_bing t
WHERE
lon between -122.45635986328125 and -122.45567321777344
and lat between 37.72130604487681 and 37.72184917678752
and ST_Contains(ST_Polygon('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.72130604487681))'),
st_point(lon,lat));
Query #3 (500m)
-- map each place to the quadkeys of the Bing tiles within 0.5 km
with p as (
SELECT
lat,lon,qkey
FROM
ride_sharing_dataset.places
cross join unnest (transform(bing_tiles_around(lat,lon,19,0.5),t -> bing_tile_quadkey(t))) as t(qkey))
-- use that list to filter trips to matching tiles
SELECT
count(distinct rider_id)
FROM
ride_sharing_dataset.trips_data_bing as t,
p
WHERE
t.qkey = p.qkey
-- match the bounding tile
and great_circle_distance(t.lat,t.lon,p.lat,p.lon) <= 0.5;
Query #3.1 (100m)
-- identical to Query #3, with the bing_tiles_around radius and the
-- great_circle_distance threshold both reduced to 0.1 km (100 m)
Query #4
-- compute the quadkeys of the Bing tiles covering the geofence polygon
with q as (
SELECT
qkey
FROM
(
values ('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.72130604487681))') ) as a(p)
cross join unnest (transform(geometry_to_bing_tiles(ST_Polygon(a.p),19),
t -> bing_tile_quadkey(t))) as t(qkey) )
SELECT
count(distinct driver_id)
FROM
ride_sharing_dataset.trips_data_bing t,
q
WHERE
q.qkey = t.qkey
and ST_Contains(ST_Polygon('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.72130604487681))'),
st_point(lon,lat));
What are some next steps you can take?
Below are a few ways you can continue your journey to accelerate data access at your company:
- Automate the Icehouse: Our fully-managed open lakehouse platform
- Follow us on YouTube, LinkedIn, and X (Twitter).