A very basic geospatial join, based on this example, times out every time.
polygons contains 340K polygons, while
points contains 5K rows with latitude/longitude pairs (and an ID). Both are single .csv files in S3.
SELECT poly.geometry, p.id FROM polygons as poly CROSS JOIN points as p WHERE ST_CONTAINS (ST_POLYGON(poly.geometry), ST_POINT(p.lon, p.lat));
The SQL query above never completes in the default 30-minute Athena query time limit.
I’ve found vanilla Athena queries on large-ish data sets are fairly performant, but I’m not sure whether the geospatial functionality can handle (million x million) joins, let alone (billion x million) joins.
Thought I’d ask before moving back to EMR+Spark solution.
Athena service is based on Presto 0.172. That version did not have geospatial join, so your query if single-node cross-join + filter.
Newer Presto versions have the optimization so they can execute such query in a distributed manner.
Besides going back to EMR, you can try Starburst Presto for AWS which gives you easy to launch best Presto version prepackaged for AWS. (note: I work for Starburst)