I need to find the longest distance across all the event points for each country in gdelt-bq.full:events
dataset. For having information about countries make groups there is a join with gdelt-bq:extra.countryinfo
.
So now I have this table:
SELECT * FROM [gdelt-bq:full.events] events JOIN [gdelt-bq:extra.countryinfo] countries ON events.Actor1CountryCode = countries.iso3
The difficulty is that there are around 50k events in total, and maximum within a group is 15K (for US) and I need to calculate all the distances within a group first (events of one country), but all the events latitude and longitude are in one column. So I need to create all the pairs of events to calculate all of them and find the longest, which for the biggest group is number of 2-combinations from 15k set which is around 11kk.
By the way I found a haversine function for calculating the distance here HAVERSINE distance in BigQuery?
Maybe there’s something wrong in my approach? Any help appreciated.
Advertisement
Answer
Below is for BigQuery Standard SQL
Here, instead of focusing on events, code rather goes off of lat, long which greatly reduces volume needed to be processed thus avoiding famous “Resources exceeded …”
#standardSQL CREATE TEMPORARY FUNCTION distance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64) RETURNS FLOAT64 AS (( WITH constants AS ( SELECT 0.017453292519943295 AS p ) SELECT ROUND(12742 * ASIN(SQRT( 0.5 - COS((lat2 - lat1) * p)/2 + COS(lat1 * p) * COS(lat2 * p) * (1 - COS((lon2 - lon1) * p))/2)), 2) FROM constants )); WITH events AS ( SELECT * FROM `gdelt-bq.full.events` WHERE NOT(IFNULL(ActionGeo_Lat,0)=0 OR IFNULL(ActionGeo_Long,0)=0) ), geos AS ( SELECT DISTINCT ActionGeo_CountryCode code, ActionGeo_Lat lat, ActionGeo_Long long FROM events ) SELECT c.code code, country, geo[safe_offset(0)].* FROM ( SELECT code, ARRAY_AGG(STRUCT(dist, lat1, long1, lat2, long2) ORDER BY dist DESC LIMIT 1) AS geo FROM ( SELECT e1.code code, e1.Lat lat1, e1.Long long1, e2.Lat lat2, e2.Long long2, distance(e1.Lat, e1.Long, e2.Lat, e2.Long) dist FROM geos e1 JOIN geos e2 ON e1.code = e2.code AND e1.Lat > e2.Lat ) GROUP BY code ) c JOIN `gdelt-bq.extra.countryinfo` countries ON c.code = countries.iso -- ORDER BY dist DESC
As an example from output (be patient – it took about 12-13 min for me to run above query) – top 5 countries (by distance) are as below:
code country dist lat1 long1 lat2 long2 US United States 13468.78 18.1131 -65.3531 8.7318 167.74 MP Northern Mariana Islands 10508.24 16.0 146.0 -20.0 57.0 PF French Polynesia 9403.5 15.7833 111.2 6.339869976043701 -162.6750030517578 LS Lesotho 8741.97 47.2333 9.51667 -29.5 28.5 RS Serbia 8075.75 54.4922 168.12 43.4151 39.9248
Note: instead of distance function used here you can use any function of your choice – for example you can use HAVERSINE distance function you mentioned in your question (does not matter)
Also you can control how many top distances you want in output by changing LIMIT 1 in ARRAY_AGG() to LIMIT 3 for example or whatever number you want