Skip to content
Advertisement

Finding the maximum distance between events in a group in gdelt-bq.full dataset, BigQuery

I need to find the longest distance across all the event points for each country in gdelt-bq.full:events dataset. For having information about countries make groups there is a join with gdelt-bq:extra.countryinfo. So now I have this table:

SELECT *
FROM [gdelt-bq:full.events] events JOIN
     [gdelt-bq:extra.countryinfo] countries
     ON events.Actor1CountryCode = countries.iso3

The difficulty is that there are around 50k events in total, and maximum within a group is 15K (for US) and I need to calculate all the distances within a group first (events of one country), but all the events latitude and longitude are in one column. So I need to create all the pairs of events to calculate all of them and find the longest, which for the biggest group is number of 2-combinations from 15k set which is around 11kk.

By the way I found a haversine function for calculating the distance here HAVERSINE distance in BigQuery?

Maybe there’s something wrong in my approach? Any help appreciated.

Advertisement

Answer

Below is for BigQuery Standard SQL
Here, instead of focusing on events, code rather goes off of lat, long which greatly reduces volume needed to be processed thus avoiding famous “Resources exceeded …”

#standardSQL
CREATE TEMPORARY FUNCTION distance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64)
RETURNS FLOAT64 AS ((
WITH constants AS (
  SELECT 0.017453292519943295 AS p
) 
SELECT ROUND(12742 * ASIN(SQRT(
  0.5 - COS((lat2 - lat1) * p)/2 + 
  COS(lat1 * p) * COS(lat2 * p) * 
  (1 - COS((lon2 - lon1) * p))/2)), 2)
FROM constants
));
WITH events AS (
  SELECT * 
  FROM `gdelt-bq.full.events` 
  WHERE NOT(IFNULL(ActionGeo_Lat,0)=0 OR IFNULL(ActionGeo_Long,0)=0)
),
geos AS (
  SELECT DISTINCT ActionGeo_CountryCode code, ActionGeo_Lat lat, ActionGeo_Long long
  FROM events
)
SELECT c.code code, country, geo[safe_offset(0)].*
FROM (
  SELECT code, ARRAY_AGG(STRUCT(dist, lat1, long1, lat2, long2) ORDER BY dist DESC LIMIT 1) AS geo
  FROM (
    SELECT e1.code code, e1.Lat lat1, e1.Long long1, e2.Lat lat2, e2.Long long2,
      distance(e1.Lat, e1.Long, e2.Lat, e2.Long) dist
    FROM geos e1
    JOIN geos e2
    ON  e1.code = e2.code
    AND e1.Lat > e2.Lat
  )
  GROUP BY code
) c
JOIN `gdelt-bq.extra.countryinfo` countries
ON c.code = countries.iso
-- ORDER BY dist DESC

As an example from output (be patient – it took about 12-13 min for me to run above query) – top 5 countries (by distance) are as below:

code    country                 dist        lat1    long1       lat2    long2    
US  United States               13468.78    18.1131 -65.3531    8.7318  167.74   
MP  Northern Mariana Islands    10508.24    16.0    146.0       -20.0   57.0     
PF  French Polynesia             9403.5     15.7833 111.2   6.339869976043701   -162.6750030517578   
LS  Lesotho                      8741.97    47.2333 9.51667     -29.5   28.5     
RS  Serbia                       8075.75    54.4922 168.12      43.4151 39.9248  

Note: instead of distance function used here you can use any function of your choice – for example you can use HAVERSINE distance function you mentioned in your question (does not matter)

Also you can control how many top distances you want in output by changing LIMIT 1 in ARRAY_AGG() to LIMIT 3 for example or whatever number you want

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement