Skip to content
Advertisement

Finding the maximum distance between events in a group in gdelt-bq.full dataset, BigQuery

I need to find the longest distance across all the event points for each country in gdelt-bq.full:events dataset. For having information about countries make groups there is a join with gdelt-bq:extra.countryinfo. So now I have this table:

The difficulty is that there are around 50k events in total, and maximum within a group is 15K (for US) and I need to calculate all the distances within a group first (events of one country), but all the events latitude and longitude are in one column. So I need to create all the pairs of events to calculate all of them and find the longest, which for the biggest group is number of 2-combinations from 15k set which is around 11kk.

By the way I found a haversine function for calculating the distance here HAVERSINE distance in BigQuery?

Maybe there’s something wrong in my approach? Any help appreciated.

Advertisement

Answer

Below is for BigQuery Standard SQL
Here, instead of focusing on events, code rather goes off of lat, long which greatly reduces volume needed to be processed thus avoiding famous “Resources exceeded …”

As an example from output (be patient – it took about 12-13 min for me to run above query) – top 5 countries (by distance) are as below:

Note: instead of distance function used here you can use any function of your choice – for example you can use HAVERSINE distance function you mentioned in your question (does not matter)

Also you can control how many top distances you want in output by changing LIMIT 1 in ARRAY_AGG() to LIMIT 3 for example or whatever number you want

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement