Improve SQL query to find range between start and end date

I’m working with a database called international_education from the world_bank_intl_education dataset of bigquery-public-data.

FIELDS            
country_name      
country_code      
indicator_name    
indicator_code    
value   
year

​x
 
FIELDS            country_name      country_code      indicator_name    indicator_code    value   year​

My aim is to plot a line graph with countries who have had the biggest and smallest change in Population growth (annual %) (one of the indicator_name values).

I have done this below using two partitions finding the first and last value of the year by each country but I’m rough on my SQL and am wondering if there is a way to optimize this formula.

query = """
WITH differences AS
    (
    SELECT country_name, year, value, 
      FIRST_VALUE(value)
      OVER (
          PARTITION BY country_name
          ORDER BY year
          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS small_value,
      LAST_VALUE(value)
      OVER (
          PARTITION BY country_name
          ORDER BY year
          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS large_value
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    WHERE indicator_name = 'Population growth (annual %)'
    ORDER BY year
    )
SELECT country_name, year, (large_value-small_value) AS total_range, value
FROM differences
ORDER BY total_range
        """

 
query = """WITH differences AS    (    SELECT country_name, year, value,       FIRST_VALUE(value)      OVER (          PARTITION BY country_name          ORDER BY year          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING      ) AS small_value,      LAST_VALUE(value)      OVER (          PARTITION BY country_name          ORDER BY year          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING      ) AS large_value    FROM `bigquery-public-data.world_bank_intl_education.international_education`    WHERE indicator_name = 'Population growth (annual %)'    ORDER BY year    )SELECT country_name, year, (large_value-small_value) AS total_range, valueFROM differencesORDER BY total_range        """​

Convert to pandas dataframe.

df= wbed.query_to_pandas_safe(query)
df.head(10)

 
df= wbed.query_to_pandas_safe(query)df.head(10)​

Resulting table.

    country_name            year    total_range value
0   United Arab Emirates    1970    -13.195183  14.446942
1   United Arab Emirates    1971    -13.195183  16.881671
2   United Arab Emirates    1972    -13.195183  17.689814
3   United Arab Emirates    1973    -13.195183  17.695296
4   United Arab Emirates    1974    -13.195183  17.125615
5   United Arab Emirates    1975    -13.195183  16.211873
6   United Arab Emirates    1976    -13.195183  15.450884
7   United Arab Emirates    1977    -13.195183  14.530119
8   United Arab Emirates    1978    -13.195183  13.033461
9   United Arab Emirates    1979    -13.195183  11.071306

 
    country_name            year    total_range value0   United Arab Emirates    1970    -13.195183  14.4469421   United Arab Emirates    1971    -13.195183  16.8816712   United Arab Emirates    1972    -13.195183  17.6898143   United Arab Emirates    1973    -13.195183  17.6952964   United Arab Emirates    1974    -13.195183  17.1256155   United Arab Emirates    1975    -13.195183  16.2118736   United Arab Emirates    1976    -13.195183  15.4508847   United Arab Emirates    1977    -13.195183  14.5301198   United Arab Emirates    1978    -13.195183  13.0334619   United Arab Emirates    1979    -13.195183  11.071306​

I would then plot this with python as follows.

all_countries = df.groupby('country_name', as_index=False).max().sort_values(by='total_range').country_name.values

countries = np.concatenate((all_countries[:3], all_countries[-4:]))

plt.figure(figsize=(16, 8))
sns.lineplot(x='year',y='value', data=df[df.country_name.isin(countries)], hue='country_name')

 
all_countries = df.groupby('country_name', as_index=False).max().sort_values(by='total_range').country_name.values​countries = np.concatenate((all_countries[:3], all_countries[-4:]))​plt.figure(figsize=(16, 8))sns.lineplot(x='year',y='value', data=df[df.country_name.isin(countries)], hue='country_name')​

Answer

You don’t need the CTE and you don’t need the window frame definitions. So this should be equivalent:

SELECT country_name, year, value,
       (first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR DESC) -
        first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR)
       ) as total_range
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)';

 
SELECT country_name, year, value,       (first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR DESC) -        first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR)       ) as total_rangeFROM `bigquery-public-data.world_bank_intl_education.international_education`WHERE indicator_name = 'Population growth (annual %)';​

Note that LAST_VALUE() is finicky with window frame definitions. So I routinely just use FIRST_VALUE() with the order by reversed.

If you want just one row per country, then you need aggregation. BigQuery doesn’t have “first” and “last” aggregation functions, but they are very easy to do with arrays:

SELECT country_name, 
       ((array_agg(value ORDER BY year DESC LIMIT 1))[ordinal(1)] -
        (array_agg(value ORDER BY year LIMIT 1))[ordinal(1)]
       ) as total_range
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)'
GROUP BY country_name
ORDER BY total_range;

 
SELECT country_name,        ((array_agg(value ORDER BY year DESC LIMIT 1))[ordinal(1)] -        (array_agg(value ORDER BY year LIMIT 1))[ordinal(1)]       ) as total_rangeFROM `bigquery-public-data.world_bank_intl_education.international_education`WHERE indicator_name = 'Population growth (annual %)'GROUP BY country_nameORDER BY total_range;​

Advertisement

Answer