Is there a way to change this BigQuery self-join to use a window function?

Let’s say I have a BigQuery table “events” (in reality this is a slow sub-query) that stores the count of events per day, by event type. There are many types of events and most of them don’t occur on most days, so there is only a row for day/event type combinations with a non-zero count.

I have a query that returns the count for each event type and day and the count for that event from N days ago, which looks like this:

WITH events AS (
  SELECT DATE('2019-06-08') AS day, 'a' AS type, 1 AS count
  UNION ALL SELECT '2019-06-09', 'a', 2
  UNION ALL SELECT '2019-06-10', 'a', 3
  UNION ALL SELECT '2019-06-07', 'b', 4
  UNION ALL SELECT '2019-06-09', 'b', 5
)
SELECT e1.type, e1.day, e1.count, COALESCE(e2.count, 0) AS prev_count
FROM events e1
LEFT JOIN events e2 ON e1.type = e2.type AND e1.day = DATE_ADD(e2.day, INTERVAL 2 DAY) -- LEFT JOIN, because the event may not have occurred at all 2 days ago
ORDER BY 1, 2

​x
 
WITH events AS (  SELECT DATE('2019-06-08') AS day, 'a' AS type, 1 AS count  UNION ALL SELECT '2019-06-09', 'a', 2  UNION ALL SELECT '2019-06-10', 'a', 3  UNION ALL SELECT '2019-06-07', 'b', 4  UNION ALL SELECT '2019-06-09', 'b', 5)SELECT e1.type, e1.day, e1.count, COALESCE(e2.count, 0) AS prev_countFROM events e1LEFT JOIN events e2 ON e1.type = e2.type AND e1.day = DATE_ADD(e2.day, INTERVAL 2 DAY) -- LEFT JOIN, because the event may not have occurred at all 2 days agoORDER BY 1, 2​

The query is slow. BigQuery best practices recommend using window functions instead of self-joins. Is there a way to do this here? I could use the LAG function if there was a row for each day, but there isn’t. Can I “pad” it somehow? (There isn’t a short list of possible event types. I could of course join to SELECT DISTINCT type FROM events, but that probably won’t be faster than the self-join.)

Answer

Below is for BigQuery Standard SQL

#standardSQL
SELECT *, IFNULL(FIRST_VALUE(count) OVER (win), 0) prev_count
FROM `project.dataset.events`
WINDOW win AS (PARTITION BY type ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND 2 PRECEDING)

 
#standardSQLSELECT *, IFNULL(FIRST_VALUE(count) OVER (win), 0) prev_countFROM `project.dataset.events`WINDOW win AS (PARTITION BY type ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND 2 PRECEDING)   ​

If t apply to sample data from your question – result is:

Row day         type    count   prev_count   
1   2019-06-08  a       1       0    
2   2019-06-09  a       2       0    
3   2019-06-10  a       3       1    
4   2019-06-07  b       4       0    
5   2019-06-09  b       5       4

 
Row day         type    count   prev_count   1   2019-06-08  a       1       0    2   2019-06-09  a       2       0    3   2019-06-10  a       3       1    4   2019-06-07  b       4       0    5   2019-06-09  b       5       4    ​

Advertisement

Answer