Let’s say I have a BigQuery table “events” (in reality this is a slow sub-query) that stores the count of events per day, by event type. There are many types of events and most of them don’t occur on most days, so there is only a row for day/event type combinations with a non-zero count.
I have a query that returns the count for each event type and day and the count for that event from N days ago, which looks like this:
WITH events AS ( SELECT DATE('2019-06-08') AS day, 'a' AS type, 1 AS count UNION ALL SELECT '2019-06-09', 'a', 2 UNION ALL SELECT '2019-06-10', 'a', 3 UNION ALL SELECT '2019-06-07', 'b', 4 UNION ALL SELECT '2019-06-09', 'b', 5 ) SELECT e1.type, e1.day, e1.count, COALESCE(e2.count, 0) AS prev_count FROM events e1 LEFT JOIN events e2 ON e1.type = e2.type AND e1.day = DATE_ADD(e2.day, INTERVAL 2 DAY) -- LEFT JOIN, because the event may not have occurred at all 2 days ago ORDER BY 1, 2
The query is slow. BigQuery best practices recommend using window functions instead of self-joins. Is there a way to do this here? I could use the LAG
function if there was a row for each day, but there isn’t. Can I “pad” it somehow? (There isn’t a short list of possible event types. I could of course join to SELECT DISTINCT type FROM events
, but that probably won’t be faster than the self-join.)
Advertisement
Answer
Below is for BigQuery Standard SQL
#standardSQL SELECT *, IFNULL(FIRST_VALUE(count) OVER (win), 0) prev_count FROM `project.dataset.events` WINDOW win AS (PARTITION BY type ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND 2 PRECEDING)
If t apply to sample data from your question – result is:
Row day type count prev_count 1 2019-06-08 a 1 0 2 2019-06-09 a 2 0 3 2019-06-10 a 3 1 4 2019-06-07 b 4 0 5 2019-06-09 b 5 4