Let’s say I have a BigQuery table “events” (in reality this is a slow sub-query) that stores the count of events per day, by event type. There are many types of events and most of them don’t occur on most days, so there is only a row for day/event type combinations with a non-zero count.
I have a query that returns the count for each event type and day and the count for that event from N days ago, which looks like this:
WITH events AS (
SELECT DATE('2019-06-08') AS day, 'a' AS type, 1 AS count
UNION ALL SELECT '2019-06-09', 'a', 2
UNION ALL SELECT '2019-06-10', 'a', 3
UNION ALL SELECT '2019-06-07', 'b', 4
UNION ALL SELECT '2019-06-09', 'b', 5
)
SELECT e1.type, e1.day, e1.count, COALESCE(e2.count, 0) AS prev_count
FROM events e1
LEFT JOIN events e2 ON e1.type = e2.type AND e1.day = DATE_ADD(e2.day, INTERVAL 2 DAY) -- LEFT JOIN, because the event may not have occurred at all 2 days ago
ORDER BY 1, 2
The query is slow. BigQuery best practices recommend using window functions instead of self-joins. Is there a way to do this here? I could use the LAG
function if there was a row for each day, but there isn’t. Can I “pad” it somehow? (There isn’t a short list of possible event types. I could of course join to SELECT DISTINCT type FROM events
, but that probably won’t be faster than the self-join.)
Advertisement
Answer
Below is for BigQuery Standard SQL
#standardSQL
SELECT *, IFNULL(FIRST_VALUE(count) OVER (win), 0) prev_count
FROM `project.dataset.events`
WINDOW win AS (PARTITION BY type ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND 2 PRECEDING)
If t apply to sample data from your question – result is:
Row day type count prev_count
1 2019-06-08 a 1 0
2 2019-06-09 a 2 0
3 2019-06-10 a 3 1
4 2019-06-07 b 4 0
5 2019-06-09 b 5 4