I’m trying to count the number of unique active users on each date, looking back over rolling 90/30/7-day windows. I’ve got this working on a limited number of rows with the query below, but for large data sets I get memory errors because the aggregated string becomes massive.
I’m looking for a more efficient way of achieving the same result.
Table looks something like this:
+---+------------+--------+
|   | date       | userid |
+---+------------+--------+
| 1 | 2013-05-14 | xxxxx  |
| 2 | 2017-03-14 | xxxxx  |
| 3 | 2018-01-24 | xxxxx  |
| 4 | 2013-03-21 | xxxxx  |
| 5 | 2014-03-19 | xxxxx  |
| 6 | 2015-09-03 | xxxxx  |
| 7 | 2014-02-06 | xxxxx  |
| 8 | 2014-10-30 | xxxxx  |
| ..| ...        | ...    |
+---+------------+--------+
Format of the desired result:
+---+------------+---------------------+----------------------+
|   | date       | active_users_7_days | active_users_90_days |
+---+------------+---------------------+----------------------+
| 1 | 2013-05-14 | 1240                | 34339                |
| 2 | 2017-03-14 | 4334                | 54343                |
| 3 | 2018-01-24 | .....               | .....                |
| 4 | 2013-03-21 | .....               | .....                |
| 5 | 2014-03-19 | .....               | .....                |
| 6 | 2015-09-03 | .....               | .....                |
| 7 | 2014-02-06 | .....               | .....                |
| 8 | 2014-10-30 | .....               | .....                |
| ..| ...        | .....               | .....                |
+---+------------+---------------------+----------------------+
My query looks like this:
#standardSQL
WITH T1 AS (
  SELECT
    date,
    STRING_AGG(DISTINCT userid) AS IDs
  FROM `consumer.events`
  GROUP BY date
),
T2 AS (
  SELECT
    date,
    STRING_AGG(IDs) OVER (
      ORDER BY UNIX_DATE(date)
      RANGE BETWEEN 90 PRECEDING AND CURRENT ROW
    ) AS IDs
  FROM T1
)
SELECT
  date,
  (SELECT COUNT(DISTINCT userid) FROM UNNEST(SPLIT(IDs)) AS userid) AS NinetyDays
FROM T2
Answer
Counting unique users takes a lot of resources, and even more so over a rolling window. For a scalable solution, look into approximate algorithms like HLL++ (HyperLogLog++), which BigQuery exposes through the HLL_COUNT functions used below.
For an exact count, this would work (but gets slower as the window gets larger):
#standardSQL
SELECT
  DATE_SUB(date, INTERVAL i DAY) date_grp,
  COUNT(DISTINCT owner_user_id) unique_90_day_users,
  COUNT(DISTINCT IF(i < 31, owner_user_id, NULL)) unique_30_day_users,
  COUNT(DISTINCT IF(i < 8, owner_user_id, NULL)) unique_7_day_users
FROM (
  SELECT DATE(creation_date) date, owner_user_id
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date) = 2017
  GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
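The key move is the cross join against UNNEST(GENERATE_ARRAY(1, 90)): every (date, user) row is fanned out into the 90 window dates it contributes to, so a plain GROUP BY plus COUNT(DISTINCT ...) replaces the window function entirely. A minimal sketch of the fan-out on a single inline row (the date and ID are made up):

#standardSQL
-- One made-up activity row...
WITH activity AS (
  SELECT DATE '2017-01-10' AS date, 123 AS owner_user_id
)
-- ...expands into 90 rows, one per window date (2016-10-12 .. 2017-01-09),
-- so the user's activity on 2017-01-10 is counted once in the 90-day
-- window of each of those dates; grouping by date_grp then aggregates
-- each window without any window functions.
SELECT DATE_SUB(date, INTERVAL i DAY) AS date_grp, owner_user_id, i
FROM activity, UNNEST(GENERATE_ARRAY(1, 90)) i
ORDER BY date_grp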
The approximate solution produces results far faster (14 s vs. 366 s, though the results are approximate):
#standardSQL
SELECT
  DATE_SUB(date, INTERVAL i DAY) date_grp,
  HLL_COUNT.MERGE(sketch) unique_90_day_users,
  HLL_COUNT.MERGE(DISTINCT IF(i < 31, sketch, NULL)) unique_30_day_users,
  HLL_COUNT.MERGE(DISTINCT IF(i < 8, sketch, NULL)) unique_7_day_users
FROM (
  SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date) = 2017
  GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
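This is fast because HLL_COUNT.INIT builds one small, fixed-size HyperLogLog++ sketch per day, and HLL_COUNT.MERGE combines the per-day sketches across each window, so only compact sketches are shuffled instead of giant concatenated ID strings. A minimal, self-contained sketch of that pattern (the inline table and values are made up):

#standardSQL
-- Made-up inline data: two days of activity for two users.
WITH daily AS (
  SELECT day, HLL_COUNT.INIT(userid) AS sketch
  FROM UNNEST([
    STRUCT(DATE '2017-01-01' AS day, 'a' AS userid),
    STRUCT(DATE '2017-01-01' AS day, 'b' AS userid),
    STRUCT(DATE '2017-01-02' AS day, 'a' AS userid)
  ])
  GROUP BY day
)
-- Merging the two per-day sketches estimates 2 distinct users overall.
SELECT HLL_COUNT.MERGE(sketch) AS approx_unique_users
FROM daily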
Updated query that gives correct results: it removes the rows whose window covers fewer than 90 days (this check is only valid when no dates are missing from the data):
#standardSQL
SELECT
  DATE_SUB(date, INTERVAL i DAY) date_grp,
  HLL_COUNT.MERGE(sketch) unique_90_day_users,
  HLL_COUNT.MERGE(DISTINCT IF(i < 31, sketch, NULL)) unique_30_day_users,
  HLL_COUNT.MERGE(DISTINCT IF(i < 8, sketch, NULL)) unique_7_day_users,
  COUNT(*) window_days
FROM (
  SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date) = 2017
  GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days = 90
ORDER BY date_grp
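If the source data can have gaps in its dates, one possible fix (a hedged sketch, not part of the original answer) is to left-join the daily sketches onto a complete calendar built with GENERATE_DATE_ARRAY, so every day still contributes a row to COUNT(*) and the window_days check stays valid; shown here for the 90-day window only:

#standardSQL
-- Hypothetical variant: a full calendar makes missing days count toward
-- window_days. Table and column names follow the queries above.
WITH daily AS (
  SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date) = 2017
  GROUP BY 1
),
calendar AS (
  SELECT day AS date
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2017-01-01', DATE '2017-12-31')) day
)
SELECT
  DATE_SUB(date, INTERVAL i DAY) date_grp,
  HLL_COUNT.MERGE(d.sketch) unique_90_day_users,  -- NULL sketches (empty days) are ignored
  COUNT(*) window_days
FROM calendar c
LEFT JOIN daily d USING (date),
UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days = 90
ORDER BY date_grp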