I have a table of users and how many events they fired on a given date:
DATE | USERID | EVENTS |
---|---|---|
2021-08-27 | 1 | 5 |
2021-07-25 | 1 | 7 |
2021-07-23 | 2 | 3 |
2021-07-20 | 3 | 9 |
2021-06-22 | 1 | 9 |
2021-05-05 | 1 | 4 |
2021-05-05 | 2 | 2 |
2021-05-05 | 3 | 6 |
2021-05-05 | 4 | 8 |
2021-05-05 | 5 | 1 |
I want to create a table showing the number of active users for each date, where an active user is someone who fired an event on that date or on any of the preceding 30 days.
DATE | ACTIVE_USERS |
---|---|
2021-08-27 | 1 |
2021-07-25 | 3 |
2021-07-23 | 2 |
2021-07-20 | 2 |
2021-06-22 | 1 |
2021-05-05 | 5 |
I tried the following query, which returned only the users who were active on the specified date:

    SELECT COUNT(DISTINCT USERID), DATE
    FROM table
    WHERE DATE >= (CURRENT_DATE() - interval '30 days')
    GROUP BY 2
    ORDER BY 2 DESC;
I also tried a window function with ROWS BETWEEN, but it seems to return the same result:

    SELECT DATE, SUM(ACTIVE_USERS) AS ACTIVE_USERS
    FROM (
        SELECT DATE,
               CASE WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE
                                           ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1
                    THEN 1 ELSE 0 END AS ACTIVE_USERS
        FROM table
    )
    GROUP BY 1
    ORDER BY 1
I’m using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.
Answer
This is tricky to do with window functions, because count(distinct) is not permitted there. You can use a self-join:

    select t1.date, count(distinct t2.userid) as active_users
    from table t1
    join table t2
      -- window is (t1.date - 30 days, t1.date]; use >= to also include the 30th preceding day
      on t2.date <= t1.date
     and t2.date > t1.date - interval '30 day'
    group by t1.date;
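To sanity-check the join condition against the sample data in the question, the same logic can be sketched outside the database. This is only an illustrative Python sketch, not Snowflake code: the `rows` list and the `active_users` function are hypothetical names, and the window boundary mirrors the strict `>` used in the self-join.

```python
from datetime import date, timedelta

# Sample rows from the question: (date, userid, events)
rows = [
    (date(2021, 8, 27), 1, 5),
    (date(2021, 7, 25), 1, 7),
    (date(2021, 7, 23), 2, 3),
    (date(2021, 7, 20), 3, 9),
    (date(2021, 6, 22), 1, 9),
    (date(2021, 5, 5), 1, 4),
    (date(2021, 5, 5), 2, 2),
    (date(2021, 5, 5), 3, 6),
    (date(2021, 5, 5), 4, 8),
    (date(2021, 5, 5), 5, 1),
]


def active_users(rows, window_days=30):
    """For each distinct date d, count distinct users with an event in
    (d - window_days, d], which is the same condition as the self-join."""
    result = {}
    for d in {r[0] for r in rows}:
        start = d - timedelta(days=window_days)
        users = {uid for (event_date, uid, _) in rows if start < event_date <= d}
        result[d] = len(users)
    return result


counts = active_users(rows)
for d in sorted(counts, reverse=True):
    print(d.isoformat(), counts[d])
```

On the sample data this reproduces the ACTIVE_USERS column from the question, which suggests the join condition expresses the intended definition.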
However, that can be expensive. One solution is to “unpivot” the data. That is, do an incremental count per user of going “in” and “out” of active states (+1 on each event date, -1 thirty days later) and then do a cumulative sum to find the periods during which each user is active:

    with d as (
        -- calculate the dates with "ins" and "outs"
        select userid, date, 1 as inc from table
        union all
        select userid, date + interval '30 day', -1 as inc from table
    ),
    d2 as (
        -- accumulate to get the net number of open windows per user per day
        select date, userid,
               sum(inc) as change_on_day,
               sum(sum(inc)) over (partition by userid order by date) as running_inc
        from d
        group by date, userid
    ),
    d3 as (
        -- summarize into active periods; the row where running_inc returns to 0
        -- closes the period, so it is kept in the same group as the rows before it
        select userid, min(date) as start_date, max(date) as end_date
        from (select d2.*,
                     sum(case when running_inc = 0 then 1 else 0 end)
                         over (partition by userid order by date)
                     - case when running_inc = 0 then 1 else 0 end as active_period
              from d2
             ) d2
        group by userid, active_period
    )
    select dates.date, count(d3.userid) as active_users
    from (select distinct date from table) dates
    left join d3
      on dates.date >= d3.start_date
     and dates.date < d3.end_date
    group by dates.date
    order by dates.date desc;
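The in/out idea can be easier to follow in a procedural form. The following is a minimal Python sketch of the same technique under stated assumptions, not a translation of the query: `active_periods` and `active_users` are hypothetical helper names, periods are treated as half-open intervals `[start, end)`, and the question's sample data is hard-coded without the EVENTS column.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample (date, userid) pairs from the question
rows = [
    (date(2021, 8, 27), 1), (date(2021, 7, 25), 1), (date(2021, 7, 23), 2),
    (date(2021, 7, 20), 3), (date(2021, 6, 22), 1), (date(2021, 5, 5), 1),
    (date(2021, 5, 5), 2), (date(2021, 5, 5), 3), (date(2021, 5, 5), 4),
    (date(2021, 5, 5), 5),
]


def active_periods(rows, window_days=30):
    """Return {userid: [(start, end), ...]} with half-open periods [start, end)."""
    # +1 on the event date ("in"), -1 once the window expires ("out")
    deltas = defaultdict(lambda: defaultdict(int))
    for d, uid in rows:
        deltas[uid][d] += 1
        deltas[uid][d + timedelta(days=window_days)] -= 1

    periods = defaultdict(list)
    for uid, per_day in deltas.items():
        running, start = 0, None
        for d in sorted(per_day):
            previous = running
            running += per_day[d]  # cumulative sum of ins and outs
            if previous == 0 and running > 0:
                start = d  # user becomes active
            elif previous > 0 and running == 0:
                periods[uid].append((start, d))  # user becomes inactive
    return periods


def active_users(rows, window_days=30):
    """Count, for each distinct date, the users with a period covering it."""
    periods = active_periods(rows, window_days)
    return {
        d: sum(any(s <= d < e for s, e in p) for p in periods.values())
        for d in {r[0] for r in rows}
    }


periods = active_periods(rows)
counts = active_users(rows)
for d in sorted(counts, reverse=True):
    print(d.isoformat(), counts[d])
```

Each user ends up with a short list of periods rather than one row per event, which is why the final join against the distinct dates tends to be cheaper than joining the raw table to itself.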