SQL 30 day active user query

I have a table of users and how many events they fired on a given date:

DATE USERID EVENTS
2021-08-27 1 5
2021-07-25 1 7
2021-07-23 2 3
2021-07-20 3 9
2021-06-22 1 9
2021-05-05 1 4
2021-05-05 2 2
2021-05-05 3 6
2021-05-05 4 8
2021-05-05 5 1

I want to create a table showing the number of active users for each date, with an active user defined as someone who has fired an event on the given date or in any of the preceding 30 days.

DATE ACTIVE_USERS
2021-08-27 1
2021-07-25 3
2021-07-23 2
2021-07-20 2
2021-06-22 1
2021-05-05 5

I tried the following query which returned only the users who were active on the specified date:

SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;

I also tried using a window function with ROWS BETWEEN, but it seems to give the same result:

SELECT
    DATE,
    SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
    DATE,
    CASE
        WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
        ELSE 0
    END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1

I’m using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.

Answer

This is tricky to do with window functions, because count(distinct) is not permitted. You can use a self-join:

select t.date, count(distinct t2.userid) as active_users
from table t join
     table t2
     on t2.date <= t.date and
        t2.date > t.date - interval '30 day'
group by t.date;
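
As a rough check, the self-join can be run against the sample data from the question. The sketch below uses Python's sqlite3 module as a stand-in for Snowflake. The table name events and the SQLite date(..., '-30 days') call are assumptions made for this illustration and are not Snowflake syntax.

```python
import sqlite3

# Sample rows from the question: (date, userid, events).
ROWS = [
    ("2021-08-27", 1, 5), ("2021-07-25", 1, 7), ("2021-07-23", 2, 3),
    ("2021-07-20", 3, 9), ("2021-06-22", 1, 9), ("2021-05-05", 1, 4),
    ("2021-05-05", 2, 2), ("2021-05-05", 3, 6), ("2021-05-05", 4, 8),
    ("2021-05-05", 5, 1),
]


def active_users_self_join(rows):
    """Count distinct users with an event in the 30 days ending on each date."""
    con = sqlite3.connect(":memory:")
    # "events" is a hypothetical table name; dates are stored as ISO strings,
    # which compare correctly as text.
    con.execute("CREATE TABLE events (date TEXT, userid INTEGER, events INTEGER)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    query = """
        SELECT t.date, COUNT(DISTINCT t2.userid) AS active_users
        FROM events t
        JOIN events t2
          ON t2.date <= t.date
         AND t2.date > date(t.date, '-30 days')  -- SQLite date arithmetic
        GROUP BY t.date
        ORDER BY t.date DESC
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


RESULT = active_users_self_join(ROWS)
for row in RESULT:
    print(row)
```

On the sample data this gives the same counts as the desired output in the question.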

However, that can be expensive. One solution is to “unpivot” the data. That is, do an incremental count per user of going “in” and “out” of active states and then do a cumulative sum:

with d as (  -- calculate the dates with "ins" and "outs"
      select userid, date, +1 as inc
      from table
      union all
      select userid, date + interval '30 day', -1 as inc
      from table
     ),
     d2 as (  -- accumulate to get the net actives per day
      select date, userid, sum(inc) as change_on_day,
             sum(sum(inc)) over (partition by userid order by date) as running_inc
      from d
      group by date, userid
     ),
     d3 as (  -- summarize into active periods; a period ends on the row where running_inc returns to 0
      select userid, min(date) as start_date, max(date) as end_date
      from (select d2.*,
                   -- count the periods closed before this row, so a closing row stays in its own period
                   sum(case when running_inc = 0 then 1 else 0 end) over (partition by userid order by date)
                     - (case when running_inc = 0 then 1 else 0 end) as active_period
            from d2
           ) d2
      group by userid, active_period
     )
select d.date, count(d3.userid) as active_users
from (select distinct date from table) d left join
     d3
     on d.date >= d3.start_date and d.date < d3.end_date
group by d.date;
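
The "in" and "out" idea may be easier to follow outside SQL. The following is a minimal Python sketch of the same technique, not a translation of the query: each event adds +1 on its date and -1 thirty days later, and an active period runs from the point where the running total leaves 0 until it returns to 0. The function names active_periods and active_users_by_date are hypothetical, chosen for this illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (date, userid, events).
ROWS = [
    (date(2021, 8, 27), 1, 5), (date(2021, 7, 25), 1, 7),
    (date(2021, 7, 23), 2, 3), (date(2021, 7, 20), 3, 9),
    (date(2021, 6, 22), 1, 9), (date(2021, 5, 5), 1, 4),
    (date(2021, 5, 5), 2, 2), (date(2021, 5, 5), 3, 6),
    (date(2021, 5, 5), 4, 8), (date(2021, 5, 5), 5, 1),
]


def active_periods(event_dates, window_days=30):
    """Return [start, end) periods during which one user counts as active."""
    changes = defaultdict(int)
    for d in event_dates:
        changes[d] += 1                                  # user goes "in"
        changes[d + timedelta(days=window_days)] -= 1    # user goes "out"
    periods, running, start = [], 0, None
    for d in sorted(changes):
        if running == 0:
            start = d            # running total is about to leave 0
        running += changes[d]
        if running == 0:
            periods.append((start, d))   # back to 0: the period is closed
    return periods


def active_users_by_date(rows):
    """Count users whose active periods cover each date present in rows."""
    dates_by_user = defaultdict(list)
    for d, userid, _events in rows:
        dates_by_user[userid].append(d)
    periods = {u: active_periods(ds) for u, ds in dates_by_user.items()}
    report_dates = sorted({d for d, _, _ in rows}, reverse=True)
    return [
        (d, sum(any(s <= d < e for s, e in ps) for ps in periods.values()))
        for d in report_dates
    ]


USER1_PERIODS = active_periods([d for d, u, _ in ROWS if u == 1])
RESULT = active_users_by_date(ROWS)
```

For user 1 the sketch yields four separate active periods, which is why the SQL version has to group by both the user and the period number rather than by the user alone.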
User contributions licensed under: CC BY-SA