Skip to content
Advertisement

How to get an array of user_id’s being active at a specific point in time based on their latest event in BigQuery?

I have a problem where I’d like to get all user_id’s in the “active” state for each day. An event is recorded only when the state changes for a user. The state for each user_id should be preserved until an “inactive” event is fired (see example data and outcome). Different users can have their state changed during the same day.

How do I do this? I have tried working with ARRAY_AGG and also grouping the two different events and using lag based on this answer. I get stuck at the phase where I would then need to subtract the user_id’s that get the inactive event from the array.

SELECT DATE("2019-11-11") as date, 1 as user_id, "inactive" as state UNION ALL 
SELECT DATE("2019-11-12"), 2, "active" UNION ALL   
SELECT DATE("2019-11-13"), 1, "active" UNION ALL 
SELECT DATE("2019-11-14"), 1, "inactive" UNION ALL
SELECT DATE("2019-11-14"), 3, "active" UNION ALL 
SELECT DATE("2019-11-15"), 2, "inactive"

Desired output:

date      | active_users

2019-11-11| []
2019-11-12| [2]
2019-11-13| [2,1]
2019-11-14| [2,3]
2019-11-15| [3]

Appreciate the all and any help I can get!

Advertisement

Answer

One method is to generate a series and aggregate. First, get the range of days for activity:

select t.*, date_add(next_inactive, interval -1 day)
from (select t.*,
             min(case when status = 'inactive' then date end) over (partition by user_id order by date desc) as next_inactive,
             max(date) over () as max_date
      from t 
     ) t
where state = 'active'

Then generate the dates and aggregate:

select day, array_agg(user_id)
from (select t.*, date_add(next_inactive, interval -1 day) s last_active_date
      from (select t.*,
                   min(case when status = 'inactive' then date end) over (partition by user_id order by date desc) as next_inactive,
                   max(date) over () as max_date
            from t 
           ) t
      where state = 'active'
     ) t cross join
     unnest(generate_date_array(date, coalesce(last_active_date, max_date), interval 1 day) day
group by day;
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement