Skip to content
Advertisement

Window Function for First Instance in Repeating Time Periods

I’m trying to translate the logic below into one or more SQL statements:

  1. Find the first occurrence of a given event for a user
  2. If the user has any additional events on that day or 30 days after, exclude those events. Keep only the first event.
  3. 30 days after the original event, you may consider new events, but the original logic applies. Take only the first new event, then exclude any other in the 30 days after the second event.
  4. Repeat for all events

Finding the first event is easy, as is excluding all events that occur in the 30 days after it. I’m having trouble trying to identify all the following windows/events after the first without looping or scripting.

My solution needs to support multiple RDBMS (MSSQL and Spark SQL at this time, potentially more), so the solution needs to be as close to standard SQL as possible, no platform-specific scripting. I also would like to avoid correlated subqueries if possible for performance reasons. But if that’s the only way to solve the problem, I’m open to that as a solution.

Example data:

UserID  EventDate
1       2022-01-02 
1       2022-01-19 
1       2022-02-01 
1       2022-02-07 
1       2022-02-08 
1       2022-03-19 
2       2022-01-04
2       2022-01-05
2       2022-01-06
2       2022-02-22

Desired output:

UserID  EventDate   Include
1       2022-01-02  1
1       2022-01-19  0
1       2022-02-01  0
1       2022-02-07  1
1       2022-02-08  0
1       2022-03-19  1
2       2022-01-04  1
2       2022-01-05  0
2       2022-01-06  0
2       2022-02-22  1

Or:

UserID  EventDate
1       2022-01-02
1       2022-02-07
1       2022-03-19
2       2022-01-04
2       2022-02-22

Advertisement

Answer

This should work on most DB.

drop table if exists #have;

create table #have
(
  UserID   [int]
, date     [date]
)
;

insert into #have
values
  (1, '2022-01-02')
, (1, '2022-01-19')
, (1, '2022-02-01')
, (1, '2022-02-07')
, (1, '2022-02-08')
, (1, '2022-03-19')
, (2, '2022-01-04')
, (2, '2022-01-05')
, (2, '2022-01-06')
, (2, '2022-02-22')
;

with c1 as
(
select *
     , isnull(datediff(day, lag(date, 1) over(partition by UserID 
                          order by date), date), 0) as diff 

from #have
)
,
c2 as
(
select *, sum(diff) over(partition by UserID 
                         order by date) as s
        , sum(diff) over(partition by UserID 
                         order by date) / 31 + 1 as thirty
from c1
)
,
c3 as
(
select UserID
     , min(date) as date
from c2
group by UserID, thirty
)

select * from c3 
order by UserID, date
;
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement