I’m trying to translate the logic below into one or more SQL statements:
- Find the first occurrence of a given event for a user
- If the user has any additional events on that day or 30 days after, exclude those events. Keep only the first event.
- 30 days after the original event, you may consider new events, but the original logic applies. Take only the first new event, then exclude any other in the 30 days after the second event.
- Repeat for all events
Finding the first event is easy, as is excluding all events that occur in the 30 days after it. I’m having trouble trying to identify all the following windows/events after the first without looping or scripting.
My solution needs to support multiple RDBMS (MSSQL and Spark SQL at this time, potentially more), so the solution needs to be as close to standard SQL as possible, no platform-specific scripting. I also would like to avoid correlated subqueries if possible for performance reasons. But if that’s the only way to solve the problem, I’m open to that as a solution.
Example data:
UserID EventDate 1 2022-01-02 1 2022-01-19 1 2022-02-01 1 2022-02-07 1 2022-02-08 1 2022-03-19 2 2022-01-04 2 2022-01-05 2 2022-01-06 2 2022-02-22
Desired output:
UserID EventDate Include 1 2022-01-02 1 1 2022-01-19 0 1 2022-02-01 0 1 2022-02-07 1 1 2022-02-08 0 1 2022-03-19 1 2 2022-01-04 1 2 2022-01-05 0 2 2022-01-06 0 2 2022-02-22 1
Or:
UserID EventDate 1 2022-01-02 1 2022-02-07 1 2022-03-19 2 2022-01-04 2 2022-02-22
Advertisement
Answer
This should work on most DB.
drop table if exists #have; create table #have ( UserID [int] , date [date] ) ; insert into #have values (1, '2022-01-02') , (1, '2022-01-19') , (1, '2022-02-01') , (1, '2022-02-07') , (1, '2022-02-08') , (1, '2022-03-19') , (2, '2022-01-04') , (2, '2022-01-05') , (2, '2022-01-06') , (2, '2022-02-22') ; with c1 as ( select * , isnull(datediff(day, lag(date, 1) over(partition by UserID order by date), date), 0) as diff from #have ) , c2 as ( select *, sum(diff) over(partition by UserID order by date) as s , sum(diff) over(partition by UserID order by date) / 31 + 1 as thirty from c1 ) , c3 as ( select UserID , min(date) as date from c2 group by UserID, thirty ) select * from c3 order by UserID, date ;