I got a search sessions log that looks like this:
+----------+-------------------------+----------+ | dt | search_time | searches | +----------+-------------------------+----------+ | 20200601 | 2020-06-01 00:36:38.000 | 1 | | 20200601 | 2020-06-01 00:37:38.000 | 1 | | 20200601 | 2020-06-01 00:39:18.000 | 1 | | 20200601 | 2020-06-01 01:16:18.000 | 1 | | 20200601 | 2020-06-01 03:56:38.000 | 1 | | 20200601 | 2020-06-01 05:36:38.000 | 1 | | 20200601 | 2020-06-01 05:37:38.000 | 1 | | 20200601 | 2020-06-01 05:39:38.000 | 1 | | 20200601 | 2020-06-01 05:41:38.000 | 1 | | 20200601 | 2020-06-01 07:26:38.000 | 1 | +----------+-------------------------+----------+
My task is to partition each row into session groups. Session groups are up to five minutes.
For example:
Those TOP 3 sessions will form a group session 1 – if we accumulate the minutes between each row, we will get 3 minutes and the 4th would accumulate to more then 5 minutes so it will be a different session group.
+----------+-------------------------+----------+---------------+ | dt | search_time | searches | group_session | +----------+-------------------------+----------+---------------+ | 20200601 | 2020-06-01 00:36:38.000 | 1 | 1 | | 20200601 | 2020-06-01 00:37:38.000 | 1 | 1 | | 20200601 | 2020-06-01 00:39:18.000 | 1 | 1 | | 20200601 | 2020-06-01 01:16:18.000 | 1 | 2 | +----------+-------------------------+----------+---------------+
I manipulated the table like this in order to get it ready for partitioning:
WITH [Sub Table] AS ( SELECT [dt] ,[search_time] ,[pervious search time] = LAG(search_time) OVER (ORDER BY search_time) ,[min diff] = ISNULL(DATEDIFF(MINUTE,LAG(search_time) OVER (ORDER BY search_time),search_time),0) ,[searches] FROM [search_session] ) SELECT [dt], [search_time], [pervious search time], [min diff], [searches] FROM [Sub Table]
And got this:
+----------+-------------------------+-------------------------+----------+----------+ | dt | search_time | pervious search time | min diff | searches | +----------+-------------------------+-------------------------+----------+----------+ | 20200601 | 2020-06-01 00:36:38.000 | NULL | 0 | 1 | | 20200601 | 2020-06-01 00:37:38.000 | 2020-06-01 00:36:38.000 | 1 | 1 | | 20200601 | 2020-06-01 00:39:18.000 | 2020-06-01 00:37:38.000 | 2 | 1 | | 20200601 | 2020-06-01 01:16:18.000 | 2020-06-01 00:39:18.000 | 37 | 1 | | 20200601 | 2020-06-01 03:56:38.000 | 2020-06-01 01:16:18.000 | 160 | 1 | | 20200601 | 2020-06-01 05:36:38.000 | 2020-06-01 03:56:38.000 | 100 | 1 | | 20200601 | 2020-06-01 05:37:38.000 | 2020-06-01 05:36:38.000 | 1 | 1 | | 20200601 | 2020-06-01 05:39:38.000 | 2020-06-01 05:37:38.000 | 2 | 1 | | 20200601 | 2020-06-01 05:41:38.000 | 2020-06-01 05:39:38.000 | 2 | 1 | | 20200601 | 2020-06-01 07:26:38.000 | 2020-06-01 05:41:38.000 | 105 | 1 | +----------+-------------------------+-------------------------+----------+----------+
I thought about two possibilities to continue:
Using a window function, like RANK(), I can partition the rows, but I can’t figure out how to set the PARTITION BY caluse with a condition to do so.
To iterate the table with a WHILE loop – again finding hard time to form ths
Advertisement
Answer
This cannot be done with just window functions. You need some kind of iterative process, that keeps track of the first row of each group, and dynamically identififes the next one.
In SQL, you can phrase this with a recursive query:
with data as (select t.*, row_number() over(order by search_time) rn from mytable t), cte as ( select d.*, search_time as first_search_time from data d where rn = 1 union all select d.*, case when d.search_time > dateadd(minute, 5, c.first_search_time) then d.search_time else c.first_search_time end from cte c inner join data d on d.rn = c.rn + 1 ) select c.*, dense_rank() over(order by first_search_time) grp from cte c
For your sample data, this returns:
dt | search_time | searches | rn | first_search_time | grp :--------- | :---------------------- | -------: | -: | :---------------------- | --: 2020-06-01 | 2020-06-01 00:36:38.000 | 1 | 1 | 2020-06-01 00:36:38.000 | 1 2020-06-01 | 2020-06-01 00:37:38.000 | 1 | 2 | 2020-06-01 00:36:38.000 | 1 2020-06-01 | 2020-06-01 00:39:18.000 | 1 | 3 | 2020-06-01 00:36:38.000 | 1 2020-06-01 | 2020-06-01 01:16:18.000 | 1 | 4 | 2020-06-01 01:16:18.000 | 2 2020-06-01 | 2020-06-01 03:56:38.000 | 1 | 5 | 2020-06-01 03:56:38.000 | 3 2020-06-01 | 2020-06-01 05:36:38.000 | 1 | 6 | 2020-06-01 05:36:38.000 | 4 2020-06-01 | 2020-06-01 05:37:38.000 | 1 | 7 | 2020-06-01 05:36:38.000 | 4 2020-06-01 | 2020-06-01 05:39:38.000 | 1 | 8 | 2020-06-01 05:36:38.000 | 4 2020-06-01 | 2020-06-01 05:41:38.000 | 1 | 9 | 2020-06-01 05:36:38.000 | 4 2020-06-01 | 2020-06-01 07:26:38.000 | 1 | 10 | 2020-06-01 07:26:38.000 | 5