Skip to content
Advertisement

Partitioning rows into groups by accumulative time interval

I got a search sessions log that looks like this:

+----------+-------------------------+----------+
|    dt    |       search_time       | searches |
+----------+-------------------------+----------+
| 20200601 | 2020-06-01 00:36:38.000 |        1 |
| 20200601 | 2020-06-01 00:37:38.000 |        1 |
| 20200601 | 2020-06-01 00:39:18.000 |        1 |
| 20200601 | 2020-06-01 01:16:18.000 |        1 |
| 20200601 | 2020-06-01 03:56:38.000 |        1 |
| 20200601 | 2020-06-01 05:36:38.000 |        1 |
| 20200601 | 2020-06-01 05:37:38.000 |        1 |
| 20200601 | 2020-06-01 05:39:38.000 |        1 |
| 20200601 | 2020-06-01 05:41:38.000 |        1 |
| 20200601 | 2020-06-01 07:26:38.000 |        1 |
+----------+-------------------------+----------+

My task is to partition each row into session groups. Session groups are up to five minutes.

For example:

Those TOP 3 sessions will form a group session 1 – if we accumulate the minutes between each row, we will get 3 minutes and the 4th would accumulate to more then 5 minutes so it will be a different session group.

+----------+-------------------------+----------+---------------+
|    dt    |       search_time       | searches | group_session |
+----------+-------------------------+----------+---------------+
| 20200601 | 2020-06-01 00:36:38.000 |        1 |             1 |
| 20200601 | 2020-06-01 00:37:38.000 |        1 |             1 |
| 20200601 | 2020-06-01 00:39:18.000 |        1 |             1 |
| 20200601 | 2020-06-01 01:16:18.000 |        1 |             2 |
+----------+-------------------------+----------+---------------+

I manipulated the table like this in order to get it ready for partitioning:

WITH [Sub Table] AS
(

SELECT   [dt]
        ,[search_time]
        ,[pervious search time] = LAG(search_time) OVER (ORDER BY search_time)
        ,[min diff] = ISNULL(DATEDIFF(MINUTE,LAG(search_time) OVER (ORDER BY search_time),search_time),0)
        ,[searches]
FROM     [search_session]
)
SELECT
    [dt],
    [search_time],
    [pervious search time],
    [min diff],
    [searches]
FROM [Sub Table]

And got this:

+----------+-------------------------+-------------------------+----------+----------+
|    dt    |       search_time       |  pervious search time   | min diff | searches |
+----------+-------------------------+-------------------------+----------+----------+
| 20200601 | 2020-06-01 00:36:38.000 | NULL                    |        0 |        1 |
| 20200601 | 2020-06-01 00:37:38.000 | 2020-06-01 00:36:38.000 |        1 |        1 |
| 20200601 | 2020-06-01 00:39:18.000 | 2020-06-01 00:37:38.000 |        2 |        1 |
| 20200601 | 2020-06-01 01:16:18.000 | 2020-06-01 00:39:18.000 |       37 |        1 |
| 20200601 | 2020-06-01 03:56:38.000 | 2020-06-01 01:16:18.000 |      160 |        1 |
| 20200601 | 2020-06-01 05:36:38.000 | 2020-06-01 03:56:38.000 |      100 |        1 |
| 20200601 | 2020-06-01 05:37:38.000 | 2020-06-01 05:36:38.000 |        1 |        1 |
| 20200601 | 2020-06-01 05:39:38.000 | 2020-06-01 05:37:38.000 |        2 |        1 |
| 20200601 | 2020-06-01 05:41:38.000 | 2020-06-01 05:39:38.000 |        2 |        1 |
| 20200601 | 2020-06-01 07:26:38.000 | 2020-06-01 05:41:38.000 |      105 |        1 |
+----------+-------------------------+-------------------------+----------+----------+

I thought about two possibilities to continue:

  1. Using a window function, like RANK(), I can partition the rows, but I can’t figure out how to set the PARTITION BY caluse with a condition to do so.

  2. To iterate the table with a WHILE loop – again finding hard time to form ths

Advertisement

Answer

This cannot be done with just window functions. You need some kind of iterative process, that keeps track of the first row of each group, and dynamically identififes the next one.

In SQL, you can phrase this with a recursive query:

with 
    data as (select t.*, row_number() over(order by search_time) rn from mytable t),
    cte as (
        select d.*, search_time as first_search_time 
        from data d
        where rn = 1
        union all 
        select d.*, 
            case when d.search_time > dateadd(minute, 5, c.first_search_time)
                then d.search_time
                else c.first_search_time
            end
        from cte c 
        inner join data d on d.rn = c.rn + 1
    )
select c.*, dense_rank() over(order by first_search_time) grp 
from cte c

For your sample data, this returns:

dt         | search_time             | searches | rn | first_search_time       | grp
:--------- | :---------------------- | -------: | -: | :---------------------- | --:
2020-06-01 | 2020-06-01 00:36:38.000 |        1 |  1 | 2020-06-01 00:36:38.000 |   1
2020-06-01 | 2020-06-01 00:37:38.000 |        1 |  2 | 2020-06-01 00:36:38.000 |   1
2020-06-01 | 2020-06-01 00:39:18.000 |        1 |  3 | 2020-06-01 00:36:38.000 |   1
2020-06-01 | 2020-06-01 01:16:18.000 |        1 |  4 | 2020-06-01 01:16:18.000 |   2
2020-06-01 | 2020-06-01 03:56:38.000 |        1 |  5 | 2020-06-01 03:56:38.000 |   3
2020-06-01 | 2020-06-01 05:36:38.000 |        1 |  6 | 2020-06-01 05:36:38.000 |   4
2020-06-01 | 2020-06-01 05:37:38.000 |        1 |  7 | 2020-06-01 05:36:38.000 |   4
2020-06-01 | 2020-06-01 05:39:38.000 |        1 |  8 | 2020-06-01 05:36:38.000 |   4
2020-06-01 | 2020-06-01 05:41:38.000 |        1 |  9 | 2020-06-01 05:36:38.000 |   4
2020-06-01 | 2020-06-01 07:26:38.000 |        1 | 10 | 2020-06-01 07:26:38.000 |   5
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement