Skip to content
Advertisement

Group rows by datediff and then use diff function

I have table in PostgreSQL with timestamp column created_at and integer column user_id.

id | created_at | user_id
1 | 2019-10-14 09:26:53.813 | 1
2 | 2019-10-14 09:26:54.813 | 1
3 | 2019-10-14 09:46:53.813 | 1
4 | 2019-10-14 09:46:54.813 | 2
5 | 2019-10-14 09:46:55.813 | 1
6 | 2019-10-14 09:46:56.813 | 1
7 | 2019-10-14 09:46:57.813 | 2

Each row represents some action from user. I need to calculate average user session length. Session is defined as group of actions with time difference less than 10 minutes. When there is difference 10 minutes or more between two user actions, new session started.

I was thinking about:

  1. Group rows by user_id.
  2. Group rows (somehow) by difference between current and next item for single user. (to get sessions)
  3. Calculate diff between first and last row in one group (session). (to get session lengths)
  4. Calculate average session length. (to get average session length)

But I am not able to write it in SQL.

Could you give me some advice/example how can be this done in SQL?

Advertisement

Answer

step-by-step demo:db<>fiddle

SELECT 
    user_id,
    AVG(diff)
FROM (
    SELECT DISTINCT
        user_id,
        group_id,
        first_value(created_at) OVER (PARTITION BY user_id, group_id ORDER BY created_at DESC)
            - first_value(created_at) OVER (PARTITION BY user_id, group_id ORDER BY created_at) as diff
    FROM (
        SELECT
            id, created_at, user_id,
            SUM(group_id) OVER (PARTITION BY user_id ORDER BY created_at) AS group_id
        FROM (
            SELECT
                *,
                (created_at 
                    - lag(created_at, 1, created_at) OVER (PARTITION BY user_id ORDER BY created_at)
                    > interval '10 minutes')::int AS group_id
            FROM
                mytable   
        )s
    )s
)s
GROUP BY user_id
  1. (created_at - lag(created_at, 1, created_at) OVER (PARTITION BY user_id ORDER BY created_at) > interval '10 minutes')::int AS group_id: the lag() window function takes the created_at value from the previous record (second parameter: step length, third parameter: default value if no previous record == current value) within the ordered user_id partition (group). Then the difference between current created_at value an the previous one is calculated. If this value is > 10 minutes, the result is true; false otherwise. This boolean can be casted into an int value, which results in 0 or 1
  2. Cumulative SUM() over the 0/1 values, which results in group_ids for every new session per user_id
  3. The first created_at timestamp per user_id and session group_id can be get by first_value() window function, the last one by first_value() and DESC order. The difference gives you the length of each session. The DISTINCT clause is used, because the difference value is put onto every record. But we need this only once.
  4. Finally, you can group and AVG() this difference for the users.
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement