Skip to content
Advertisement

How to merge different rows in SQL

I have a table with playid, userid, created, and stopped. I want to merge certain rows by keeping the earliest created and the latest stopped within 120 minutes from the earliest created.

See the following table:

playid userid created stopped
1 a01 2021-01-30 02:29:58.559858 2021-01-30 02:40:56.558436
2 a01 2021-01-30 02:41:24.023358 2021-01-30 02:55:24.112713
3 a01 2021-01-30 02:57:30.178579 2021-01-30 03:11:14.866678
4 a01 2021-01-30 03:11:41.098424 2021-01-30 03:22:50.155918
5 a01 2021-01-30 03:23:20.545288 2021-01-30 03:36:37.027486
6 a01 2021-01-30 03:46:10.237971 2021-01-30 03:59:17.526151
7 a01 2021-01-30 03:59:57.020326 2021-01-30 04:16:25.948311
8 a01 2021-01-30 04:26:28.968568 2021-01-30 04:30:11.863079
9 a01 2021-01-30 04:30:14.43786 2021-01-30 04:41:42.231915
10 a01 2021-01-30 04:43:17.669945 2021-01-30 04:45:10.443101
11 a01 2021-01-30 04:45:14.50346 2021-01-30 04:47:07.082114
12 a01 2021-01-30 04:47:10.334574 2021-01-30 04:49:30.96017
13 a01 2021-01-30 04:49:34.146011 2021-01-30 04:50:54.45988
14 a01 2021-01-30 04:50:57.948305 2021-01-30 05:11:01.246284
15 a01 2021-01-30 05:39:29.387396 2021-01-30 05:41:39.508654
16 a01 2021-01-30 05:41:44.524951 2021-01-30 05:43:38.231266
17 a01 2021-01-30 05:43:40.785809 2021-01-30 05:54:40.711381
18 a01 2021-01-30 05:55:10.851725 2021-01-30 05:58:24.262351
19 a01 2021-01-30 05:58:29.43821 2021-01-30 06:00:50.870644
20 a01 2021-01-30 06:00:54.168696 2021-01-30 06:12:37.210234
21 a01 2021-01-30 14:41:01.529666 2021-01-30 14:53:01.349479
22 a01 2021-01-30 14:53:48.053136 2021-01-30 15:05:09.532786
23 a01 2021-01-31 15:11:08.547111 2021-01-31 15:11:37.564802
24 a01 2021-01-31 15:11:40.248499 2021-01-31 15:13:40.21787
25 a01 2021-01-31 15:13:59.373145 2021-01-31 15:31:54.099898
26 a01 2021-01-31 15:32:23.20448 2021-01-31 15:46:33.993751
27 a01 2021-01-31 16:55:19.141051 2021-01-31 17:03:43.464444

I want the query to give this table (see below). Please note that the next created is after the previous stopped, meaning the created with playid “8” is after stopped with playid “1”:

playid userid created stopped
1 a01 2021-01-30 02:29:58.559858 2021-01-30 04:16:25.948311
8 a01 2021-01-30 04:26:28.968568 2021-01-30 06:12:37.210234
21 a01 2021-01-30 14:41:01.529666 2021-01-30 15:05:09.532786
23 a01 2021-01-31 15:11:08.547111 2021-01-31 17:03:43.464444

Here is the query I have tried:

SELECT
    MIN(t1.playid) as playid,
    t1.userid,
    SAFE_CAST(t1.created AS DATETIME) as created,
    MAX(SAFE_CAST(t2.stopped AS DATETIME)) as stopped
FROM
    MyTable as t1
LEFT OUTER JOIN
    MyTable as t2
        ON t1.userid = t2.userid
WHERE
    DATETIME_DIFF(SAFE_CAST(t2.stopped AS DATETIME),SAFE_CAST(t1.created AS DATETIME),MINUTE) 
    < 120
GROUP BY
    t1.playid, t1.userid, t1.created

and querying the previous table using this query:

SELECT
    MIN(playid) AS playid,
    userid,
    MIN(created) AS created,
    stopped,
FROM
    MyTable2
GROUP BY
    userid
    stopped

This is the (wrong) table I get when I run the first query:

playid userid created stopped
1 a01 2021-01-30 02:29:58.559858 2021-01-30 04:16:25.948311
2 a01 2021-01-30 02:41:24.023358 2021-01-30 04:30:11.863079
3 a01 2021-01-30 02:57:30.178579 2021-01-30 04:50:54.45988
4 a01 2021-01-30 03:11:41.098424 2021-01-30 05:11:01.246284
5 a01 2021-01-30 03:23:20.545288 2021-01-30 05:11:01.246284
6 a01 2021-01-30 03:46:10.237971 2021-01-30 05:43:38.231266
7 a01 2021-01-30 03:59:57.020326 2021-01-30 05:58:24.262351
8 a01 2021-01-30 04:26:28.968568 2021-01-30 06:12:37.210234
9 a01 2021-01-30 04:30:14.43786 2021-01-30 06:12:37.210234
10 a01 2021-01-30 04:43:17.669945 2021-01-30 06:12:37.210234
11 a01 2021-01-30 04:45:14.50346 2021-01-30 06:12:37.210234
12 a01 2021-01-30 04:47:10.334574 2021-01-30 06:12:37.210234
13 a01 2021-01-30 04:49:34.146011 2021-01-30 06:12:37.210234
14 a01 2021-01-30 04:50:57.948305 2021-01-30 06:12:37.210234
15 a01 2021-01-30 05:39:29.387396 2021-01-30 06:12:37.210234
16 a01 2021-01-30 05:41:44.524951 2021-01-30 06:12:37.210234
17 a01 2021-01-30 05:43:40.785809 2021-01-30 06:12:37.210234
18 a01 2021-01-30 05:55:10.851725 2021-01-30 06:12:37.210234
19 a01 2021-01-30 05:58:29.43821 2021-01-30 06:12:37.210234
20 a01 2021-01-30 06:00:54.168696 2021-01-30 06:12:37.210234
21 a01 2021-01-30 14:41:01.529666 2021-01-30 15:05:09.532786
22 a01 2021-01-30 14:53:48.053136 2021-01-30 15:05:09.532786
23 a01 2021-01-31 15:11:08.547111 2021-01-31 17:03:43.464444
24 a01 2021-01-31 15:11:40.248499 2021-01-31 17:03:43.464444
25 a01 2021-01-31 15:13:59.373145 2021-01-31 17:03:43.464444
26 a01 2021-01-31 15:32:23.20448 2021-01-31 17:03:43.464444
27 a01 2021-01-31 16:55:19.141051 2021-01-31 17:03:43.464444

The problem here is that the column created in the next row does not jump to the next playid with a created value which is larger/later than the stopped value of the current playid. Instead it just iterates through the table. Please see the tables above and use the playid column to make it easier to understand.

PS: I am very new to SQL. Please help me optimize this if this is very ineffective. Would be best to only have one query, of course.

Thank you!

Advertisement

Answer

This needs loops/recursion which is not supported natively in BigQuery SQL. Try user-defined functions, but they are not very scalable:

CREATE TEMP FUNCTION special_merge(x ARRAY<STRUCT<playid INT64, created_at TIMESTAMP, stopped TIMESTAMP>>)
RETURNS ARRAY<STRUCT<playid INT64, created_at TIMESTAMP, stopped TIMESTAMP>>
LANGUAGE js
AS """
  var intervalStart = x[0].created_at;
  var intervalId = x[0].playid;
  var intervalEnd;
  var result = [];
  for (var row of x)
  {
    if (row.stopped - intervalStart > 120*60*1000 && intervalEnd) {
      result.push({playid: intervalId, created_at:intervalStart, stopped:intervalEnd});
      intervalStart = row.created_at;
      intervalId = row.playid
    }
    intervalEnd = row.stopped;
  }
  result.push({playid: intervalId, created_at:intervalStart, stopped:intervalEnd});
  return result;
""";

with mytable as (
  select 1 as playid, 'a01' as userid, timestamp '2021-01-30 02:29:58.559858' as created_at, timestamp '2021-01-30 02:40:56.558436' as stopped union all
  select 2, 'a01', timestamp '2021-01-30 02:41:24.023358', timestamp '2021-01-30 02:55:24.112713' union all
  select 7, 'a01', timestamp '2021-01-30 03:59:57.020326', timestamp '2021-01-30 04:16:25.948311' union all
  select 8, 'a01', timestamp '2021-01-30 04:26:28.968568', timestamp '2021-01-30 04:30:11.863079' union all
  select 20, 'a01', timestamp '2021-01-30 06:00:54.168696', timestamp '2021-01-30 06:12:37.210234' union all
  select 21, 'a01', timestamp '2021-01-30 14:41:01.529666', timestamp '2021-01-30 14:53:01.349479' union all
  select 22, 'a01', timestamp '2021-01-30 14:53:48.053136', timestamp '2021-01-30 15:05:09.532786' union all
  select 23, 'a01', timestamp '2021-01-31 15:11:08.547111', timestamp '2021-01-31 15:11:37.564802' union all
  select 27, 'a01', timestamp '2021-01-31 16:55:19.141051', timestamp '2021-01-31 17:03:43.464444'
)
select userid, m.*
from (
  select userid, special_merge(array_agg(STRUCT(playid, created_at, stopped) order by created_at)) as merged
  from mytable
  group by userid
), unnest(merged) as m
order by userid, m.playid

enter image description here

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement