I have a table with playid, userid, created, and stopped. I want to merge certain rows by keeping the earliest created and the latest stopped within 120 minutes from the earliest created.
See the following table:
playid | userid | created | stopped |
---|---|---|---|
1 | a01 | 2021-01-30 02:29:58.559858 | 2021-01-30 02:40:56.558436 |
2 | a01 | 2021-01-30 02:41:24.023358 | 2021-01-30 02:55:24.112713 |
3 | a01 | 2021-01-30 02:57:30.178579 | 2021-01-30 03:11:14.866678 |
4 | a01 | 2021-01-30 03:11:41.098424 | 2021-01-30 03:22:50.155918 |
5 | a01 | 2021-01-30 03:23:20.545288 | 2021-01-30 03:36:37.027486 |
6 | a01 | 2021-01-30 03:46:10.237971 | 2021-01-30 03:59:17.526151 |
7 | a01 | 2021-01-30 03:59:57.020326 | 2021-01-30 04:16:25.948311 |
8 | a01 | 2021-01-30 04:26:28.968568 | 2021-01-30 04:30:11.863079 |
9 | a01 | 2021-01-30 04:30:14.43786 | 2021-01-30 04:41:42.231915 |
10 | a01 | 2021-01-30 04:43:17.669945 | 2021-01-30 04:45:10.443101 |
11 | a01 | 2021-01-30 04:45:14.50346 | 2021-01-30 04:47:07.082114 |
12 | a01 | 2021-01-30 04:47:10.334574 | 2021-01-30 04:49:30.96017 |
13 | a01 | 2021-01-30 04:49:34.146011 | 2021-01-30 04:50:54.45988 |
14 | a01 | 2021-01-30 04:50:57.948305 | 2021-01-30 05:11:01.246284 |
15 | a01 | 2021-01-30 05:39:29.387396 | 2021-01-30 05:41:39.508654 |
16 | a01 | 2021-01-30 05:41:44.524951 | 2021-01-30 05:43:38.231266 |
17 | a01 | 2021-01-30 05:43:40.785809 | 2021-01-30 05:54:40.711381 |
18 | a01 | 2021-01-30 05:55:10.851725 | 2021-01-30 05:58:24.262351 |
19 | a01 | 2021-01-30 05:58:29.43821 | 2021-01-30 06:00:50.870644 |
20 | a01 | 2021-01-30 06:00:54.168696 | 2021-01-30 06:12:37.210234 |
21 | a01 | 2021-01-30 14:41:01.529666 | 2021-01-30 14:53:01.349479 |
22 | a01 | 2021-01-30 14:53:48.053136 | 2021-01-30 15:05:09.532786 |
23 | a01 | 2021-01-31 15:11:08.547111 | 2021-01-31 15:11:37.564802 |
24 | a01 | 2021-01-31 15:11:40.248499 | 2021-01-31 15:13:40.21787 |
25 | a01 | 2021-01-31 15:13:59.373145 | 2021-01-31 15:31:54.099898 |
26 | a01 | 2021-01-31 15:32:23.20448 | 2021-01-31 15:46:33.993751 |
27 | a01 | 2021-01-31 16:55:19.141051 | 2021-01-31 17:03:43.464444 |
I want the query to give this table (see below). Please note that the next created is after the previous stopped, meaning the created with playid “8” is after stopped with playid “1”:
playid | userid | created | stopped |
---|---|---|---|
1 | a01 | 2021-01-30 02:29:58.559858 | 2021-01-30 04:16:25.948311 |
8 | a01 | 2021-01-30 04:26:28.968568 | 2021-01-30 06:12:37.210234 |
21 | a01 | 2021-01-30 14:41:01.529666 | 2021-01-30 15:05:09.532786 |
23 | a01 | 2021-01-31 15:11:08.547111 | 2021-01-31 17:03:43.464444 |
Here is the query I have tried:
SELECT MIN(t1.playid) as playid, t1.userid, SAFE_CAST(t1.created AS DATETIME) as created, MAX(SAFE_CAST(t2.stopped AS DATETIME)) as stopped FROM MyTable as t1 LEFT OUTER JOIN MyTable as t2 ON t1.userid = t2.userid WHERE DATETIME_DIFF(SAFE_CAST(t2.stopped AS DATETIME),SAFE_CAST(t1.created AS DATETIME),MINUTE) < 120 GROUP BY t1.playid, t1.userid, t1.created
and querying the previous table using this query:
SELECT MIN(playid) AS playid, userid, MIN(created) AS created, stopped, FROM MyTable2 GROUP BY userid stopped
This is the (wrong) table I get when I run the first query:
playid | userid | created | stopped |
---|---|---|---|
1 | a01 | 2021-01-30 02:29:58.559858 | 2021-01-30 04:16:25.948311 |
2 | a01 | 2021-01-30 02:41:24.023358 | 2021-01-30 04:30:11.863079 |
3 | a01 | 2021-01-30 02:57:30.178579 | 2021-01-30 04:50:54.45988 |
4 | a01 | 2021-01-30 03:11:41.098424 | 2021-01-30 05:11:01.246284 |
5 | a01 | 2021-01-30 03:23:20.545288 | 2021-01-30 05:11:01.246284 |
6 | a01 | 2021-01-30 03:46:10.237971 | 2021-01-30 05:43:38.231266 |
7 | a01 | 2021-01-30 03:59:57.020326 | 2021-01-30 05:58:24.262351 |
8 | a01 | 2021-01-30 04:26:28.968568 | 2021-01-30 06:12:37.210234 |
9 | a01 | 2021-01-30 04:30:14.43786 | 2021-01-30 06:12:37.210234 |
10 | a01 | 2021-01-30 04:43:17.669945 | 2021-01-30 06:12:37.210234 |
11 | a01 | 2021-01-30 04:45:14.50346 | 2021-01-30 06:12:37.210234 |
12 | a01 | 2021-01-30 04:47:10.334574 | 2021-01-30 06:12:37.210234 |
13 | a01 | 2021-01-30 04:49:34.146011 | 2021-01-30 06:12:37.210234 |
14 | a01 | 2021-01-30 04:50:57.948305 | 2021-01-30 06:12:37.210234 |
15 | a01 | 2021-01-30 05:39:29.387396 | 2021-01-30 06:12:37.210234 |
16 | a01 | 2021-01-30 05:41:44.524951 | 2021-01-30 06:12:37.210234 |
17 | a01 | 2021-01-30 05:43:40.785809 | 2021-01-30 06:12:37.210234 |
18 | a01 | 2021-01-30 05:55:10.851725 | 2021-01-30 06:12:37.210234 |
19 | a01 | 2021-01-30 05:58:29.43821 | 2021-01-30 06:12:37.210234 |
20 | a01 | 2021-01-30 06:00:54.168696 | 2021-01-30 06:12:37.210234 |
21 | a01 | 2021-01-30 14:41:01.529666 | 2021-01-30 15:05:09.532786 |
22 | a01 | 2021-01-30 14:53:48.053136 | 2021-01-30 15:05:09.532786 |
23 | a01 | 2021-01-31 15:11:08.547111 | 2021-01-31 17:03:43.464444 |
24 | a01 | 2021-01-31 15:11:40.248499 | 2021-01-31 17:03:43.464444 |
25 | a01 | 2021-01-31 15:13:59.373145 | 2021-01-31 17:03:43.464444 |
26 | a01 | 2021-01-31 15:32:23.20448 | 2021-01-31 17:03:43.464444 |
27 | a01 | 2021-01-31 16:55:19.141051 | 2021-01-31 17:03:43.464444 |
The problem here is that the column created in the next row does not jump to the next playid with a created value which is larger/later than the stopped value of the current playid. Instead it just iterates through the table. Please see the tables above and use the playid column to make it easier to understand.
PS: I am very new to SQL. Please help me optimize this if this is very ineffective. Would be best to only have one query, of course.
Thank you!
Advertisement
Answer
This needs loops/recursion which is not supported natively in BigQuery SQL. Try user-defined functions, but they are not very scalable:
CREATE TEMP FUNCTION special_merge(x ARRAY<STRUCT<playid INT64, created_at TIMESTAMP, stopped TIMESTAMP>>) RETURNS ARRAY<STRUCT<playid INT64, created_at TIMESTAMP, stopped TIMESTAMP>> LANGUAGE js AS """ var intervalStart = x[0].created_at; var intervalId = x[0].playid; var intervalEnd; var result = []; for (var row of x) { if (row.stopped - intervalStart > 120*60*1000 && intervalEnd) { result.push({playid: intervalId, created_at:intervalStart, stopped:intervalEnd}); intervalStart = row.created_at; intervalId = row.playid } intervalEnd = row.stopped; } result.push({playid: intervalId, created_at:intervalStart, stopped:intervalEnd}); return result; """; with mytable as ( select 1 as playid, 'a01' as userid, timestamp '2021-01-30 02:29:58.559858' as created_at, timestamp '2021-01-30 02:40:56.558436' as stopped union all select 2, 'a01', timestamp '2021-01-30 02:41:24.023358', timestamp '2021-01-30 02:55:24.112713' union all select 7, 'a01', timestamp '2021-01-30 03:59:57.020326', timestamp '2021-01-30 04:16:25.948311' union all select 8, 'a01', timestamp '2021-01-30 04:26:28.968568', timestamp '2021-01-30 04:30:11.863079' union all select 20, 'a01', timestamp '2021-01-30 06:00:54.168696', timestamp '2021-01-30 06:12:37.210234' union all select 21, 'a01', timestamp '2021-01-30 14:41:01.529666', timestamp '2021-01-30 14:53:01.349479' union all select 22, 'a01', timestamp '2021-01-30 14:53:48.053136', timestamp '2021-01-30 15:05:09.532786' union all select 23, 'a01', timestamp '2021-01-31 15:11:08.547111', timestamp '2021-01-31 15:11:37.564802' union all select 27, 'a01', timestamp '2021-01-31 16:55:19.141051', timestamp '2021-01-31 17:03:43.464444' ) select userid, m.* from ( select userid, special_merge(array_agg(STRUCT(playid, created_at, stopped) order by created_at)) as merged from mytable group by userid ), unnest(merged) as m order by userid, m.playid