Is it possible to “group” rows within BigQuery/SQL depending on column values? Let’s say I want to assign a string/id for all rows between stream_start_init and stream_start and then do the same for the rows between stream_resume and the last stream_ad.
The amount of stream_ad event can differ hence I can’t use a RANK() or ROW() to group them be based on those values.
|id, timestamp, event|
|1 | 1231231 | first_visit|
|2 | 1231232 | login|
|3 | 1231233 | page_view|
|4 | 1231234 | page_view|
|5 | 1231235 | stream_start_init|
|6 | 1231236 | stream_ad|
|7 | 1231237 | stream_ad|
|8 | 1231238 | stream_ad|
|9 | 1231239 | stream_start|
|6 | 1231216 | stream_resume|
|6 | 1231236 | stream_ad|
|7 | 1231217 | stream_ad|
|8 | 1231258 | stream_ad|
|10| 1231240 | page_view|
How I wish the table to be
|id, timestamp, event, group_id|
|1 | 1231231 | first_visit, null|
|2 | 1231232 | login, null|
|3 | 1231233 | page_view, null|
|4 | 1231234 | page_view, null|
|5 | 1231235 | stream_start_init, group_1|
|6 | 1231236 | stream_ad, group_1|
|7 | 1231237 | stream_ad, group_1|
|8 | 1231238 | stream_ad, group_1|
|9 | 1231239 | stream_start, group_1|
|6 | 1231216 | stream_resume, group_2|
|6 | 1231236 | stream_ad, group_2|
|7 | 1231217 | stream_ad, group_2|
|8 | 1231258 | stream_ad, group_2|
|10| 1231240 | page_view, null|
Advertisement
Answer
I wouldn’t assign a string. I would assign a number. This appears to be a cumulative sum. I think a sum of the number of “stream_start_init” and “stream_resume” does what you want:
select t.*,
countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp) as group_id
from t;
Note that this produces 0
for the first group — which seems like a good thing. You can convert that to a NULL
using NULLIF()
.
If you really want strings, you can use CONCAT()
.