Is it possible to “group” rows within BigQuery/SQL depending on column values? Let’s say I want to assign a string/id for all rows between stream_start_init and stream_start and then do the same for the rows between stream_resume and the last stream_ad.
The amount of stream_ad event can differ hence I can’t use a RANK() or ROW() to group them be based on those values.
|id, timestamp, event| |1 | 1231231 | first_visit| |2 | 1231232 | login| |3 | 1231233 | page_view| |4 | 1231234 | page_view| |5 | 1231235 | stream_start_init| |6 | 1231236 | stream_ad| |7 | 1231237 | stream_ad| |8 | 1231238 | stream_ad| |9 | 1231239 | stream_start| |6 | 1231216 | stream_resume| |6 | 1231236 | stream_ad| |7 | 1231217 | stream_ad| |8 | 1231258 | stream_ad| |10| 1231240 | page_view|
How I wish the table to be
|id, timestamp, event, group_id| |1 | 1231231 | first_visit, null| |2 | 1231232 | login, null| |3 | 1231233 | page_view, null| |4 | 1231234 | page_view, null| |5 | 1231235 | stream_start_init, group_1| |6 | 1231236 | stream_ad, group_1| |7 | 1231237 | stream_ad, group_1| |8 | 1231238 | stream_ad, group_1| |9 | 1231239 | stream_start, group_1| |6 | 1231216 | stream_resume, group_2| |6 | 1231236 | stream_ad, group_2| |7 | 1231217 | stream_ad, group_2| |8 | 1231258 | stream_ad, group_2| |10| 1231240 | page_view, null|
Advertisement
Answer
I wouldn’t assign a string. I would assign a number. This appears to be a cumulative sum. I think a sum of the number of “stream_start_init” and “stream_resume” does what you want:
select t.*, countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp) as group_id from t;
Note that this produces 0
for the first group — which seems like a good thing. You can convert that to a NULL
using NULLIF()
.
If you really want strings, you can use CONCAT()
.