Group rows based on column values in SQL / BigQuery

Is it possible to “group” rows within BigQuery/SQL depending on column values? Let’s say I want to assign a string/id for all rows between stream_start_init and stream_start and then do the same for the rows between stream_resume and the last stream_ad.

The number of stream_ad events can vary, so I can’t use RANK() or ROW_NUMBER() to group the rows based on those values.

| id | timestamp | event |
| 1  | 1231231 | first_visit |
| 2  | 1231232 | login |
| 3  | 1231233 | page_view |
| 4  | 1231234 | page_view |
| 5  | 1231235 | stream_start_init |
| 6  | 1231236 | stream_ad |
| 7  | 1231237 | stream_ad |
| 8  | 1231238 | stream_ad |
| 9  | 1231239 | stream_start |
| 6  | 1231216 | stream_resume |
| 6  | 1231236 | stream_ad |
| 7  | 1231217 | stream_ad |
| 8  | 1231258 | stream_ad |
| 10 | 1231240 | page_view |


How I would like the table to look:

| id | timestamp | event | group_id |
| 1  | 1231231 | first_visit | null |
| 2  | 1231232 | login | null |
| 3  | 1231233 | page_view | null |
| 4  | 1231234 | page_view | null |
| 5  | 1231235 | stream_start_init | group_1 |
| 6  | 1231236 | stream_ad | group_1 |
| 7  | 1231237 | stream_ad | group_1 |
| 8  | 1231238 | stream_ad | group_1 |
| 9  | 1231239 | stream_start | group_1 |
| 6  | 1231216 | stream_resume | group_2 |
| 6  | 1231236 | stream_ad | group_2 |
| 7  | 1231217 | stream_ad | group_2 |
| 8  | 1231258 | stream_ad | group_2 |
| 10 | 1231240 | page_view | null |

 
|id, timestamp, event, group_id||1 |  1231231 | first_visit, null||2 |  1231232 | login, null||3 |  1231233 | page_view, null||4 |  1231234 | page_view, null| |5 |  1231235 | stream_start_init, group_1||6 |  1231236 | stream_ad, group_1||7 |  1231237 | stream_ad, group_1| |8 |  1231238 | stream_ad, group_1| |9 |  1231239 | stream_start, group_1||6 |  1231216 | stream_resume, group_2||6 |  1231236 | stream_ad, group_2||7 |  1231217 | stream_ad, group_2| |8 |  1231258 | stream_ad, group_2| |10|  1231240 | page_view, null|​

Answer

I wouldn’t assign a string; I would assign a number. This looks like a cumulative sum: a running count of the “stream_start_init” and “stream_resume” events, ordered by timestamp, should do what you want:

-- running count of group-opening events seen so far, in timestamp order
select t.*,
       countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp) as group_id
from t;
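
To try the running count without a real table, here is a minimal sketch that inlines a few rows as a CTE. The sample rows are hypothetical (a shortened version of the question's data with strictly increasing timestamps), and the query assumes BigQuery standard SQL.

```sql
-- Hypothetical sample rows standing in for the real table t.
with t as (
  select 1 as id, 1231231 as timestamp, 'first_visit' as event union all
  select 2, 1231232, 'login' union all
  select 3, 1231233, 'stream_start_init' union all
  select 4, 1231234, 'stream_ad' union all
  select 5, 1231235, 'stream_start' union all
  select 6, 1231236, 'stream_resume' union all
  select 7, 1231237, 'stream_ad' union all
  select 8, 1231238, 'page_view'
)
-- Count how many group-opening events have occurred up to each row.
select t.*,
       countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp) as group_id
from t
order by timestamp;
-- group_id per row, in timestamp order: 0, 0, 1, 1, 1, 2, 2, 2
```

Note that the trailing page_view row keeps the last count (2): the running count marks where a group opens, but it does not close a group by itself. Rows that share the same timestamp also receive the same count, because the default window frame treats ties as peers.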

 
select t.*,       countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp) as group_idfrom t;​

Note that this produces 0 for the rows before the first stream starts, which seems like a good thing. You can convert that to NULL using NULLIF().
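
A minimal sketch of the NULLIF() variant, assuming the same table t and column names as the query above:

```sql
-- Rows before the first stream get NULL instead of 0.
select t.*,
       nullif(
         countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp),
         0
       ) as group_id
from t;
```

NULLIF(x, 0) returns NULL when x equals 0 and x otherwise, so groups 1, 2, and so on are left unchanged.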

If you really want strings, you can use CONCAT().
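
As a sketch of the string version, assuming the same table t, the count can be cast to a string and prefixed. The 'group_' prefix is taken from the desired output in the question.

```sql
-- Build labels such as 'group_1', 'group_2'.
-- CONCAT returns NULL if any argument is NULL, so rows before the first stream stay NULL.
select t.*,
       concat(
         'group_',
         cast(
           nullif(
             countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp),
             0
           ) as string
         )
       ) as group_id
from t;
```

Rows that follow a stream, such as a later page_view, still carry the label of the most recent group. If only events whose names start with stream_ should be labelled (an assumption about the data, not something the original answer states), one option is to wrap the expression in IF(STARTS_WITH(event, 'stream_'), ..., NULL).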
