I have a table with events describing user navigation through flows, with this general shape (disregard the timestamps, I smashed some keys):
user timestamp event_name event_choice 1 12345678 start 1 45678912 choose option_a 1 45678912 end
After the option_a
event, users can go back and choose another option:
2 83475656 start 2 45678912 choose option_a 2 93847398 back 2 45678912 choose option_b 2 23487383 end
Users can also go through a flow multiple times:
3 83475656 start 3 45678912 choose option_a 3 93847398 end 3 93847398 start 3 45678912 choose option_b 3 23487383 end
I want to produce a table where each row represents a flow, from start
to end
, including the selected option along the way. For example:
user start_at end_at choice 1 83475656 23487383 option_a 2 83475656 23487383 option_b 3 83475656 23487383 option_a 3 83475656 23487383 option_b
I need to avoid:
- False results from the
back
event changing the option mid-flow. - False results from mixing up two different instances of the flow.
- False results from mixing up two different users.
I have no idea how to approach this. If it’s any difference, I’m using BigQuery, which has some differences with standard SQL and may have some specific functions.
Advertisement
Answer
You can add up the starts
to assign a grouping. Then you want the last option in each group:
select user, min(timestamp) as start_at, max(timestamp) as stop_at, array_agg( event_choice ignore nulls order by timestamp desc limit 1)[ordinal(1)] as choice from (select t.*, countif(event_name = 'start') over (partition by user order by timestamp) as grp from t ) t group by user, grp;
This makes some assumptions about the data, notably:
'start'
and'stop'
are lined up.- you want the most recent non-null
option_choice
.