I have actions and results mixed together in an “event” field. But each of these actions essentially “causes” the result (and the subsequent outcome) (let’s assume). I want to tie each future outcome back to the actions that caused it, for each user.
I want the value from the “outcome” field to be copied into a new field for each user’s most upcoming, future result (if a result has occurred after that action). And I want to record the time of that result/outcome. So for example, if user_1 does three events then has a good outcome, I want the word “good” in a new field for each of those three preceding events. If user_2 has 2 actions, then a bad outcome, then 3 actions: I want the 2 first actions to have “bad” and the last 3 actions to have “?” in the event_outcome field.
*Note: event_timestamp field does not always come chronologically (with relation to row number). I’ve shown it ordered this way to make it easier to understand.
What I have is this*
event_timestamp | user| event | outcome 2020-07-20 15:00:00.000 UTC | 1 | action-throw| 2020-07-20 15:01:00.000 UTC | 1 | result | good 2020-07-20 15:02:00.000 UTC | 1 | action-push | 2020-07-20 15:03:00.000 UTC | 2 | action-run | 2020-07-20 15:04:00.000 UTC | 2 | result | bad 2020-07-20 15:05:00.000 UTC | 2 | action-throw| 2020-07-20 15:06:00.000 UTC | 1 | action-push | 2020-07-20 15:07:00.000 UTC | 1 | result | bad 2020-07-20 15:08:00.000 UTC | 1 | action-push | 2020-07-20 15:09:00.000 UTC | 2 | result | good 2020-07-20 15:10:00.000 UTC | 2 | action-pull | 2020-07-20 15:11:00.000 UTC | 2 | action-push |
What I want is this
event_timestamp | user| event | outcome | event_outcome | event_outcome_timestamp 2020-07-20 15:00:00.000 UTC | 1 | action-throw| | good | 2020-07-20 15:01:00.000 UTC 2020-07-20 15:01:00.000 UTC | 1 | result | good | good | 2020-07-20 15:01:00.000 UTC 2020-07-20 15:02:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC 2020-07-20 15:03:00.000 UTC | 2 | action-run | | bad | 2020-07-20 15:04:00.000 UTC 2020-07-20 15:04:00.000 UTC | 2 | result | bad | bad | 2020-07-20 15:04:00.000 UTC 2020-07-20 15:05:00.000 UTC | 2 | action-throw| | good | 2020-07-20 15:09:00.000 UTC 2020-07-20 15:06:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC 2020-07-20 15:07:00.000 UTC | 1 | result | bad | bad | 2020-07-20 15:07:00.000 UTC 2020-07-20 15:08:00.000 UTC | 1 | action-push | | ? | ? 2020-07-20 15:09:00.000 UTC | 2 | result | good | good | 2020-07-20 15:09:00.000 UTC 2020-07-20 15:10:00.000 UTC | 2 | action-pull | | ? | ? 2020-07-20 15:11:00.000 UTC | 2 | action-push | | ? | ?
I can find the most recent outcome for each user (Ideally I would have “?” instead of null if no outcome is known). with this code:
select * from ( select * from table where event_timestamp > x ) temp1 LEFT JOIN (select user as user2, outcome as outcome_latest, event_timestamp as event_timestamp_latest_outcome from( select user, event_timestamp, outcome row_number() over (partition by user order by UNIXMILLIS(event_timestamp) desc) as rn from table where event_timestamp > x and outcome is not null ) where rn = 1 ) temp2 on temp1.user = temp2.user2
Which yeilds this (NOT what I want)
event_timestamp | user| event | outcome | outcome_latest | event_timestamp_latest_outcome 2020-07-20 15:00:00.000 UTC | 1 | action-throw| | bad | 2020-07-20 15:07:00.000 UTC 2020-07-20 15:01:00.000 UTC | 1 | result | good | bad | 2020-07-20 15:07:00.000 UTC 2020-07-20 15:02:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC 2020-07-20 15:03:00.000 UTC | 2 | action-run | | good | 2020-07-20 15:09:00.000 UTC 2020-07-20 15:04:00.000 UTC | 2 | result | bad | good | 2020-07-20 15:09:00.000 UTC 2020-07-20 15:05:00.000 UTC | 2 | action-throw| | good | 2020-07-20 15:09:00.000 UTC 2020-07-20 15:06:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC 2020-07-20 15:07:00.000 UTC | 1 | result | bad | bad | 2020-07-20 15:07:00.000 UTC 2020-07-20 15:08:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC 2020-07-20 15:09:00.000 UTC | 2 | result | good | good | 2020-07-20 15:09:00.000 UTC 2020-07-20 15:10:00.000 UTC | 2 | action-pull | | good | 2020-07-20 15:09:00.000 UTC 2020-07-20 15:11:00.000 UTC | 2 | action-push | | good | 2020-07-20 15:09:00.000 UTC
But I don’t know where to go from here. Maybe another level of nesting “partition by”? Over the event field?…
We are using BigQuery, if that context helps. Thank you!
Advertisement
Answer
Here is how i would do it in regular SQL. I am assuming that Big Query supports the analytical functions that i have used.
with data as ( select * ,case when (lag(outcome) over(partition by user1 order by event_timestamp) <> ' ' and outcome =' ' )/*Check if there has been a change in outcome by user*/ OR (outcome=' ' and row_number() over(partition by user1 order by event_timestamp)=1 )/*for the first record per user, we consider it to be a new group*/ then row_number() over(partition by user1 order by event_timestamp) else ' ' end as grp_val from t ) ,computed_grp_val as(select * ,sum(case when grp_val =' ' then 0 else 1 end) over(partition by user1 order by event_timestamp) as grp_val_1 from data ) select * ,case when max(outcome) over(partition by user1,grp_val_1)=' ' then '?' else max(outcome) over(partition by user1,grp_val_1) end as comptued_outcome ,max(event_timestamp) over(partition by user1,grp_val_1) as computed_event_timestamp from computed_grp_val order by 1 +-----------------------------+-------+--------------+---------+---------+-----------+------------------+-----------------------------+ | event_timestamp | user1 | event | outcome | grp_val | grp_val_1 | comptued_outcome | computed_event_timestamp | +-----------------------------+-------+--------------+---------+---------+-----------+------------------+-----------------------------+ | 2020-07-20 15:00:00.0000000 | 1 | action-throw | | 1 | 1 | good | 2020-07-20 15:01:00.0000000 | | 2020-07-20 15:01:00.0000000 | 1 | result | good | 0 | 1 | good | 2020-07-20 15:01:00.0000000 | | 2020-07-20 15:02:00.0000000 | 1 | action-push | | 3 | 2 | bad | 2020-07-20 15:07:00.0000000 | | 2020-07-20 15:03:00.0000000 | 2 | action-run | | 1 | 1 | bad | 2020-07-20 15:04:00.0000000 | | 2020-07-20 15:04:00.0000000 | 2 | result | bad | 0 | 1 | bad | 2020-07-20 15:04:00.0000000 | | 2020-07-20 15:05:00.0000000 | 2 | action-throw | | 3 | 2 | good | 2020-07-20 15:09:00.0000000 | | 2020-07-20 15:06:00.0000000 | 1 | action-push | | 0 | 2 | bad | 2020-07-20 15:07:00.0000000 | | 2020-07-20 15:07:00.0000000 | 1 | result | bad | 0 | 2 | bad | 2020-07-20 15:07:00.0000000 | | 2020-07-20 15:08:00.0000000 | 1 | action-push | | 6 | 3 | ? | 2020-07-20 15:08:00.0000000 | | 2020-07-20 15:09:00.0000000 | 2 | result | good | 0 | 2 | good | 2020-07-20 15:09:00.0000000 | | 2020-07-20 15:10:00.0000000 | 2 | action-pull | | 5 | 3 | ? | 2020-07-20 15:11:00.0000000 | | 2020-07-20 15:11:00.0000000 | 2 | action-push | | 0 | 3 | ? | 2020-07-20 15:11:00.0000000 | +-----------------------------+-------+--------------+---------+---------+-----------+------------------+-----------------------------+
DB Fiddle Link
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=e97ebf473ccf3c33d7c6fa62fd14e51b