I have actions and results mixed together in an “event” field. But each of these actions essentially “causes” the result (and the subsequent outcome) (let’s assume). I want to tie each future outcome back to the actions that caused it, for each user.
I want the value from the “outcome” field to be copied into a new field for each user’s most upcoming, future result (if a result has occurred after that action). And I want to record the time of that result/outcome. So for example, if user_1 does three events then has a good outcome, I want the word “good” in a new field for each of those three preceding events. If user_2 has 2 actions, then a bad outcome, then 3 actions: I want the 2 first actions to have “bad” and the last 3 actions to have “?” in the event_outcome field.
*Note: event_timestamp field does not always come chronologically (with relation to row number). I’ve shown it ordered this way to make it easier to understand.
What I have is this*
event_timestamp | user| event | outcome
2020-07-20 15:00:00.000 UTC | 1 | action-throw|
2020-07-20 15:01:00.000 UTC | 1 | result | good
2020-07-20 15:02:00.000 UTC | 1 | action-push |
2020-07-20 15:03:00.000 UTC | 2 | action-run |
2020-07-20 15:04:00.000 UTC | 2 | result | bad
2020-07-20 15:05:00.000 UTC | 2 | action-throw|
2020-07-20 15:06:00.000 UTC | 1 | action-push |
2020-07-20 15:07:00.000 UTC | 1 | result | bad
2020-07-20 15:08:00.000 UTC | 1 | action-push |
2020-07-20 15:09:00.000 UTC | 2 | result | good
2020-07-20 15:10:00.000 UTC | 2 | action-pull |
2020-07-20 15:11:00.000 UTC | 2 | action-push |
What I want is this
event_timestamp | user| event | outcome | event_outcome | event_outcome_timestamp
2020-07-20 15:00:00.000 UTC | 1 | action-throw| | good | 2020-07-20 15:01:00.000 UTC
2020-07-20 15:01:00.000 UTC | 1 | result | good | good | 2020-07-20 15:01:00.000 UTC
2020-07-20 15:02:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC
2020-07-20 15:03:00.000 UTC | 2 | action-run | | bad | 2020-07-20 15:04:00.000 UTC
2020-07-20 15:04:00.000 UTC | 2 | result | bad | bad | 2020-07-20 15:04:00.000 UTC
2020-07-20 15:05:00.000 UTC | 2 | action-throw| | good | 2020-07-20 15:09:00.000 UTC
2020-07-20 15:06:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC
2020-07-20 15:07:00.000 UTC | 1 | result | bad | bad | 2020-07-20 15:07:00.000 UTC
2020-07-20 15:08:00.000 UTC | 1 | action-push | | ? | ?
2020-07-20 15:09:00.000 UTC | 2 | result | good | good | 2020-07-20 15:09:00.000 UTC
2020-07-20 15:10:00.000 UTC | 2 | action-pull | | ? | ?
2020-07-20 15:11:00.000 UTC | 2 | action-push | | ? | ?
I can find the most recent outcome for each user (Ideally I would have “?” instead of null if no outcome is known). with this code:
select *
from (
select *
from table
where event_timestamp > x
) temp1
LEFT JOIN
(select user as user2, outcome as outcome_latest, event_timestamp as event_timestamp_latest_outcome
from(
select user, event_timestamp, outcome
row_number() over (partition by user order by UNIXMILLIS(event_timestamp) desc) as rn
from table
where event_timestamp > x
and outcome is not null
)
where rn = 1
) temp2
on temp1.user = temp2.user2
Which yeilds this (NOT what I want)
event_timestamp | user| event | outcome | outcome_latest | event_timestamp_latest_outcome
2020-07-20 15:00:00.000 UTC | 1 | action-throw| | bad | 2020-07-20 15:07:00.000 UTC
2020-07-20 15:01:00.000 UTC | 1 | result | good | bad | 2020-07-20 15:07:00.000 UTC
2020-07-20 15:02:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC
2020-07-20 15:03:00.000 UTC | 2 | action-run | | good | 2020-07-20 15:09:00.000 UTC
2020-07-20 15:04:00.000 UTC | 2 | result | bad | good | 2020-07-20 15:09:00.000 UTC
2020-07-20 15:05:00.000 UTC | 2 | action-throw| | good | 2020-07-20 15:09:00.000 UTC
2020-07-20 15:06:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC
2020-07-20 15:07:00.000 UTC | 1 | result | bad | bad | 2020-07-20 15:07:00.000 UTC
2020-07-20 15:08:00.000 UTC | 1 | action-push | | bad | 2020-07-20 15:07:00.000 UTC
2020-07-20 15:09:00.000 UTC | 2 | result | good | good | 2020-07-20 15:09:00.000 UTC
2020-07-20 15:10:00.000 UTC | 2 | action-pull | | good | 2020-07-20 15:09:00.000 UTC
2020-07-20 15:11:00.000 UTC | 2 | action-push | | good | 2020-07-20 15:09:00.000 UTC
But I don’t know where to go from here. Maybe another level of nesting “partition by”? Over the event field?…
We are using BigQuery, if that context helps. Thank you!
Advertisement
Answer
Here is how i would do it in regular SQL. I am assuming that Big Query supports the analytical functions that i have used.
with data
as (
select *
,case when (lag(outcome) over(partition by user1 order by event_timestamp) <> ' '
and outcome =' '
)/*Check if there has been a change in outcome by user*/
OR (outcome=' '
and row_number() over(partition by user1 order by event_timestamp)=1
)/*for the first record per user, we consider it to be a new group*/
then
row_number() over(partition by user1 order by event_timestamp)
else ' '
end as grp_val
from t
)
,computed_grp_val
as(select *
,sum(case when grp_val =' ' then 0 else 1 end) over(partition by user1 order by event_timestamp) as grp_val_1
from data
)
select *
,case when max(outcome) over(partition by user1,grp_val_1)=' ' then '?'
else max(outcome) over(partition by user1,grp_val_1)
end as comptued_outcome
,max(event_timestamp) over(partition by user1,grp_val_1) as computed_event_timestamp
from computed_grp_val
order by 1
+-----------------------------+-------+--------------+---------+---------+-----------+------------------+-----------------------------+
| event_timestamp | user1 | event | outcome | grp_val | grp_val_1 | comptued_outcome | computed_event_timestamp |
+-----------------------------+-------+--------------+---------+---------+-----------+------------------+-----------------------------+
| 2020-07-20 15:00:00.0000000 | 1 | action-throw | | 1 | 1 | good | 2020-07-20 15:01:00.0000000 |
| 2020-07-20 15:01:00.0000000 | 1 | result | good | 0 | 1 | good | 2020-07-20 15:01:00.0000000 |
| 2020-07-20 15:02:00.0000000 | 1 | action-push | | 3 | 2 | bad | 2020-07-20 15:07:00.0000000 |
| 2020-07-20 15:03:00.0000000 | 2 | action-run | | 1 | 1 | bad | 2020-07-20 15:04:00.0000000 |
| 2020-07-20 15:04:00.0000000 | 2 | result | bad | 0 | 1 | bad | 2020-07-20 15:04:00.0000000 |
| 2020-07-20 15:05:00.0000000 | 2 | action-throw | | 3 | 2 | good | 2020-07-20 15:09:00.0000000 |
| 2020-07-20 15:06:00.0000000 | 1 | action-push | | 0 | 2 | bad | 2020-07-20 15:07:00.0000000 |
| 2020-07-20 15:07:00.0000000 | 1 | result | bad | 0 | 2 | bad | 2020-07-20 15:07:00.0000000 |
| 2020-07-20 15:08:00.0000000 | 1 | action-push | | 6 | 3 | ? | 2020-07-20 15:08:00.0000000 |
| 2020-07-20 15:09:00.0000000 | 2 | result | good | 0 | 2 | good | 2020-07-20 15:09:00.0000000 |
| 2020-07-20 15:10:00.0000000 | 2 | action-pull | | 5 | 3 | ? | 2020-07-20 15:11:00.0000000 |
| 2020-07-20 15:11:00.0000000 | 2 | action-push | | 0 | 3 | ? | 2020-07-20 15:11:00.0000000 |
+-----------------------------+-------+--------------+---------+---------+-----------+------------------+-----------------------------+
DB Fiddle Link
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=e97ebf473ccf3c33d7c6fa62fd14e51b