Skip to content
Advertisement

SQL – Find most recent value from other fields (per user, between occurrences)

I have actions and results mixed together in an “event” field. But each of these actions essentially “causes” the result (and the subsequent outcome) (let’s assume). I want to tie each future outcome back to the actions that caused it, for each user.

I want the value from the “outcome” field to be copied into a new field for each user’s most upcoming, future result (if a result has occurred after that action). And I want to record the time of that result/outcome. So for example, if user_1 does three events then has a good outcome, I want the word “good” in a new field for each of those three preceding events. If user_2 has 2 actions, then a bad outcome, then 3 actions: I want the 2 first actions to have “bad” and the last 3 actions to have “?” in the event_outcome field.

*Note: event_timestamp field does not always come chronologically (with relation to row number). I’ve shown it ordered this way to make it easier to understand.

What I have is this*

event_timestamp             |   user|   event       |   outcome
2020-07-20 15:00:00.000 UTC |   1   |   action-throw|   
2020-07-20 15:01:00.000 UTC |   1   |   result      |   good
2020-07-20 15:02:00.000 UTC |   1   |   action-push |   
2020-07-20 15:03:00.000 UTC |   2   |   action-run  |   
2020-07-20 15:04:00.000 UTC |   2   |   result      |   bad
2020-07-20 15:05:00.000 UTC |   2   |   action-throw|   
2020-07-20 15:06:00.000 UTC |   1   |   action-push |   
2020-07-20 15:07:00.000 UTC |   1   |   result      |   bad
2020-07-20 15:08:00.000 UTC |   1   |   action-push |   
2020-07-20 15:09:00.000 UTC |   2   |   result      |   good
2020-07-20 15:10:00.000 UTC |   2   |   action-pull |   
2020-07-20 15:11:00.000 UTC |   2   |   action-push |   

What I want is this

event_timestamp             |   user|   event       |   outcome |   event_outcome   |   event_outcome_timestamp
2020-07-20 15:00:00.000 UTC |   1   |   action-throw|           |   good            |   2020-07-20 15:01:00.000 UTC
2020-07-20 15:01:00.000 UTC |   1   |   result      |   good    |   good            |   2020-07-20 15:01:00.000 UTC
2020-07-20 15:02:00.000 UTC |   1   |   action-push |           |   bad             |   2020-07-20 15:07:00.000 UTC
2020-07-20 15:03:00.000 UTC |   2   |   action-run  |           |   bad             |   2020-07-20 15:04:00.000 UTC
2020-07-20 15:04:00.000 UTC |   2   |   result      |   bad     |   bad             |   2020-07-20 15:04:00.000 UTC
2020-07-20 15:05:00.000 UTC |   2   |   action-throw|           |   good            |   2020-07-20 15:09:00.000 UTC
2020-07-20 15:06:00.000 UTC |   1   |   action-push |           |   bad             |   2020-07-20 15:07:00.000 UTC
2020-07-20 15:07:00.000 UTC |   1   |   result      |   bad     |   bad             |   2020-07-20 15:07:00.000 UTC
2020-07-20 15:08:00.000 UTC |   1   |   action-push |           |   ?               |   ?
2020-07-20 15:09:00.000 UTC |   2   |   result      |   good    |   good            |   2020-07-20 15:09:00.000 UTC
2020-07-20 15:10:00.000 UTC |   2   |   action-pull |           |   ?               |   ?
2020-07-20 15:11:00.000 UTC |   2   |   action-push |           |   ?               |   ?

I can find the most recent outcome for each user (Ideally I would have “?” instead of null if no outcome is known). with this code:

select *
from (
    select *
    from table
    where event_timestamp > x
    ) temp1
LEFT JOIN
(select user as user2, outcome as outcome_latest, event_timestamp as event_timestamp_latest_outcome
from(
    select user, event_timestamp, outcome
    row_number() over (partition by user order by UNIXMILLIS(event_timestamp) desc) as rn
    from table
    where event_timestamp > x
    and outcome is not null
    )
where rn = 1
) temp2
on temp1.user = temp2.user2

Which yeilds this (NOT what I want)

event_timestamp             |   user|   event       |   outcome |   outcome_latest  |   event_timestamp_latest_outcome
2020-07-20 15:00:00.000 UTC |   1   |   action-throw|           |   bad             |   2020-07-20 15:07:00.000 UTC
2020-07-20 15:01:00.000 UTC |   1   |   result      |   good    |   bad             |   2020-07-20 15:07:00.000 UTC
2020-07-20 15:02:00.000 UTC |   1   |   action-push |           |   bad             |   2020-07-20 15:07:00.000 UTC
2020-07-20 15:03:00.000 UTC |   2   |   action-run  |           |   good            |   2020-07-20 15:09:00.000 UTC
2020-07-20 15:04:00.000 UTC |   2   |   result      |   bad     |   good            |   2020-07-20 15:09:00.000 UTC
2020-07-20 15:05:00.000 UTC |   2   |   action-throw|           |   good            |   2020-07-20 15:09:00.000 UTC
2020-07-20 15:06:00.000 UTC |   1   |   action-push |           |   bad             |   2020-07-20 15:07:00.000 UTC
2020-07-20 15:07:00.000 UTC |   1   |   result      |   bad     |   bad             |   2020-07-20 15:07:00.000 UTC
2020-07-20 15:08:00.000 UTC |   1   |   action-push |           |   bad             |   2020-07-20 15:07:00.000 UTC
2020-07-20 15:09:00.000 UTC |   2   |   result      |   good    |   good            |   2020-07-20 15:09:00.000 UTC
2020-07-20 15:10:00.000 UTC |   2   |   action-pull |           |   good            |   2020-07-20 15:09:00.000 UTC
2020-07-20 15:11:00.000 UTC |   2   |   action-push |           |   good            |   2020-07-20 15:09:00.000 UTC

But I don’t know where to go from here. Maybe another level of nesting “partition by”? Over the event field?…

We are using BigQuery, if that context helps. Thank you!

Advertisement

Answer

Here is how i would do it in regular SQL. I am assuming that Big Query supports the analytical functions that i have used.

with data
 as (
select *
      ,case when (lag(outcome) over(partition by user1 order by event_timestamp) <> ' '
                  and outcome =' '
                 )/*Check if there has been a change in outcome by user*/
             OR (outcome=' ' 
                 and row_number() over(partition by user1 order by event_timestamp)=1
                )/*for the first record per user, we consider it to be a new group*/
            then
                row_number() over(partition by user1 order by event_timestamp)
            else ' '
        end as grp_val
  from t
     )
    ,computed_grp_val
     as(select *
               ,sum(case when grp_val =' ' then 0 else 1 end) over(partition by user1 order by event_timestamp) as grp_val_1
         from data
        )
select *
      ,case when max(outcome) over(partition by user1,grp_val_1)=' ' then '?'
            else max(outcome) over(partition by user1,grp_val_1)
        end as comptued_outcome
      ,max(event_timestamp) over(partition by user1,grp_val_1) as computed_event_timestamp
  from computed_grp_val
order by 1  



+-----------------------------+-------+--------------+---------+---------+-----------+------------------+-----------------------------+
|       event_timestamp       | user1 |    event     | outcome | grp_val | grp_val_1 | comptued_outcome |  computed_event_timestamp   |
+-----------------------------+-------+--------------+---------+---------+-----------+------------------+-----------------------------+
| 2020-07-20 15:00:00.0000000 |     1 | action-throw |         |       1 |         1 | good             | 2020-07-20 15:01:00.0000000 |
| 2020-07-20 15:01:00.0000000 |     1 | result       | good    |       0 |         1 | good             | 2020-07-20 15:01:00.0000000 |
| 2020-07-20 15:02:00.0000000 |     1 | action-push  |         |       3 |         2 | bad              | 2020-07-20 15:07:00.0000000 |
| 2020-07-20 15:03:00.0000000 |     2 | action-run   |         |       1 |         1 | bad              | 2020-07-20 15:04:00.0000000 |
| 2020-07-20 15:04:00.0000000 |     2 | result       | bad     |       0 |         1 | bad              | 2020-07-20 15:04:00.0000000 |
| 2020-07-20 15:05:00.0000000 |     2 | action-throw |         |       3 |         2 | good             | 2020-07-20 15:09:00.0000000 |
| 2020-07-20 15:06:00.0000000 |     1 | action-push  |         |       0 |         2 | bad              | 2020-07-20 15:07:00.0000000 |
| 2020-07-20 15:07:00.0000000 |     1 | result       | bad     |       0 |         2 | bad              | 2020-07-20 15:07:00.0000000 |
| 2020-07-20 15:08:00.0000000 |     1 | action-push  |         |       6 |         3 | ?                | 2020-07-20 15:08:00.0000000 |
| 2020-07-20 15:09:00.0000000 |     2 | result       | good    |       0 |         2 | good             | 2020-07-20 15:09:00.0000000 |
| 2020-07-20 15:10:00.0000000 |     2 | action-pull  |         |       5 |         3 | ?                | 2020-07-20 15:11:00.0000000 |
| 2020-07-20 15:11:00.0000000 |     2 | action-push  |         |       0 |         3 | ?                | 2020-07-20 15:11:00.0000000 |
+-----------------------------+-------+--------------+---------+---------+-----------+------------------+-----------------------------+

DB Fiddle Link

https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=e97ebf473ccf3c33d7c6fa62fd14e51b

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement