How to get the first and last value of one column based on another column's values

I have data that looks like this:

TS                         Col
2019-01-04T01:49:00.000Z    B
2019-01-04T01:50:00.000Z    B
2019-01-04T01:51:00.000Z    B
2019-01-04T01:52:00.000Z    A
2019-01-04T01:53:00.000Z    A
2019-01-04T01:54:00.000Z    A
2019-01-04T01:55:00.000Z    A
2019-01-04T01:56:00.000Z    C
2019-01-04T01:57:00.000Z    C
2019-01-04T01:58:00.000Z    B
2019-01-04T01:59:00.000Z    B
2019-01-04T02:00:00.000Z    B

I want to extract the first and last “TS” value for each run of consecutive rows that share the same “Col” value (A, B, or C), starting a new run whenever “Col” changes. The expected output is as follows:

start_ts                     end_ts
2019-01-04T01:49:00.000Z    2019-01-04T01:51:00.000Z
2019-01-04T01:52:00.000Z    2019-01-04T01:55:00.000Z
2019-01-04T01:56:00.000Z    2019-01-04T01:57:00.000Z
2019-01-04T01:58:00.000Z    2019-01-04T02:00:00.000Z

Thanks for your help in advance!

Answer

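This is a type of gaps-and-islands problem. This version is probably best addressed using the difference of row numbers. Within a run of consecutive rows that share the same col, seqnum and seqnum_2 both increase by one per row, so their difference stays constant and identifies the run: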

select col, min(ts) as start_ts, max(ts) as end_ts
from (select t.*,
             row_number() over (order by ts) as seqnum,
             row_number() over (partition by col order by ts) as seqnum_2
      from t
     ) t
group by col, (seqnum - seqnum_2);

The result includes the col value on each row, which seems very useful.
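
To try the query against the sample data from the question, here is a minimal sketch that loads the rows into an in-memory SQLite database from Python. SQLite, the table name t, the subquery alias numbered, the function name find_islands and the added "order by start_ts" are assumptions made for illustration only; the question does not say which database is in use, and SQLite needs version 3.25 or later for window functions.

```python
import sqlite3

# Sample rows copied from the question: (TS, Col).
rows = [
    ("2019-01-04T01:49:00.000Z", "B"),
    ("2019-01-04T01:50:00.000Z", "B"),
    ("2019-01-04T01:51:00.000Z", "B"),
    ("2019-01-04T01:52:00.000Z", "A"),
    ("2019-01-04T01:53:00.000Z", "A"),
    ("2019-01-04T01:54:00.000Z", "A"),
    ("2019-01-04T01:55:00.000Z", "A"),
    ("2019-01-04T01:56:00.000Z", "C"),
    ("2019-01-04T01:57:00.000Z", "C"),
    ("2019-01-04T01:58:00.000Z", "B"),
    ("2019-01-04T01:59:00.000Z", "B"),
    ("2019-01-04T02:00:00.000Z", "B"),
]


def find_islands(rows):
    """Return (col, start_ts, end_ts) for each run of consecutive equal col values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (ts text, col text)")
    conn.executemany("insert into t (ts, col) values (?, ?)", rows)
    query = """
        select col, min(ts) as start_ts, max(ts) as end_ts
        from (select t.*,
                     row_number() over (order by ts) as seqnum,
                     row_number() over (partition by col order by ts) as seqnum_2
              from t
             ) numbered
        group by col, (seqnum - seqnum_2)
        order by start_ts  -- added here only to make the output order deterministic
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


islands = find_islands(rows)
for col, start_ts, end_ts in islands:
    print(col, start_ts, end_ts)
```

For this data the differences seqnum - seqnum_2 are 0 for the first B run, 3 for the A run, 7 for the C run and 6 for the second B run, so the two B runs end up in separate groups even though they share the same col value.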

EDIT:

If the data contains duplicate timestamps, you can use dense_rank() rather than row_number().
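
A sketch of the dense_rank() variant, again run through an in-memory SQLite database from Python. The sample rows below are hypothetical (they are not from the question), and the sketch assumes that rows sharing a timestamp also share the same col value; SQLite, the alias ranked and the function name find_islands_with_ties are illustration-only assumptions as well.

```python
import sqlite3

# Hypothetical rows with duplicate timestamps; duplicates share the same col value.
rows_with_ties = [
    ("2019-01-04T01:49:00.000Z", "B"),
    ("2019-01-04T01:49:00.000Z", "B"),
    ("2019-01-04T01:50:00.000Z", "B"),
    ("2019-01-04T01:51:00.000Z", "A"),
    ("2019-01-04T01:51:00.000Z", "A"),
    ("2019-01-04T01:52:00.000Z", "B"),
]


def find_islands_with_ties(rows):
    """Same grouping idea as before, but tied timestamps receive the same rank."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (ts text, col text)")
    conn.executemany("insert into t (ts, col) values (?, ?)", rows)
    query = """
        select col, min(ts) as start_ts, max(ts) as end_ts
        from (select t.*,
                     dense_rank() over (order by ts) as seqnum,
                     dense_rank() over (partition by col order by ts) as seqnum_2
              from t
             ) ranked
        group by col, (seqnum - seqnum_2)
        order by start_ts  -- added only to make the output order deterministic
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


tie_islands = find_islands_with_ties(rows_with_ties)
for col, start_ts, end_ts in tie_islands:
    print(col, start_ts, end_ts)
```

With row_number(), the two windows may number tied rows in different orders, so the difference is not guaranteed to be constant within a run; dense_rank() gives tied rows the same rank in both windows, which keeps the difference stable.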

User contributions licensed under: CC BY-SA