I have data that looks like this:
TS                        Col
2019-01-04T01:49:00.000Z  B
2019-01-04T01:50:00.000Z  B
2019-01-04T01:51:00.000Z  B
2019-01-04T01:52:00.000Z  A
2019-01-04T01:53:00.000Z  A
2019-01-04T01:54:00.000Z  A
2019-01-04T01:55:00.000Z  A
2019-01-04T01:56:00.000Z  C
2019-01-04T01:57:00.000Z  C
2019-01-04T01:58:00.000Z  B
2019-01-04T01:59:00.000Z  B
2019-01-04T02:00:00.000Z  B
I want to extract the first and last "TS" values for each run of consecutive rows that share the same "Col" value (A, B, or C), starting a new run every time the value changes. The expected output is as follows:
start_ts                  end_ts
2019-01-04T01:49:00.000Z  2019-01-04T01:51:00.000Z
2019-01-04T01:52:00.000Z  2019-01-04T01:55:00.000Z
2019-01-04T01:56:00.000Z  2019-01-04T01:57:00.000Z
2019-01-04T01:58:00.000Z  2019-01-04T02:00:00.000Z
Thanks for your help in advance!
Answer
This is a type of gaps-and-islands problem. This version is probably best addressed using the difference of row numbers:
    select col, min(ts), max(ts)
    from (select t.*,
                 row_number() over (order by ts) as seqnum,
                 row_number() over (partition by col order by ts) as seqnum_2
          from t
         ) t
    group by col, (seqnum - seqnum_2);
This includes the col value on each row, which seems very useful.
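To see why the difference of row numbers works, here is a minimal sketch that runs the query against the sample data from the question. It assumes SQLite through Python's built-in sqlite3 module (window functions need SQLite 3.25 or later); the question does not say which database is in use. The table name t comes from the query above, while the subquery alias numbered and the column name grp are only illustrative.

```python
import sqlite3

# Sample rows from the question: (ts, col)
rows = [
    ("2019-01-04T01:49:00.000Z", "B"),
    ("2019-01-04T01:50:00.000Z", "B"),
    ("2019-01-04T01:51:00.000Z", "B"),
    ("2019-01-04T01:52:00.000Z", "A"),
    ("2019-01-04T01:53:00.000Z", "A"),
    ("2019-01-04T01:54:00.000Z", "A"),
    ("2019-01-04T01:55:00.000Z", "A"),
    ("2019-01-04T01:56:00.000Z", "C"),
    ("2019-01-04T01:57:00.000Z", "C"),
    ("2019-01-04T01:58:00.000Z", "B"),
    ("2019-01-04T01:59:00.000Z", "B"),
    ("2019-01-04T02:00:00.000Z", "B"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ts TEXT, col TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Inner query: one global row number and one row number per col value.
numbered_sql = """
SELECT t.*,
       ROW_NUMBER() OVER (ORDER BY ts) AS seqnum,
       ROW_NUMBER() OVER (PARTITION BY col ORDER BY ts) AS seqnum_2
FROM t
"""

# Step 1: the difference stays constant inside a run of identical col values.
diffs = [r[0] for r in conn.execute(
    f"SELECT seqnum - seqnum_2 AS grp FROM ({numbered_sql}) numbered ORDER BY ts"
)]
print(diffs)

# Step 2: group on (col, difference) to get one row per run.
islands = conn.execute(f"""
    SELECT col, MIN(ts) AS start_ts, MAX(ts) AS end_ts
    FROM ({numbered_sql}) numbered
    GROUP BY col, (seqnum - seqnum_2)
    ORDER BY start_ts
""").fetchall()
for island in islands:
    print(island)
```

The two B runs get differences 0 and 6, so they end up in separate groups even though they share the same col value. That is why the grouping needs both col and the difference: the difference on its own is not guaranteed to be unique across col values.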
EDIT:
If your data contains duplicate timestamps, you can use dense_rank() rather than row_number().
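As a rough sketch of the dense_rank() variant, the snippet below uses a small made-up data set in which some rows are repeated. It assumes that rows sharing a timestamp also share the same col value, and it again assumes SQLite (3.25 or later) via Python's sqlite3 module. The sample rows and the alias numbered are hypothetical, not taken from the question.

```python
import sqlite3

# Hypothetical rows with repeated timestamps; duplicates carry the same col.
dup_rows = [
    ("2019-01-04T01:49:00.000Z", "B"),
    ("2019-01-04T01:49:00.000Z", "B"),  # duplicate
    ("2019-01-04T01:50:00.000Z", "B"),
    ("2019-01-04T01:51:00.000Z", "A"),
    ("2019-01-04T01:51:00.000Z", "A"),  # duplicate
    ("2019-01-04T01:52:00.000Z", "A"),
    ("2019-01-04T01:53:00.000Z", "B"),
]

dup_conn = sqlite3.connect(":memory:")
dup_conn.execute("CREATE TABLE t (ts TEXT, col TEXT)")
dup_conn.executemany("INSERT INTO t VALUES (?, ?)", dup_rows)

# Same query shape as above, with DENSE_RANK() in place of ROW_NUMBER(),
# so rows that share a timestamp receive the same rank.
dup_islands = dup_conn.execute("""
    SELECT col, MIN(ts) AS start_ts, MAX(ts) AS end_ts
    FROM (SELECT t.*,
                 DENSE_RANK() OVER (ORDER BY ts) AS seqnum,
                 DENSE_RANK() OVER (PARTITION BY col ORDER BY ts) AS seqnum_2
          FROM t
         ) numbered
    GROUP BY col, (seqnum - seqnum_2)
    ORDER BY start_ts
""").fetchall()
for island in dup_islands:
    print(island)
```

With these rows the query returns three runs: B from 01:49 to 01:50, A from 01:51 to 01:52, and a single-row B run at 01:53.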