I have a table made up of rows of data collected through an indeterministic polling process. Each row has a start and end timestamp denoting the time period in which the data was collected. In some cases the data was collected contiguously, in which case the timestamp of one row will have the same value as the start timestamp for the next row. In other cases there is a break in time between one row and the next.
For example, in the table below, rows number 1,2,3 and 4 are all part of one time series of data. Similarly for rows 5, 6, 7 and 8 and again for rows 9 and 10. In between are time periods for which I do not have data.
Row Start_Timestamp End_Timestamp Data_Item --- --------------- -------------- --------- 1 2019-08-12_22:07:53 2019-08-12_22:09:57 100 2 2019-08-12_22:09:57 2019-08-12_22:12:01 203 3 2019-08-12_22:12:01 2019-08-12_22:13:03 487 4 2019-08-12_22:13:03 2019-08-12_22:16:19 113 5 2019-08-12_22:24:34 2019-08-12_22:26:37 632 6 2019-08-12_22:26:37 2019-08-12_22:27:40 532 7 2019-08-12_22:27:40 2019-08-12_22:28:42 543 8 2019-08-12_22:28:42 2019-08-12_22:31:57 142 9 2019-08-13_19:56:06 2019-08-13_19:57:08 351 10 2019-08-13_19:57:08 2019-08-13_19:58:10 982
I would like to groups these contiguous time series ideally as follows:
Row Series Start_Timestamp End_Timestamp Data_Item --- ------ --------------- -------------- ----------- 1 1 2019-08-12_22:07:53 2019-08-12_22:09:57 100 2 1 2019-08-12_22:09:57 2019-08-12_22:12:01 203 3 1 2019-08-12_22:12:01 2019-08-12_22:13:03 487 4 1 2019-08-12_22:13:03 2019-08-12_22:16:19 113 5 2 2019-08-12_22:24:34 2019-08-12_22:26:37 632 6 2 2019-08-12_22:26:37 2019-08-12_22:27:40 532 7 2 2019-08-12_22:27:40 2019-08-12_22:28:42 543 8 2 2019-08-12_22:28:42 2019-08-12_22:31:57 142 9 3 2019-08-13_19:56:06 2019-08-13_19:57:08 351 10 3 2019-08-13_19:57:08 2019-08-13_19:58:10 982
I am new to SQL and have been struggling with this problem. I appreciate any insights or advice on how I might achieve this.
Advertisement
Answer
This is a simplified gaps-and-island problem. Assuming that your RDBMS support window functions, you can approach this with a window sum. When the Start_Timestamp
of record is different than the End_Timestamp
of the previous record, a new group starts:
select t.Row, sum(case when Start_Timestamp = lag_End_Timestamp then 0 else 1 end) over(order by End_Timestamp) series, t.Start_Timestamp, t.End_Timestamp, t.Data_Item from ( select t.*, lag(End_Timestamp) over (order by End_Timestamp) lag_End_Timestamp from mytable t ) t
Row | series | Start_Timestamp | End_Timestamp | Data_Item --: | -----: | :------------------ | :------------------ | --------: 1 | 1 | 2019-08-12 22:07:53 | 2019-08-12 22:09:57 | 100 2 | 1 | 2019-08-12 22:09:57 | 2019-08-12 22:12:01 | 203 3 | 1 | 2019-08-12 22:12:01 | 2019-08-12 22:13:03 | 487 4 | 1 | 2019-08-12 22:13:03 | 2019-08-12 22:16:19 | 113 5 | 2 | 2019-08-12 22:24:34 | 2019-08-12 22:26:37 | 632 6 | 2 | 2019-08-12 22:26:37 | 2019-08-12 22:27:40 | 532 7 | 2 | 2019-08-12 22:27:40 | 2019-08-12 22:28:42 | 543 8 | 2 | 2019-08-12 22:28:42 | 2019-08-12 22:31:57 | 142 9 | 3 | 2019-08-13 19:56:06 | 2019-08-13 19:57:08 | 351 10 | 3 | 2019-08-13 19:57:08 | 2019-08-13 19:58:10 | 982