That database I work in records events based on a part ID and the times in which it is active. The issue I came across is these events are truncated to fit within a single day. If the active time for a part carries over to the next day, the event will be split by the number of days it ties in to. In this case, the active start timestamp on the next day matches the active end timestamp of the previous. I am trying to come up with a way to combine these split events into a single record with the “true” start and end times for when each part is active.
Here is a sample of what the dataset looks like:
date part_id active_start active_end 1/1/2019 100 1/1/19 8:00 1/1/19 9:30 1/1/2019 100 1/1/19 14:00 1/2/19 0:00 1/2/2019 100 1/2/19 0:00 1/3/19 0:00 1/3/2019 100 1/3/19 0:00 1/4/19 0:00 1/4/2019 100 1/4/19 0:00 1/4/19 8:00 1/7/2019 100 1/7/19 6:00 1/8/19 0:00 1/8/2019 100 1/8/19 0:00 1/9/19 0:00 1/9/2019 100 1/9/19 0:00 1/9/19 11:30 1/11/2019 100 1/11/19 12:00 1/11/19 22:00 1/13/2019 100 1/13/19 14:30 1/14/19 0:00 1/14/2019 100 1/14/19 0:00 1/15/19 0:00 1/15/2019 100 1/15/19 0:00 1/15/19 8:30
I am trying to reduce it down to the following:
date part_id active_start active_end 1/1/2019 100 1/1/19 8:00 1/1/19 9:30 1/1/2019 100 1/1/19 14:00 1/4/19 8:00 1/7/2019 100 1/7/19 6:00 1/9/19 11:30 1/11/2019 100 1/11/19 12:00 1/11/19 22:00 1/13/2019 100 1/13/19 14:30 1/15/19 8:30
There are ~70 different part numbers and each having anywhere up to 200 different active events over the observation period. Active events can last up to several days. As I am fairly inexperienced in SQL any help would be greatly appreciated.
Advertisement
Answer
This is a gaps and island problem, where you want to group together adjacent rows.
Here is one solution that uses window functions:
select min(date) date, part_id, min(active_start) active_start, max(active_end) active_end from ( select t.*, sum(case when lag_active_end = active_start then 0 else 1 end) over(partition by part_id order by active_start) grp from ( select t.*, lag(active_end) over(partition by part_id order by active_start) lag_active_end from mytable t ) t ) t group by part_id, grp
The most inner query retrieves the end date of the previous record that has the same part_id
. The intermediate query does a window sum that increases by 1 every time the previous end date is not equal to the current start date: this defines the groups of adjacent rows. Finally, the outer query aggregates by group, and computes the start and end of the range.