Skip to content
Advertisement

What is the best way in SQL to combine sequential events based on matching end time to start time?

That database I work in records events based on a part ID and the times in which it is active. The issue I came across is these events are truncated to fit within a single day. If the active time for a part carries over to the next day, the event will be split by the number of days it ties in to. In this case, the active start timestamp on the next day matches the active end timestamp of the previous. I am trying to come up with a way to combine these split events into a single record with the “true” start and end times for when each part is active.


Here is a sample of what the dataset looks like:

date    part_id   active_start  active_end
1/1/2019    100   1/1/19 8:00   1/1/19 9:30
1/1/2019    100   1/1/19 14:00  1/2/19 0:00
1/2/2019    100   1/2/19 0:00   1/3/19 0:00
1/3/2019    100   1/3/19 0:00   1/4/19 0:00
1/4/2019    100   1/4/19 0:00   1/4/19 8:00
1/7/2019    100   1/7/19 6:00   1/8/19 0:00
1/8/2019    100   1/8/19 0:00   1/9/19 0:00
1/9/2019    100   1/9/19 0:00   1/9/19 11:30
1/11/2019   100   1/11/19 12:00 1/11/19 22:00
1/13/2019   100   1/13/19 14:30 1/14/19 0:00
1/14/2019   100   1/14/19 0:00  1/15/19 0:00
1/15/2019   100   1/15/19 0:00  1/15/19 8:30

I am trying to reduce it down to the following:

date    part_id   active_start    active_end
1/1/2019    100   1/1/19 8:00     1/1/19 9:30
1/1/2019    100   1/1/19 14:00    1/4/19 8:00
1/7/2019    100   1/7/19 6:00     1/9/19 11:30
1/11/2019   100   1/11/19 12:00   1/11/19 22:00
1/13/2019   100   1/13/19 14:30   1/15/19 8:30

There are ~70 different part numbers and each having anywhere up to 200 different active events over the observation period. Active events can last up to several days. As I am fairly inexperienced in SQL any help would be greatly appreciated.

Advertisement

Answer

This is a gaps and island problem, where you want to group together adjacent rows.

Here is one solution that uses window functions:

select 
    min(date) date,
    part_id,
    min(active_start) active_start,
    max(active_end) active_end
from (
    select
        t.*,
        sum(case when lag_active_end = active_start then 0 else 1 end)
            over(partition by part_id order by active_start) grp
    from (
        select 
            t.*, 
            lag(active_end) over(partition by part_id order by active_start) lag_active_end
        from mytable t
    ) t
) t
group by part_id, grp

The most inner query retrieves the end date of the previous record that has the same part_id. The intermediate query does a window sum that increases by 1 every time the previous end date is not equal to the current start date: this defines the groups of adjacent rows. Finally, the outer query aggregates by group, and computes the start and end of the range.

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement