I have a datetime column. I need to derive a column of total minutes elapsed from the first to the last value of every hour grouped by hour, but, in cases of overlapping event, the time should be distributed between two hours. There is also a condition where if the elapsed time exceeds 30 minutes in between two consecutive records, then it has to be ignored.
Below, I’ve explained in three phases, Original, Intermediate (calculating the running total) and Final.
And, I’m planning to take hourly incremental data on the same, so, how can we properly merge it with the old data is another question.
Sample data:
Moves_TS 1/4/2020 10:00 1/4/2020 10:25 1/4/2020 10:42 1/4/2020 10:56 1/4/2020 10:59 1/4/2020 11:02 1/4/2020 11:24 1/4/2020 11:43 1/4/2020 11:55 1/4/2020 12:26 1/4/2020 12:29
Intermediate layer:
Moves_TS Hour Running Total 1/4/2020 10:00 10 0 1/4/2020 10:25 10 25 1/4/2020 10:42 10 42 1/4/2020 10:56 10 56 1/4/2020 10:59 10 60 1/4/2020 11:02 11 2 1/4/2020 11:24 11 24 1/4/2020 11:43 11 43 1/4/2020 11:55 11 55 1/4/2020 12:26 12 0 1/4/2020 12:29 12 3
Final Output:
Hour Work done/Hour 10 60 11 55 12 3
Advertisement
Answer
This is a gaps-and-islands problem with some twists. First, I would summarize by the “islands” defined by the gaps of 30 minutes:
select min(moves_ts) as start_ts, max(moves_ts) as end_ts from (select o.*, count(prev_moves_ts) filter (where moves_ts > prev_moves_ts + interval '30 minute') over (order by moves_ts) as grp from (select o.*, lag(moves_ts) over (order by moves_ts) as prev_moves_ts from original o ) o ) o group by grp;
Then you can use this with generate_series()
to expand the data and calculate the overlaps with each hour:
with islands as ( select min(moves_ts) as start_ts, max(moves_ts) as end_ts from (select o.*, count(prev_moves_ts) filter (where moves_ts > prev_moves_ts + interval '30 minute') over (order by moves_ts) as grp from (select o.*, lag(moves_ts) over (order by moves_ts) as prev_moves_ts from original o ) o ) o group by grp ) select hh.hh, sum( least(hh.hh + interval '1 hour', i.end_ts) - greatest(hh.hh, i.start_ts) ) as duration from (select generate_series(date_trunc('hour', min(moves_ts)), date_trunc('hour', max(moves_ts)), interval '1 hour' ) hh from original o ) hh left join islands i on i.start_ts < hh.hh + interval '1 hour' and i.end_ts >= hh.hh group by hh.hh order by hh.hh;
Here is a db<>fiddle.