I have a very large table where each row represents an abstraction called a Trip. Trips consist of numeric columns such as vehicle id, trip id, start time, stop time, distance traveled, driving duration, etc.
I want to transform this table into a list of Trip sequences where Trips are grouped into sequences by vehicle id and also by a sliding window in time. Essentially, each group / sequence consists of trips from the same vehicle id where the trips fall within a window of say 5 days. Naturally, the group / sequence will be of variable length (preferably with a max size where extra trips could be ignored). The window however is non-overlapping so a trip cannot be in two different groups / sequences. Finally, the sequences are ordered by StartTime.
Example: (window = 5 days)
[ **Oct31 - Nov4** (Vehicle1, [Trip7, Trip8, Trip9, Trip10]), (Vehicle2, [Trip3, Trip4, Trip5]) **Oct26 - Oct30** (Vehicle1, [Trip1, Trip2, Trip3, Trip4, Trip5, Trip6]), (Vehicle2, [Trip1, Trip2]), ]
Variant of a previous question
Advertisement
Answer
Below is for BigQuery Standard SQL
#standardSQL WITH windows AS ( SELECT start_day, DATE_ADD(start_day, INTERVAL 4 DAY) end_day FROM UNNEST(GENERATE_DATE_ARRAY('2019-10-01', CURRENT_DATE(), INTERVAL 5 DAY)) start_day ) SELECT start_day, end_day, trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips FROM `project.dataset.table` trip JOIN windows ON start_time BETWEEN start_day AND end_day GROUP BY start_day, end_day, vehicle_id
You can test, play with above using dummy data as in example below
#standardSQL WITH `project.dataset.table` AS ( SELECT 'Vehicle1' vehicle_id, 'Trip1' trip_id, DATE '2019-10-29' start_time, DATE '2019-10-30' stop_time UNION ALL SELECT 'Vehicle1', 'Trip2', '2019-10-30', '2019-10-31' UNION ALL SELECT 'Vehicle1', 'Trip3', '2019-10-31', '2019-11-01' UNION ALL SELECT 'Vehicle1', 'Trip4', '2019-11-01', '2019-11-02' UNION ALL SELECT 'Vehicle1', 'Trip5', '2019-11-02', '2019-11-03' UNION ALL SELECT 'Vehicle1', 'Trip6', '2019-11-03', '2019-12-04' UNION ALL SELECT 'Vehicle1', 'Trip7', '2019-11-04', '2019-12-05' UNION ALL SELECT 'Vehicle2', 'Trip1', '2019-10-29', '2019-10-30' UNION ALL SELECT 'Vehicle2', 'Trip2', '2019-10-30', '2019-10-31' UNION ALL SELECT 'Vehicle2', 'Trip3', '2019-10-31', '2019-11-01' UNION ALL SELECT 'Vehicle2', 'Trip4', '2019-11-01', '2019-11-02' ), windows AS ( SELECT start_day, DATE_ADD(start_day, INTERVAL 4 DAY) end_day FROM UNNEST(GENERATE_DATE_ARRAY('2019-10-01', CURRENT_DATE(), INTERVAL 5 DAY)) start_day ) SELECT start_day, end_day, trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips FROM `project.dataset.table` trip JOIN windows ON start_time BETWEEN start_day AND end_day GROUP BY start_day, end_day, vehicle_id -- ORDER BY start_day, end_day, vehicle_id
with result