Skip to content
Advertisement

How to transform a BigQuery table into a list of row sequences where sequences are aggregated by a sliding window in time?

I have a very large table where each row represents an abstraction called a Trip. Trips consist of numeric columns such as vehicle id, trip id, start time, stop time, distance traveled, driving duration, etc.

I want to transform this table into a list of Trip sequences where Trips are grouped into sequences by vehicle id and also by a sliding window in time. Essentially, each group / sequence consists of trips from the same vehicle id where the trips fall within a window of say 5 days. Naturally, the group / sequence will be of variable length (preferably with a max size where extra trips could be ignored). The window however is non-overlapping so a trip cannot be in two different groups / sequences. Finally, the sequences are ordered by StartTime.

Example: (window = 5 days)

[
 **Oct31 - Nov4**
(Vehicle1, [Trip7, Trip8, Trip9, Trip10]),
(Vehicle2, [Trip3, Trip4, Trip5]) 
 **Oct26 - Oct30**
(Vehicle1, [Trip1, Trip2, Trip3, Trip4, Trip5, Trip6]),   
(Vehicle2, [Trip1, Trip2]),
]

Variant of a previous question

Advertisement

Answer

Below is for BigQuery Standard SQL

#standardSQL
WITH windows AS (
  SELECT start_day, DATE_ADD(start_day, INTERVAL 4 DAY) end_day
  FROM UNNEST(GENERATE_DATE_ARRAY('2019-10-01', CURRENT_DATE(), INTERVAL 5 DAY)) start_day
)
SELECT start_day, end_day, trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips
FROM `project.dataset.table` trip
JOIN windows ON start_time BETWEEN start_day AND end_day
GROUP BY start_day, end_day, vehicle_id

You can test, play with above using dummy data as in example below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Vehicle1' vehicle_id, 'Trip1' trip_id, DATE '2019-10-29' start_time, DATE '2019-10-30' stop_time UNION ALL
  SELECT 'Vehicle1', 'Trip2', '2019-10-30', '2019-10-31' UNION ALL
  SELECT 'Vehicle1', 'Trip3', '2019-10-31', '2019-11-01' UNION ALL
  SELECT 'Vehicle1', 'Trip4', '2019-11-01', '2019-11-02' UNION ALL
  SELECT 'Vehicle1', 'Trip5', '2019-11-02', '2019-11-03' UNION ALL
  SELECT 'Vehicle1', 'Trip6', '2019-11-03', '2019-12-04' UNION ALL
  SELECT 'Vehicle1', 'Trip7', '2019-11-04', '2019-12-05' UNION ALL
  SELECT 'Vehicle2', 'Trip1', '2019-10-29', '2019-10-30' UNION ALL
  SELECT 'Vehicle2', 'Trip2', '2019-10-30', '2019-10-31' UNION ALL
  SELECT 'Vehicle2', 'Trip3', '2019-10-31', '2019-11-01' UNION ALL
  SELECT 'Vehicle2', 'Trip4', '2019-11-01', '2019-11-02'  
), windows AS (
  SELECT start_day, DATE_ADD(start_day, INTERVAL 4 DAY) end_day
  FROM UNNEST(GENERATE_DATE_ARRAY('2019-10-01', CURRENT_DATE(), INTERVAL 5 DAY)) start_day
)
SELECT start_day, end_day, trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips
FROM `project.dataset.table` trip
JOIN windows ON start_time BETWEEN start_day AND end_day
GROUP BY start_day, end_day, vehicle_id
-- ORDER BY start_day, end_day, vehicle_id

with result

enter image description here

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement