Skip to content
Advertisement

Using SQL to group consecutive items that share a common status (dummy data included)

Given a table that has sometimes repeated statuses within a group (in this case “vehicles”), I want to consolidate those statuses into a single row and aggregate status_seconds. The data looks like this (I’ll include some TSQL below to select dummy data into a temp table to make it easy to work with this example)

raw data

I want to, for example, consolidate the first three rows of this table for vehicle T101 into a single row with status_seconds = 1+1+2 (4 seconds). For the dummy data, these are the vehicles with consecutive status rows that need to be consolidated.

data to consolidate

Note that in rows 5-7, the travelling status for T101 is broken up by a T102 status.

This seems like a problem for a recursive CTE to me, but I’m finding it difficult to solve.

So far, I’ve been able to identify the anchor nodes for the above. I.e. for each vehicle, I can identify the last occurrence of a status for a vehicle. Here’s the dummy data and a CTE that identifies Anchor nodes.

CREATE TABLE ##vehiclesAndStates
(
id INT,
vehicle_name VARCHAR(30),
vehicle_status VARCHAR(30),
status_end_time DATETIME, 
status_seconds INT
)
INSERT INTO ##vehiclesAndStates VALUES 
     (100, 'T101', 'STOPPED', '2020-12-04 09:43:18.000',  1)
    ,(801, 'T101', 'STOPPED', '2020-12-04 09:43:19.000', 1)
    ,(745, 'T101', 'STOPPED', '2020-12-04 09:43:20.000', 2)
    ,(925, 'T101', 'TURNING', '2020-12-04 09:43:22.000', 1)
    ,(626, 'T101', 'TRAVELLING', '2020-12-04 09:43:23.000', 10)
    ,(401, 'T102', 'STOPPED', '2020-12-04 09:43:23.000', 10)    
    ,(201, 'T101', 'TRAVELLING', '2020-12-04 09:43:33.000', 1)
    ,(808, 'T102', 'STOPPPED', '2020-12-04 09:43:33.000', 3)
    ,(707, 'T102', 'STOPPPED', '2020-12-04 09:43:35.000', 7)
    ,(888, 'T101', 'TURNING', '2020-12-04 09:43:34.000', 1)
    ,(42, 'T101', 'STOPPED', '2020-12-04 09:43:35.000', 3)
    ,(2, 'T102', 'PARKED', '2020-12-04 09:43:35.000', 10)
    ,(911, 'T101', 'TRAVELLING', '2020-12-04 09:43:35.000', 1)

SELECT * FROM ##vehiclesAndStates


-- identify anchor nodes: rows where the previous status for a vehicle was different 

;with cte_AnchorNodes as 
  (
  
SELECT i.*
  FROM (
    SELECT 
         a.ID
        ,a.vehicle_name
        ,a.vehicle_status
        ,a.status_end_time
        ,a.status_seconds   
        ,previous_vehicle_status = LAG(a.vehicle_status,1) OVER (
            ORDER BY a.vehicle_name, a.status_end_time
        )
        ,previous_ID = LAG(a.ID,1) OVER (
            ORDER BY a.vehicle_name, a.status_end_time
        )
    FROM 
        ##vehiclesAndStates a
    ) i
    WHERE i.vehicle_status <> IsNull(i.previous_vehicle_status, 'Handle Nulls')
)

RESULT anchor nodes

However, I’m struggling to make the recursive CTE work:

–Select * From cte_AnchorNodes a order by a.vehicle_name, a.status_end_time

,cteRecursiveStatuses (Id, VehicleName, VehicleStatus, StatusEndTime, recursionDepth)  AS
  (
    SELECT  
             a.ID
            ,a.vehicle_name
            ,a.vehicle_status
            ,a.status_end_time
            ,0 recursionDepth 
      FROM  cte_AnchorNodes a

    UNION ALL

    SELECT 
            ?? 
    FROM 
            ##vehiclesAndStates b
    JOIN 
            cteRecursiveStatuses r ON r.Id = ??
     AND b.vehicle_status = r.VehicleStatus
  ) 

Select * From cteRecursiveStatuses

DROP TABLE ##vehiclesAndStates

Advertisement

Answer

This is a typical gaps-and-islands problem, where you want to group together “adjacent” rows that share the same vehicle and status (the islands).

You don’t need a recursive query for this: window functions can get this done. Here, the simplest approach probably is to use the difference between row numbers to identify the groups.

select vehicle_name, vehicle_status, 
    min(status_end_time) as min_status_end_time, 
    max(status_end_time) as max_status_end_time, 
    sum(status_seconds)  as sum_status_seconds 
from (
    select vs.*, 
        row_number() over(partition by vehicle_name order by status_end_time) rn1,
        row_number() over(partition by vehicle_name, vehicle_status order by status_end_time) rn2
    from ##vehiclesAndStates vs
) t
group by vehicle_name, vehicle_status, rn1 - rn2
order by vehicle_name, min(status_end_time)

You can run the subquery separately and look how the row numbers change to understand more.

For your sample data, the query returns:

vehicle_name | vehicle_status | min_status_end_time     | max_status_end_time     | sum_status_seconds
:----------- | :------------- | :---------------------- | :---------------------- | -----------------:
T101         | STOPPED        | 2020-12-04 09:43:18.000 | 2020-12-04 09:43:20.000 |                  4
T101         | TURNING        | 2020-12-04 09:43:22.000 | 2020-12-04 09:43:22.000 |                  1
T101         | TRAVELLING     | 2020-12-04 09:43:23.000 | 2020-12-04 09:43:33.000 |                 11
T101         | TURNING        | 2020-12-04 09:43:34.000 | 2020-12-04 09:43:34.000 |                  1
T101         | TRAVELLING     | 2020-12-04 09:43:35.000 | 2020-12-04 09:43:35.000 |                  1
T101         | STOPPED        | 2020-12-04 09:43:35.000 | 2020-12-04 09:43:35.000 |                  3
T102         | STOPPED        | 2020-12-04 09:43:23.000 | 2020-12-04 09:43:23.000 |                 10
T102         | STOPPPED       | 2020-12-04 09:43:33.000 | 2020-12-04 09:43:35.000 |                 10
T102         | PARKED         | 2020-12-04 09:43:35.000 | 2020-12-04 09:43:35.000 |                 10
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement