Merge groups of consecutive rows in T-SQL and sum values from each group

Updated 10/08/2019:

@Gordon Linoff: I have tried to applied your solution but I realized that is not working as expected. I have added an example with expected result with comments here (https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=1b486476d6aeab25997f25e66ee455e9) and I would be grateful if you could help me.

—

I have a table of transactions with schema:

CREATE TABLE Transactions (Id int IDENTITY, SessionId int, TransactionType varchar(50), DateTimeEnd datetime, DateStart datetime, Rank int);

​x
 
CREATE TABLE Transactions (Id int IDENTITY, SessionId int, TransactionType varchar(50), DateTimeEnd datetime, DateStart datetime, Rank int);​

Here are some row examples:

INSERT INTO Transactions (Id, SessionId, TransactionType, DateTimeEnd, DateStart, Rank)
VALUES
 (1, 1, 'Deposit',    '2017-01-20T11:16:33Z', '2017-01-20T11:16:33Z', 600),
 (2, 1, 'Withdrawal', '2017-01-21T11:16:33Z', '2017-01-20T11:16:33Z', 100),
 (3, 2, 'Deposit',    '2017-02-23T11:16:33Z', '2017-02-23T11:16:33Z', 500),
 (4, 1, 'Withdrawal', '2017-01-24T11:16:33Z', '2017-01-21T11:16:33Z', 150),
 (5, 1, 'Withdrawal', '2017-01-26T11:16:33Z', '2017-01-24T11:16:33Z', 150),
 (6, 2, 'Withdrawal', '2017-02-27T11:16:33Z', '2017-02-23T11:16:33Z', 200),
 (7, 1, 'Withdrawal', '2017-01-28T11:16:33Z', '2017-01-26T11:16:33Z', 10),
 (8, 1, 'Withdrawal', '2017-01-30T11:16:33Z', '2017-01-28T11:16:33Z', 10),
 (9, 1, 'Withdrawal', '2017-01-31T11:16:33Z', '2017-01-30T11:16:33Z', 10);

 
INSERT INTO Transactions (Id, SessionId, TransactionType, DateTimeEnd, DateStart, Rank)VALUES (1, 1, 'Deposit',    '2017-01-20T11:16:33Z', '2017-01-20T11:16:33Z', 600), (2, 1, 'Withdrawal', '2017-01-21T11:16:33Z', '2017-01-20T11:16:33Z', 100), (3, 2, 'Deposit',    '2017-02-23T11:16:33Z', '2017-02-23T11:16:33Z', 500), (4, 1, 'Withdrawal', '2017-01-24T11:16:33Z', '2017-01-21T11:16:33Z', 150), (5, 1, 'Withdrawal', '2017-01-26T11:16:33Z', '2017-01-24T11:16:33Z', 150), (6, 2, 'Withdrawal', '2017-02-27T11:16:33Z', '2017-02-23T11:16:33Z', 200), (7, 1, 'Withdrawal', '2017-01-28T11:16:33Z', '2017-01-26T11:16:33Z', 10), (8, 1, 'Withdrawal', '2017-01-30T11:16:33Z', '2017-01-28T11:16:33Z', 10), (9, 1, 'Withdrawal', '2017-01-31T11:16:33Z', '2017-01-30T11:16:33Z', 10);​

What I want is a T-SQL query to merge groups of consecutive rows by SessionId, TransactionType and from each group to keep only the row with the minimum DateTimeEnd. Also, the Rank value of the row kept has to be the sum of Rank values from the rows of the group. The T-SQL query needs to run in MS SQL Server in Microsoft Azure SQL Data Warehouse.

Desired Result:

|    Id    |     SessionId    | Transaction |       DateTimeEnd  |      DateStart     |   Rank  |
|----------|------------------|-------------|--------------------|--------------------|---------|
|    1     |         1        |      Deposit|2017-01-20T11:16:33Z|2017-01-20T11:16:33Z|   600   |
|    2     |         1        |   Withdrawal|2017-01-21T11:16:33Z|2017-01-20T11:16:33Z|   100   |
|  4       |         1        |   Withdrawal|2017-01-24T11:16:33Z|2017-01-21T11:16:33Z|   300   |
|  7       |         1        |   Withdrawal|2017-01-28T11:16:33Z|2017-01-26T11:16:33Z|    30   |
|    3     |         2        |      Deposit|2017-02-23T11:16:33Z|2017-02-23T11:16:33Z|   500   |
|    6     |         2        |   Withdrawal|2017-02-27T11:16:33Z|2017-02-23T11:16:33Z|   200   |

 
|    Id    |     SessionId    | Transaction |       DateTimeEnd  |      DateStart     |   Rank  ||----------|------------------|-------------|--------------------|--------------------|---------||    1     |         1        |      Deposit|2017-01-20T11:16:33Z|2017-01-20T11:16:33Z|   600   ||    2     |         1        |   Withdrawal|2017-01-21T11:16:33Z|2017-01-20T11:16:33Z|   100   ||  4       |         1        |   Withdrawal|2017-01-24T11:16:33Z|2017-01-21T11:16:33Z|   300   ||  7       |         1        |   Withdrawal|2017-01-28T11:16:33Z|2017-01-26T11:16:33Z|    30   ||    3     |         2        |      Deposit|2017-02-23T11:16:33Z|2017-02-23T11:16:33Z|   500   ||    6     |         2        |   Withdrawal|2017-02-27T11:16:33Z|2017-02-23T11:16:33Z|   200   |​

I have tried so many approaches and couldn’t implement it.

Answer

This is a gaps-and-islands problem, as GMB points out. Because you want to keep the first row, I’m going to suggest a lag() approach instead of the difference of row numbers:

SELECT SessionId, TransactionType, DateTimeEnd,DateStart, sumRank
FROM (SELECT t.*,
             SUM(Rank) OVER (PARTITION BY SessionId, TransactionType, grp) as sumRank
      FROM (SELECT t.*,
                   SUM(CASE WHEN prev_st_id = prev_id THEN 0 ELSE 1 END) OVER (ORDER BY id) as grp
            FROM (SELECT t.*,
                         LAG(id) OVER (PARTITION BY SessionId, TransactionType ORDER BY id) as prev_st_id,
                         LAG(id) OVER (PARTITION BY SessionId ORDER BY id) as prev_id
                  FROM Transactions t
                 ) t
           ) t
     ) t
WHERE prev_st_id <> prev_id OR prev_st_id IS NULL;

 
SELECT SessionId, TransactionType, DateTimeEnd,DateStart, sumRankFROM (SELECT t.*,             SUM(Rank) OVER (PARTITION BY SessionId, TransactionType, grp) as sumRank      FROM (SELECT t.*,                   SUM(CASE WHEN prev_st_id = prev_id THEN 0 ELSE 1 END) OVER (ORDER BY id) as grp            FROM (SELECT t.*,                         LAG(id) OVER (PARTITION BY SessionId, TransactionType ORDER BY id) as prev_st_id,                         LAG(id) OVER (PARTITION BY SessionId ORDER BY id) as prev_id                  FROM Transactions t                 ) t           ) t     ) tWHERE prev_st_id <> prev_id OR prev_st_id IS NULL;​

What does this do?

Innermost subquery calculates the lag of the id both overall and by session/transaction type. This uses id because it seems more stable than date/times (there are duplicate date/time values in one of the columns).
When the ids are different, then a new island is identified. The cumulative sum identifies the groups.
This grp is then used to calculate values over the entire group, using window functions.
The outer query then just filters down to the first row in each group.

Here is a db<>fiddle.

Advertisement

Answer