Updated 10/08/2019:
@Gordon Linoff: I have tried to applied your solution but I realized that is not working as expected. I have added an example with expected result with comments here (https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=1b486476d6aeab25997f25e66ee455e9) and I would be grateful if you could help me.
—
I have a table of transactions with schema:
CREATE TABLE Transactions (Id int IDENTITY, SessionId int, TransactionType varchar(50), DateTimeEnd datetime, DateStart datetime, Rank int);
Here are some row examples:
INSERT INTO Transactions (Id, SessionId, TransactionType, DateTimeEnd, DateStart, Rank) VALUES (1, 1, 'Deposit', '2017-01-20T11:16:33Z', '2017-01-20T11:16:33Z', 600), (2, 1, 'Withdrawal', '2017-01-21T11:16:33Z', '2017-01-20T11:16:33Z', 100), (3, 2, 'Deposit', '2017-02-23T11:16:33Z', '2017-02-23T11:16:33Z', 500), (4, 1, 'Withdrawal', '2017-01-24T11:16:33Z', '2017-01-21T11:16:33Z', 150), (5, 1, 'Withdrawal', '2017-01-26T11:16:33Z', '2017-01-24T11:16:33Z', 150), (6, 2, 'Withdrawal', '2017-02-27T11:16:33Z', '2017-02-23T11:16:33Z', 200), (7, 1, 'Withdrawal', '2017-01-28T11:16:33Z', '2017-01-26T11:16:33Z', 10), (8, 1, 'Withdrawal', '2017-01-30T11:16:33Z', '2017-01-28T11:16:33Z', 10), (9, 1, 'Withdrawal', '2017-01-31T11:16:33Z', '2017-01-30T11:16:33Z', 10);
What I want is a T-SQL query to merge groups of consecutive rows by SessionId, TransactionType and from each group to keep only the row with the minimum DateTimeEnd. Also, the Rank value of the row kept has to be the sum of Rank values from the rows of the group. The T-SQL query needs to run in MS SQL Server in Microsoft Azure SQL Data Warehouse.
Desired Result:
| Id | SessionId | Transaction | DateTimeEnd | DateStart | Rank | |----------|------------------|-------------|--------------------|--------------------|---------| | 1 | 1 | Deposit|2017-01-20T11:16:33Z|2017-01-20T11:16:33Z| 600 | | 2 | 1 | Withdrawal|2017-01-21T11:16:33Z|2017-01-20T11:16:33Z| 100 | | 4 | 1 | Withdrawal|2017-01-24T11:16:33Z|2017-01-21T11:16:33Z| 300 | | 7 | 1 | Withdrawal|2017-01-28T11:16:33Z|2017-01-26T11:16:33Z| 30 | | 3 | 2 | Deposit|2017-02-23T11:16:33Z|2017-02-23T11:16:33Z| 500 | | 6 | 2 | Withdrawal|2017-02-27T11:16:33Z|2017-02-23T11:16:33Z| 200 |
I have tried so many approaches and couldn’t implement it.
Advertisement
Answer
This is a gaps-and-islands problem, as GMB points out. Because you want to keep the first row, I’m going to suggest a lag()
approach instead of the difference of row numbers:
SELECT SessionId, TransactionType, DateTimeEnd,DateStart, sumRank FROM (SELECT t.*, SUM(Rank) OVER (PARTITION BY SessionId, TransactionType, grp) as sumRank FROM (SELECT t.*, SUM(CASE WHEN prev_st_id = prev_id THEN 0 ELSE 1 END) OVER (ORDER BY id) as grp FROM (SELECT t.*, LAG(id) OVER (PARTITION BY SessionId, TransactionType ORDER BY id) as prev_st_id, LAG(id) OVER (PARTITION BY SessionId ORDER BY id) as prev_id FROM Transactions t ) t ) t ) t WHERE prev_st_id <> prev_id OR prev_st_id IS NULL;
What does this do?
- Innermost subquery calculates the lag of the id both overall and by session/transaction type. This uses
id
because it seems more stable than date/times (there are duplicate date/time values in one of the columns). - When the ids are different, then a new island is identified. The cumulative sum identifies the groups.
- This
grp
is then used to calculate values over the entire group, using window functions. - The outer query then just filters down to the first row in each group.
Here is a db<>fiddle.