I asked this in dba stack exchange but haven’t had any luck. Cross-posting.
I am close to figuring this out but I’m just stuck at a wall. I’m attempting to understand a post by Aaron Betrand and apply it to a situation I’ve encountered where I have a changes table that’s heavily duplicated due to prior design error I’m inheriting. The sample data set is identical in concept to my real data set, except SortOrder would usually be a datetime value and not an integer. The code I’ve tried is here:
; with main as ( select *, ROW_NUMBER() over (partition by ID, Val, sortorder order by ID, SortOrder) as "Rank", row_number() over (partition by ID, val order by ID, sortorder) as "s_rank" from (values (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4), (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3), (3, 'A', 1), (3, 'A', 1), (3, 'A', 2) ) as x("ID", "VAL", "SortOrder") group by id, val, SortOrder --order by ID, "SortOrder" ), cte_rest as ( select * from main where "s_rank" > 1 ) select * from main left join cte_rest rest on main.id = rest.id and main.s_rank > 1 and main.SortOrder = rest.SortOrder --where not exists (select 1 from cte_rest r where r.id = main.id and r.val <> main.VAL and main.s_rank < s_rank) order by main.ID, main.SortOrder
The results are almost valid; however, the last row highlights a situation that I haven’t been able to account for: the date changes, the value doesn’t. I want the last record to be excluded because it’s not a true value change.
╔════╦═════╦═══════════╦══════╦════════╦══════╦══════╦═══════════╦══════╦════════╗ ║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║ ╠════╬═════╬═══════════╬══════╬════════╬══════╬══════╬═══════════╬══════╬════════╣ ║ 1 ║ A ║ 1 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║ ║ 1 ║ B ║ 2 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║ ║ 1 ║ C ║ 3 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║ ║ 1 ║ B ║ 4 ║ 1 ║ 2 ║ 1 ║ B ║ 4 ║ 1 ║ 2 ║ ║ 1 ║ A ║ 5 ║ 1 ║ 2 ║ 1 ║ A ║ 5 ║ 1 ║ 2 ║ ║ 2 ║ A ║ 1 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║ ║ 2 ║ B ║ 2 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║ ║ 2 ║ A ║ 3 ║ 1 ║ 2 ║ 2 ║ A ║ 3 ║ 1 ║ 2 ║ ║ 3 ║ A ║ 1 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║ ║ 3 ║ A ║ 2 ║ 1 ║ 2 ║ 3 ║ A ║ 2 ║ 1 ║ 2 ║ ╚════╩═════╩═══════════╩══════╩════════╩══════╩══════╩═══════════╩══════╩════════╝
A colleague of mine suggested this code, and while I can follow how it arrives, I don’t understand why the first code sample doesn’t work. It feels to me like this would require a lot of extra parsing, and with a large data set I’d be worried about performance impacts.
WITH cte1 AS (SELECT [id] , [val] , [sortorder] , ROW_NUMBER() OVER(PARTITION BY [id] , [val] , [sortorder] ORDER BY [id] , [sortorder]) AS "rankall" FROM (VALUES ( 1, 'A', 1 ), ( 1, 'A', 1 ), ( 1, 'B', 2 ), ( 1, 'C', 3 ), ( 1, 'B', 4 ), ( 1, 'A', 5 ), ( 1, 'A', 5 ), ( 2, 'A', 1 ), ( 2, 'B', 2 ), ( 2, 'A', 3 ), ( 3, 'A', 1 ), ( 3, 'A', 1 ), ( 3, 'A', 2 )) AS x("id", "val", "sortorder")), ctedropped AS (SELECT [id] , [val] , [sortorder] , ROW_NUMBER() OVER(PARTITION BY [id] , [val] , [sortorder] ORDER BY [id] , [sortorder]) AS "rankall" FROM cte1 WHERE [cte1].[rankall] > 1) SELECT [cte1].[id] , [cte1].[val] , [cte1].[sortorder] FROM cte1 WHERE NOT EXISTS ( SELECT * FROM [ctedropped] WHERE [cte1].[id] = [ctedropped].[id] AND [cte1].[val] = [ctedropped].[val] AND [cte1].[rankall] = [ctedropped].[rankall] ) ORDER BY [cte1].[id] , [cte1].[sortorder];
Advertisement
Answer
If you just want to remove rows where the value doesn’t change you can apply this logic:
WITH cte1 AS ( SELECT [id] , [val] , [sortorder] , Lag(val) Over(PARTITION BY [id] ORDER BY [sortorder]) AS prevval FROM demo ) SELECT * FROM cte1 WHERE prevval IS NULL -- first row OR prevval <> val -- value changed
See fiddle