Skip to content
Advertisement

Understanding window functions to deduplicate records while retaining true changes

I asked this in dba stack exchange but haven’t had any luck. Cross-posting.

SQLFIDDLE

I am close to figuring this out but I’m just stuck at a wall. I’m attempting to understand a post by Aaron Betrand and apply it to a situation I’ve encountered where I have a changes table that’s heavily duplicated due to prior design error I’m inheriting. The sample data set is identical in concept to my real data set, except SortOrder would usually be a datetime value and not an integer. The code I’ve tried is here:

; with main as (
   select *, ROW_NUMBER() over (partition by ID, Val, sortorder order by ID,
      SortOrder) as "Rank",
      row_number() over (partition by ID, val order by ID, sortorder) as "s_rank" 
   from 
      (values (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4),
              (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3), 
              (3, 'A', 1), (3, 'A', 1), (3, 'A', 2)
      ) as x("ID", "VAL", "SortOrder")
   group by id, val, SortOrder
   --order by ID, "SortOrder"
),
cte_rest as (
   select *
   from main
   where "s_rank" > 1
)

select *
from main
left join cte_rest rest
   on main.id = rest.id
   and main.s_rank > 1
   and main.SortOrder = rest.SortOrder
--where not exists (select 1 from cte_rest r where r.id = main.id and r.val <> main.VAL and main.s_rank < s_rank)
order by main.ID, main.SortOrder

The results are almost valid; however, the last row highlights a situation that I haven’t been able to account for: the date changes, the value doesn’t. I want the last record to be excluded because it’s not a true value change.

╔════╦═════╦═══════════╦══════╦════════╦══════╦══════╦═══════════╦══════╦════════╗
║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║  ID  ║ VAL  ║ SortOrder ║ Rank ║ s_rank ║
╠════╬═════╬═══════════╬══════╬════════╬══════╬══════╬═══════════╬══════╬════════╣
║  1 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ B   ║         2 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ C   ║         3 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ B   ║         4 ║    1 ║      2 ║ 1    ║ B    ║ 4         ║ 1    ║ 2      ║
║  1 ║ A   ║         5 ║    1 ║      2 ║ 1    ║ A    ║ 5         ║ 1    ║ 2      ║
║  2 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  2 ║ B   ║         2 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  2 ║ A   ║         3 ║    1 ║      2 ║ 2    ║ A    ║ 3         ║ 1    ║ 2      ║
║  3 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  3 ║ A   ║         2 ║    1 ║      2 ║ 3    ║ A    ║ 2         ║ 1    ║ 2      ║
╚════╩═════╩═══════════╩══════╩════════╩══════╩══════╩═══════════╩══════╩════════╝

A colleague of mine suggested this code, and while I can follow how it arrives, I don’t understand why the first code sample doesn’t work. It feels to me like this would require a lot of extra parsing, and with a large data set I’d be worried about performance impacts.

WITH cte1
     AS (SELECT [id]
              , [val]
              , [sortorder]
              , ROW_NUMBER() OVER(PARTITION BY [id]
                                             , [val]
                                             , [sortorder]
                ORDER BY [id]
                       , [sortorder]) AS "rankall"
         FROM   (VALUES
                        ( 1, 'A', 1 ),
                        ( 1, 'A', 1 ),
                        ( 1, 'B', 2 ),
                        ( 1, 'C', 3 ),
                        ( 1, 'B', 4 ),
                        ( 1, 'A', 5 ),
                        ( 1, 'A', 5 ),
                        ( 2, 'A', 1 ),
                        ( 2, 'B', 2 ),
                        ( 2, 'A', 3 ),
                        ( 3, 'A', 1 ),
                        ( 3, 'A', 1 ),
                        ( 3, 'A', 2 )) AS x("id", "val", "sortorder")),
     ctedropped
     AS (SELECT [id]
              , [val]
              , [sortorder]
              , ROW_NUMBER() OVER(PARTITION BY [id]
                                             , [val]
                                             , [sortorder]
                ORDER BY [id]
                       , [sortorder]) AS "rankall"
         FROM   cte1
         WHERE  [cte1].[rankall] > 1)
     SELECT [cte1].[id]
          , [cte1].[val]
          , [cte1].[sortorder]
     FROM   cte1
     WHERE  NOT EXISTS
     (
         SELECT *
         FROM   [ctedropped]
         WHERE  [cte1].[id] = [ctedropped].[id] AND 
                [cte1].[val] = [ctedropped].[val] AND 
                [cte1].[rankall] = [ctedropped].[rankall]
     )
     ORDER BY [cte1].[id]
            , [cte1].[sortorder];

Advertisement

Answer

If you just want to remove rows where the value doesn’t change you can apply this logic:

WITH cte1 AS
 (
   SELECT [id]
        , [val]
        , [sortorder]
        , Lag(val) Over(PARTITION BY [id]
                        ORDER BY [sortorder]) AS prevval
   FROM    demo
 )
SELECT * 
FROM cte1
WHERE prevval IS NULL  -- first row
   OR prevval <> val   -- value changed

See fiddle

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement