Skip to content
Advertisement

Deleting massive number of duplicate records without using a new table

Right now I have a table that has a massive number of duplicates that need to be deleted (about 500 million).

I have a query that will delete all the duplicates, but it is unable to finish the whole query due to the transaction log being filled up.

Moving the nonduplicates to a new table, then renaming it, would work, but in this scenario I am not able to do it this way. This will be performed in a production environment so I can not drop that d1 table.

Same with the other solution that involved changing some sort of backup transaction log setting.

This is my query:

;WITH CTE AS 
(
    SELECT 
        d_id, d_record, d_d2id, 
        ROW_NUMBER() OVER (PARTITION BY d_record, d_d2id ORDER BY d_id) RowNumber
    FROM 
        d1
    WHERE 
        d_d2id >= 25 AND d_d2id <= 28
)
DELETE FROM CTE 
WHERE RowNumber > 1

Obviously this will work, however due to the amount of deletes this will have to perform, it will blow up the transaction log.

Is there a way to create this specific CTE then go through it in batches of 1000 records and deleting them that way thus leaving a whole bunch of transactions instead of 1? Or is there another way I can accomplish this? The only solution I have is to loop through these duplicates and delete them without blowing up the transaction log.

Thanks!

Advertisement

Answer

In SQL Server, you can delete in batches. Although this is not the most efficient code, it illustrates the idea of deleting in batches:

DECLARE @go_on INT
SELECT @go_on = 1;

WHILE (@go_on = 1)
BEGIN
    WITH TODELETE AS (
          SELECT TOP (10000) d1.*
          FROM (SELECT d1.*,
                       ROW_NUMBER() OVER (PARTITION BY d_record, d_d2id ORDER BY d_id) as seqnum
                FROM d1
                WHERE d_d2id >= 25 AND d_d2id <= 28
               ) d1
          WHERE seqnum > 1
         )
    DELETE FROM TODELETE; 

    SET @go_on = (CASE WHEN @@ROWCOUNT > 0 THEN 1 ELSE 0 END);
END;

It would be more efficient to store the rows to be deleted in a temporary table or table variable, so they don’t need to be recalculated each time.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement