Remove duplicate rows from a big table

I got data from a third party and imported it into SQL Server. The table has 255,072,636 records, of which 61,714,772 are unique. The table has no particular order and no indexes. It has 4 columns: Field1 (float), Field2 (varchar(255)), Field3 (varchar(255)), Field4 (varchar(255)). I want to delete the duplicate records based on Field1, for which I ran the following query:

WITH CTE(Field1, Field2, Field3, Field4, DuplicateCount)
AS (SELECT *, 
           ROW_NUMBER() OVER(PARTITION BY Field1 ORDER BY Field1) AS DuplicateCount
    FROM MyTable)
DELETE FROM CTE
WHERE DuplicateCount > 1;

but I cancelled it after 13 hours, and the figures above are from after that cancellation. Is there a better way to remove the duplicates?
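
For reference, the row and distinct-value counts quoted above can be checked with a query along these lines (a minimal sketch, assuming the table and column names from the question):

-- Total rows vs. distinct Field1 values; on an unindexed heap this is a full scan.
SELECT COUNT_BIG(*)               AS TotalRows,
       COUNT_BIG(DISTINCT Field1) AS UniqueField1
FROM MyTable;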


Answer

Thanks to “Kazi Mohammad Ali Nur” and “eshirvana”; I combined their solutions. First I created a clustered index on Field1:

CREATE CLUSTERED INDEX Index_Name   
    ON MyTable(Field1);

Then I executed the following query to insert the unique records into a new table, and deleted the original table:

WITH CTE(Field1, Field2, Field3, Field4, DuplicateCount)
AS (SELECT *, 
           ROW_NUMBER() OVER(PARTITION BY Field1 ORDER BY Field1) AS DuplicateCount
    FROM MyTable)
-- Copy only the first row per Field1 value into a new table,
-- leaving the helper DuplicateCount column behind.
SELECT Field1, Field2, Field3, Field4
INTO TempTable
FROM CTE
WHERE DuplicateCount = 1;

and it worked.
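
To complete the swap described above, the original table can be dropped and the new table renamed to take its place. This is a minimal sketch; the sp_rename step and the final unique clustered index are assumptions, since the answer only states that the original table was deleted.

-- Drop the original table and give the deduplicated copy its name
-- (assumption: the original table name should be kept).
DROP TABLE MyTable;
EXEC sp_rename 'TempTable', 'MyTable';

-- Optional: re-create the clustered index; Field1 is now unique, so it can be declared UNIQUE.
CREATE UNIQUE CLUSTERED INDEX Index_Name ON MyTable(Field1);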

Thanks to all.

User contributions licensed under: CC BY-SA