Avoid duplicates when inserting data into a SQL Server table without unique column

I’m archiving a certain table on a regular basis and would like to avoid duplicates in that archive. There is no unique column in my table, so I’m currently using the following approach:

INSERT INTO archive (colA, colB, colC)
    SELECT
        colA, colB, colC
    FROM 
        dim.source src
    WHERE
        CONCAT(src.colA, src.colB, src.colC) NOT IN (SELECT CONCAT(colA, colB, colC) FROM archive)

This works, but given that archive contains >40M rows it’s quite slow (10+ minutes). Are there any more efficient options?

Thanks!

I’ve tried several other solutions from Stack Overflow, but they didn’t work for me and resulted in error messages.

Answer

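MySQL allows tuple comparison, so you don’t need CONCAT there (SQL Server does not accept this row-value form of IN, so this option only applies if you are on MySQL):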

WHERE (src.colA, src.colB, src.colC) NOT IN (SELECT colA, colB, colC FROM archive)

Wrapping the columns in a function such as CONCAT prevents the query from taking advantage of any indexes on them. I am not sure how efficient tuple comparisons are either, so an anti-join might be better:

LEFT JOIN archive AS a ON src.colA = a.colA AND src.colB = a.colB AND src.colC = a.colC
WHERE a.archive_id IS NULL -- archive_id can be replaced with any archive column that is never NULL in a matched row
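
For context, the join fragment above slots into the original INSERT roughly as sketched below. This is a minimal, self-contained illustration and not a benchmark. It uses Python's built-in sqlite3 module with an in-memory database purely so it can be run anywhere; the table is named source rather than dim.source, the sample rows are made up, and a.colA stands in for archive_id on the assumption that colA is never NULL in archive. The SQL statement itself is plain enough that it should carry over to SQL Server, but that has not been checked here.

```python
import sqlite3

# Anti-join: the LEFT JOIN leaves a.* NULL when no archive row matches,
# so the WHERE clause keeps only source rows that are not archived yet.
# Assumption: colA is never NULL in archive (it replaces archive_id here).
LEFT_JOIN_INSERT = """
INSERT INTO archive (colA, colB, colC)
SELECT src.colA, src.colB, src.colC
FROM source AS src
LEFT JOIN archive AS a
       ON src.colA = a.colA AND src.colB = a.colB AND src.colC = a.colC
WHERE a.colA IS NULL
"""


def run_left_join_demo():
    """Build two tiny hypothetical tables, archive once, return archive rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source  (colA TEXT, colB TEXT, colC TEXT);
        CREATE TABLE archive (colA TEXT, colB TEXT, colC TEXT);
        INSERT INTO source  VALUES ('a', 'b', 'c'), ('d', 'e', 'f');
        INSERT INTO archive VALUES ('a', 'b', 'c');
    """)
    conn.execute(LEFT_JOIN_INSERT)
    conn.commit()
    rows = conn.execute(
        "SELECT colA, colB, colC FROM archive ORDER BY colA"
    ).fetchall()
    conn.close()
    return rows
```

One thing to keep in mind with this form: the join uses plain equality, so a source row with NULL in any of the three columns never matches an archive row and would be inserted again on every run.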

If neither of those performs well, you can try a correlated subquery instead:

WHERE NOT EXISTS (SELECT * FROM archive AS a WHERE a.colA = src.colA AND a.colB = src.colB AND a.colC = src.colC)
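
As a rough sketch of the complete statement, the NOT EXISTS filter can be combined with the original INSERT as shown below. Since the point about indexes matters here, the example also creates a composite index on the three compared columns; the index name ix_archive_abc is hypothetical, and whether such an index actually brings the 10+ minutes down on a 40M-row archive depends on the data and is not something this sketch measures. As before, Python's sqlite3 with an in-memory database and a table called source (instead of dim.source) are assumptions made only so the example runs on its own.

```python
import sqlite3

# Hypothetical composite index so each NOT EXISTS probe can seek in archive
# instead of scanning it.
CREATE_INDEX = "CREATE INDEX ix_archive_abc ON archive (colA, colB, colC)"

# Insert only those source rows that have no exact match in archive.
NOT_EXISTS_INSERT = """
INSERT INTO archive (colA, colB, colC)
SELECT src.colA, src.colB, src.colC
FROM source AS src
WHERE NOT EXISTS (
    SELECT 1
    FROM archive AS a
    WHERE a.colA = src.colA AND a.colB = src.colB AND a.colC = src.colC
)
"""


def run_not_exists_demo(runs=2):
    """Archive the same source `runs` times; return the archive row count."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source  (colA TEXT, colB TEXT, colC TEXT);
        CREATE TABLE archive (colA TEXT, colB TEXT, colC TEXT);
        INSERT INTO source VALUES ('a', 'b', 'c'), ('d', 'e', 'f');
    """)
    conn.execute(CREATE_INDEX)
    for _ in range(runs):
        conn.execute(NOT_EXISTS_INSERT)
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM archive").fetchone()[0]
    conn.close()
    return count
```

Calling run_not_exists_demo with any number of runs leaves the archive holding each source row once, which is the behaviour the question is after.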
User contributions licensed under: CC BY-SA