I’m archiving a certain table on regular basis and would like to avoid having duplicates in that archive. There is no unique column in my table, so I’m currently using the following approach:
INSERT INTO archive (colA, colB, colC) SELECT colA, colB, colC FROM dim.source src WHERE CONCAT(src.colA, src.colB, src.colC) NOT IN (SELECT CONCAT(colA, colB, colC) FROM archive)
This works, but given that archive contains >40M rows it’s quite slow (10+ minutes). Are there any more efficient options?
Thanks!
I’ve tried several other solutions on stackoverflow, but they didn’t work for me and result in an error message
Advertisement
Answer
MySQL allows tuple comparison, so you don’t need CONCAT:
WHERE (src.colA, src.colB, src.colC) NOT IN (SELECT colA, colB, colC FROM archive)
That kind of function use eliminates the possibility of taking advantage of any indexes; but I am not sure how efficient tuple comparisons are either, so this might be better:
LEFT JOIN archive AS a ON src.A = a.colA AND src.B = a.colB AND src.C = a.colC WHERE a.archive_id IS NULL -- archive_id can be replaced with any field from archive you know would not be null if there were a match.
and when none of those work well, you can always try a correlated subquery instead:
WHERE NOT EXISTS (SELECT * FROM archive WHERE colA = src.A AND colB = src.B AND colC = src.C)