I have a database table with 33,265,902 records in it.
Much of this is made up of duplicates.
It is causing me issues, as any website that queries this table puts load on the server.
Uniqueness in this table is defined by two fields: email_address and job_id.
My question is: what is the best way to delete all duplicates, leaving one good record per email_address/job_id combination?
What is the most efficient way to do this without killing the server?
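A quick way to see how widespread the duplication is (using t as a stand-in for the real table name, since it isn't given here):

    -- count how many (email_address, job_id) pairs appear more than once
    select email_address, job_id, count(*) as copies
    from t
    group by email_address, job_id
    having count(*) > 1
    order by copies desc;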
Answer
The most efficient way is probably to keep one row per email_address/job_id combination in a new table. Then truncate the old table and re-insert the values:
    select t.*    -- or list all the columns except seqnum
    into temp_table
    from (select t.*,
                 row_number() over (partition by email_address, job_id
                                    order by job_id) as seqnum
          from t
         ) t
    where seqnum = 1;

    -- not needed if you listed the columns explicitly above
    alter table temp_table drop column seqnum;

    -- back this up first!
    truncate table t;

    insert into t
        select * from temp_table;
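Before running the truncate, it is worth a quick sanity check that temp_table really does hold one row per pair; a minimal check, assuming the same names as above:

    -- these two counts should match before you truncate t
    select count(*) from temp_table;
    select count(*) from (select distinct email_address, job_id from t) d;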
You could drop the original table and rename temp_table
to the table name. If you do so, remember to re-create constraints, indexes, triggers, partitions, and whatever else is on the original table.
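If you go that route, the statements would look roughly like this. This is a sketch assuming SQL Server (the select ... into syntax above suggests it), and the unique index at the end is only an example of the kind of object you would need to re-create:

    -- back up first, then swap in the de-duplicated copy
    drop table t;
    exec sp_rename 'temp_table', 't';

    -- re-create whatever existed on the original table, for example
    -- a unique index on the two fields that define uniqueness:
    create unique index ux_t_email_job on t (email_address, job_id);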
This will not be super fast (you can gauge the time just by timing the creation of the temporary table). However, it does not require the locking and logging overhead of deleting rows from the original table. As a bonus, the new table should be less fragmented than it would be if you did the deletes.
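For comparison, the delete-based approach that last sentence refers to would look something like the sketch below (same assumed names, SQL Server-style deletable CTE); on a table this size it generates far more locking and log activity, which is why the rebuild above is preferable:

    -- delete every copy beyond the first per (email_address, job_id)
    with ranked as (
        select row_number() over (partition by email_address, job_id
                                  order by job_id) as seqnum
        from t
    )
    delete from ranked
    where seqnum > 1;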