Skip to content
Advertisement

Deleting duplicate entries in table depends on several columns

I want to delete all duplicate entries in my MySQL database.

I found a lot of solutions in the web but in every case the query worked only when you are looking up for 1 column.

In my case I need a query for more than 1 row which will be executed as far as possible – the table is 500MB big with more than 6 Million entries.

My Table (example)

id  name status   email
----------------------------
x   Mark  1      Mark@trash-mail.com
x   Anna  1      Anna@trash-mail.com
x   Mark  0      Mark@trash-mail.com
x   Mark  1      Mark@trash-mail.com

What should now happen? I need a query like this:

DELETE * FROM my_table WHERE (name == name, status == status, email == email)

When executing this query the table has to look like this:

id  name status   email
----------------------------
x   Mark  1      Mark@trash-mail.com
x   Anna  1      Anna@trash-mail.com
x   Mark  0      Mark@trash-mail.com

The last Mark-Entry was deleted because the name, status, and email field was equal to another. In some cases there are 20 or more with the same data which has to be deleted so that there is just one left.

Currently I have a bad solution with a PHP-Script. I am going to each row, looking for the same data and delete it. It works like it should but its too slow… I think maybe 5.000 Items per Hour…

You know any solution to solve this in a better way?

Advertisement

Answer

My suggestion would be to use the method where you truncate the table and reinsert the data. Something like:

create temporary table temp as
    select min(id) as id, name, status, email
    from mytable
    group by name, status, email;

truncate table mytable;

insert into mytable(id, name, status, email)
    select id, name, status, email
    from temp;

Doing the deleting in place can be quite expensive. If you decide to do so, I would create the same temporary table with an index on id, and then use:

delete m
    from mytable m left join
         temp
         on m.id = temp.id
    where temp.id is null;

You can use a limit clause (such as limit 10000) to run this in batches.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement