I’m learning MySQL and today I tried to solve an MySQL question on leetcode: https://leetcode.com/problems/delete-duplicate-emails/solution/
+----+------------------+ | Id | Email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | | 3 | john@example.com | +----+------------------+ Id is the primary key column for this table.
The basic idea is to remove duplicate rows from the table and only keep the one with the smallest Id. Expect results:
+----+------------------+ | Id | Email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | +----+------------------+
The solution is
DELETE p1 FROM Person p1,Person p2 WHERE p1.Email = p2.Email AND p1.Id > p2.Id
And I tried a second test case myself:
+----+------------------+ | Id | Email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | | 3 | john@example.com | | 4 | john@example.com | +----+------------------+
The output is:
+----+------------------+ | Id | Email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | +----+------------------+
I don’t understand why the condition p1.Id > p2.Id only keeps the row with the smallest id because I think it only rules out the smaller one, it wouldn’t apply to the case where there were more than 2 duplicates, right? But as the second test case shows, it indeed works. I think it’s because of the syntax of
DELETE Table1 FROM Table1 INNER JOIN Table2 ON Table1.ID = Table2.ID
How does this work exactly? Can someone gives me an explanation here, thanks!
Advertisement
Answer
First, this is a very bad way of implementing this code. But I guess you get what you pay for.
Second, simply run the query as a select
:
SELECT p1.*, p2.* FROM Person p1 JOIN Person p2 ON p1.Email = p2.Email AND p1.Id > p2.Id;
(Note that I’ve rewritten the logic as a JOIN
. You should always use proper, explicit, standard, readable JOIN
syntax, but the two methods are functionally equivalent.)
On your second example, the results of this query are:
table1 email table1 id table2 id john@example.com. 2 1 john@example.com. 3 1 john@example.com. 3 2
What is notable is that id = 1 is never in the second column — and that is the column that determines which ids are deleted. In other words, all but the smallest id for each email get deleted because there is a smaller id
.
This also hints at why this is a really bad solution. MySQL has to deal with two rows for id = 3
. Perhaps it attempts to delete both. Perhaps it has to just deal with extra data. Either way, there is extra work. And the more rows with the same email in the data the more extra duplicates are created.
An alternative method, such as:
delete p from person p join (select email, min(id) as min_id from person p2 group by email ) p2 on p.email = p2.email and p.id > p2.min_id;
Does not have this problem and, in my opinion, the intent is clearer.