I’ve this table below that includes ID
, and five indicator columns: x1
, …, x5
:
I need to remove duplicates based on this logic:
For each ID
, we look at the values of x1
, …, x5
, and we remove the ones that are subset of other row. For example, for ID=1, row #3 is a subset of row #2, so we remove row #3. Also, row #4 is NOT a subset of row #2, so we keep it.
Here is the expected final view of the table:
Advertisement
Answer
First concatenate all values of the 5 columns so that you get a binary string like ‘100101’ which can be converted to a base 10 number, say aliased value
, with the function CONV()
.
Assuming there are no duplicate rows in the table as you mention in your comment, a row #X should be considered a subset of another row #Y if the result of the bitwise OR
between the value
s of the 2 rows returns the value
of #Y:
SELECT t1.* FROM tablename t1 WHERE NOT EXISTS ( SELECT * FROM tablename t2 WHERE t2.ID = t1.ID AND (t1.x1, t1.x2, t1.x3, t1.x4, t1.x5) <> (t2.x1, t2.x2, t2.x3, t2.x4, t2.x5) AND CONV(CONCAT(t1.x1, t1.x2, t1.x3, t1.x4, t1.x5), 2, 10) | CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10) = CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10) );
Or, for MySql 8.0+:
WITH cte AS ( SELECT *, CONV(CONCAT(x1, x2, x3, x4, x5), 2, 10) value FROM tablename ) SELECT t1.ID, t1.x1, t1.x2, t1.x3, t1.x4, t1.x5 FROM cte t1 WHERE NOT EXISTS ( SELECT * FROM cte t2 WHERE t2.ID = t1.ID AND t2.value <> t1.value AND t1.value | t2.value = t2.value );
If you want to delete the subset rows, use a self join of the table like this:
DELETE t1 FROM tablename t1 INNER JOIN tablename t2 ON t2.ID = t1.ID AND (t1.x1, t1.x2, t1.x3, t1.x4, t1.x5) <> (t2.x1, t2.x2, t2.x3, t2.x4, t2.x5) AND CONV(CONCAT(t1.x1, t1.x2, t1.x3, t1.x4, t1.x5), 2, 10) | CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10) = CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10);
See the demo.