Skip to content
Advertisement

SQL: exclude rows that are duplicates according to some column arrangement

In my problem, I have these rows e.g.:

    id1  |name1| id2 | name2| c 
1   10994,Apple,22265,Banana,103
2   22265,Banana,10994,Apple,103
3   20945,Coconut,20391,Date,101
4   20391,Date,20945,Coconut,101

They show pair-wise combinations of ids and names, together with another column c.

I consider row 1+2, and 3+4 as duplicates, considering the pairings of ids and names. Rows 1+2, or 3+4 show basically the same information.

I had no luck removing the duplicates with grouping, because id1 + id2, or name1 + name2 respectively are distinct columns.

Can I add something to my SQL query that removes these ‘duplicates’, so that only rows 1+3 are output?

Advertisement

Answer

One approach here would to be aggregate by the c column and then use a least/greatest trick to tease apart the duplicates. Here is a general solution which should work on any version of SQL:

SELECT DISTINCT c, CASE WHEN id1 < id2 THEN id1 ELSE id2 END AS id1,
                   CASE WHEN id1 < id2 THEN id2 ELSE id1 END AS id2,
                   CASE WHEN name1 < name2 THEN name1 ELSE name2 END AS name1,
                   CASE WHEN name1 < name2 THEN name2 ELSE name1 END AS name2
FROM yourTable;

Here is a working demo.

Note that on databases which have a LEAST and GREATEST function, the above can be simplified to:

SELECT DISTINCT c, LEAST(id1, id2) AS id1,
                   GREATEST(id1, id2) AS id2,
                   LEAST(name1, name2) AS name1,
                   GREATEST(name1, name2) AS name2
FROM yourTable;
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement