Skip to content
Advertisement

SQL : Identify item duplicates across groups

I’m hoping to find duplicates across groups of IDs.

The Dup variable defines whether an Item has reappeared in other ID groups.In other words, if the same item is found in another ID then Dup is ‘yes’. If the Item only repeats within the same ID, then Dup is ‘no’.

The outcome will be like the following:

Table1:

+----+--------+-----+
| ID |  Item  | Dup |
+----+--------+-----+
| a  | apple  | yes |
| a  | apple  | yes |
| b  | apple  | yes |
| c  | orange | no  |
| c  | orange | no  |
| d  | pear   | yes |
| f  | pear   | yes |
| e  | mango  | no  |
+----+--------+-----+

How should I construct a sql query to detect the duplicates (I work in hive environment, but most syntax should be similar)? Any help will be appreciative!

Advertisement

Answer

You can use window functions like this:

select t.*,
       (case when min(id) over (partition by item) = max(id) over (partition by item)
             then 'no' else 'yes'
        end) as dups
from t;
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement