I’m hoping to find duplicates across groups of IDs.
The Dup variable defines whether an Item has reappeared in other ID groups.In other words, if the same item is found in another ID then Dup is ‘yes’. If the Item only repeats within the same ID, then Dup is ‘no’.
The outcome will be like the following:
Table1:
+----+--------+-----+ | ID | Item | Dup | +----+--------+-----+ | a | apple | yes | | a | apple | yes | | b | apple | yes | | c | orange | no | | c | orange | no | | d | pear | yes | | f | pear | yes | | e | mango | no | +----+--------+-----+
How should I construct a sql query to detect the duplicates (I work in hive environment, but most syntax should be similar)? Any help will be appreciative!
Advertisement
Answer
You can use window functions like this:
select t.*, (case when min(id) over (partition by item) = max(id) over (partition by item) then 'no' else 'yes' end) as dups from t;