Skip to content
Advertisement

SQL – Finding Duplicate Records based certain criteria

I have these records in the table – employee_projects

id employee_id project_id status
1 emp1 proj1 VERIFIED
2 emp2 proj2 REJECTED
3 emp1 proj1 VERIFIED
4 emp1 proj3 REJECTED
5 emp2 proj2 REQUIRED
6 emp3 proj4 SUBMITTED
7 emp4 proj5 VERIFIED
8 emp4 proj6 VERIFIED
9 emp3 proj4 REQUIRED

Here are the criteria for determining duplicates:

  1. Same employee ID, same project ID under the same status (Example: rows 1 and 3 are duplicates)
  2. Same employee ID, same project ID but in different status (Example: rows 6 and 9 are duplicates). An exception to duplication criteria#2 is if one project is REQUIRED and the same project is also REJECTED under the same employee, this is NOT considered a duplicate. For example, rows 2 and 5 are NOT duplicates.

I have a query for the first criterion:

select
    emp_id,
    proj_id,
    status,
    COUNT(*)
from
    employee_projects
group by
    emp_id,
    proj_id,
    status
having
    COUNT(*) > 1

What I’m struggling to construct is the SQL for the second criterion.

Advertisement

Answer

maybe a self join can help you.

with t (employee_id ,project_id,status)
as
(
select 'emp1',  'proj1' ,   'VERIFIED'
Union all select 'emp2',    'proj2' ,   'REJECTED'
Union all select 'emp1',    'proj1' ,   'VERIFIED'
Union all select 'emp1',    'proj3' ,   'REJECTED'
Union all select 'emp2',    'proj2' ,   'REQUIRED'
Union all select 'emp3',    'proj4' ,   'SUBMITTED'
Union all select 'emp4',    'proj5' ,   'VERIFIED'
Union all select 'emp4',    'proj6' ,   'VERIFIED'
Union all select 'emp3',    'proj4' ,   'REQUIRED'
)
select
    t.employee_id,
    t.project_id,
    t.status,
    '' as status,
    'criteria#1' as SQL
from
    t
group by
    t.employee_id,
    t.project_id,
    t.status
having
    COUNT(*) > 1
union all
SELECT 
    t.employee_id,
    t.project_id,
    t.status,
    a.status,
    'criteria#2' as SQL
FROM
    t
    left join t as a on 
        t.employee_id = a.employee_id and
        t.project_id = a.project_id
where 
    t.status != a.status and
    concat(t.status,a.status) != 'REQUIREDREJECTED' and
    concat(t.status,a.status) != 'REJECTEDREQUIRED'
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement