Skip to content
Advertisement

Filter duplicate rows in Postgres based on conditions between those rows

Given a table

CREATE TABLE data(
 irs_number VARCHAR (50),
 mop_up INTEGER,
 ou VARCHAR (50)
);

How would I return all matching records that…

  • have at least one identical value for irs_number in another row AND
  • at least one mop_up of those with the same irs_number must be set to 1 AND
  • the ou values are not identical, i.e. only return those not matching to a row with the identical irs_number.

… so that all irs_numbers would be returned (not only one where the conditions are true – see example below).

I tried this but the query cannot finish within reasonable time:

SELECT irs_number, mop_up, ou
FROM data outer_data
WHERE (SELECT count(*)
FROM data inner_data
WHERE inner_data.irs_number = outer_data.irs_number
AND inner_data.mop_up = 1 OR outer_data.mop_up = 1
AND inner_data.ou <> outer_data.ou
);

As well as variations of duplicate counts as described here: How to find duplicate records in PostgreSQL – they will always just return the duplicates but not the proper filter applied.


edit:

Example data:

INSERT INTO data VALUES 
('0001', 1, 'abc'),
('0001', 0, 'abc'),
('0001', 0, 'cde'),
('0001', 0, 'abc'),
('0002', 1, 'abc'),
('0002', 0, 'abc'),
('0003', 0, 'abc'),
('0003', 0, 'xyz')
;

SQLFiddle: http://sqlfiddle.com/#!17/be28f

a query should ideally return:

irs_number  mop_up  ou
-----------------------
0001        1       abc
0001        0       abc
0001        0       cde
0001        0       abc

(order not important) meaning it should return all rows matching having the irs_number with the condition above.

Advertisement

Answer

You should be able to do this with a simple exists clause:

SELECT irs_number, mop_up, ou
FROM data d
WHERE EXISTS (SELECT 1
              FROM data d2
              WHERE d2.irs_number = d.irs_number AND
                    d2.mop_up = 1 AND
                    d2.ou <> d.ou
             );

EDIT:

The above misinterpreted the question. It assumed that a mop_up = 1 needed to be on a different ou. As I read the question, this is ambiguous but doesn’t appear to be what you want. So, two exists address this:

SELECT irs_number, mop_up, ou
FROM data d
WHERE EXISTS (SELECT 1
              FROM data d2
              WHERE d2.irs_number = d.irs_number AND
                    d2.mop_up = 1
             ) AND
     EXISTS (SELECT 1
              FROM data d2
              WHERE d2.irs_number = d.irs_number AND
                    d2.ou <> d.ou
             );

Here is a db<>fiddle.

Both these solutions will be able to take advantage of an index on (irs_number, mop_up, ou).

Advertisement