Skip to content
Advertisement

How to de-duplicate SQL table rows by multiple columns with hierarchy?

I have a table with multiple records for each patient.

My end goal is a table that is 1-to-1 between Patient_id and Value.

I would like to de-duplicate (in respect to patient_id) my rows based on “a hierarchical series of aggregate functions” (if someone has a better way to phrase this, I’d appreciate that as well.)

+----+------------+------------+------------+----------+-----------------+-------+
| ID | patient_id |    Date    |   Date2    | Priority |     Source      | Value |
+----+------------+------------+------------+----------+-----------------+-------+
|  1 |          1 | 2017-09-09 | 2018-09-09 |        1 | 'verified'      |    55 |
|  2 |          1 | 2017-09-09 | 2018-11-11 |        2 | 'verified'      |    78 |
|  3 |          1 | 2017-11-11 | 2018-09-09 |        3 | 'verified'      |    23 |
|  4 |          1 | 2017-11-11 | 2018-11-11 |        1 | 'self_reported' |    11 |
|  5 |          1 | 2017-09-09 | 2018-09-09 |        2 | 'self_reported' |    90 |
|  5 |          1 | 2017-09-09 | 2018-09-09 |        3 | 'self_reported' |    34 |
|  6 |          2 | 2017-11-11 | 2018-09-09 |        2 | 'self_reported' |    21 |
+----+------------+------------+------------+----------+-----------------+-------+

For each patient_id, I would like to get the row(s) that has/have the MAX(Date). In the case that there are still duplicated patient_id, I would like to get the row(s) with the MIN(Priority). In the case that there are still duplicated rows I would like to get the row(s) with the MIN(Date2).

The way I’ve approached this problem is using a series of queries like this to de-duplicate on the columns one at a time.

SELECT *
FROM #table t1
LEFT JOIN
    (SELECT 
        patient_id, 
        MIN(priority) AS min_priority
    FROM #table
    GROUP BY patient_id) t2 ON t2.patient_id = t1.patient_id
WHERE t2.min_priority = t1.priority

Is there a way to do this that allows me to de-dup on multiple columns at once? Is there a more elegant way to do this?

I’m able to get my results, but my solution feels very inefficient, and I keep running into this. Thank you for any input.

Advertisement

Answer

You could use row_number(), if your RDBMS supports it:

select  ID, patient_id, Date, Date2, Priority, Source, Value
from (
    select
        t.*,
        row_number() over(partition by patient_id order by Date desc, Priority, Date2) rn
    from mytable t
) where rn = 1

Another option is to filter with a correlated subquery that sorts the record according to your criteria, like so:

select t.*
from mytable t
where id = (
    select id 
    from mytable t1
    where t1.patient_id = t.patient_id
    order by t1.Date desc, t1.Priority, t1.Date2
    limit 1
)

The actual syntax for limit varies accross RDBMS.

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement