I have a table with multiple records for each patient.
My end goal is a table that is 1-to-1 between Patient_id
and Value
.
I would like to de-duplicate (in respect to patient_id
) my rows based on “a hierarchical series of aggregate functions” (if someone has a better way to phrase this, I’d appreciate that as well.)
+----+------------+------------+------------+----------+-----------------+-------+ | ID | patient_id | Date | Date2 | Priority | Source | Value | +----+------------+------------+------------+----------+-----------------+-------+ | 1 | 1 | 2017-09-09 | 2018-09-09 | 1 | 'verified' | 55 | | 2 | 1 | 2017-09-09 | 2018-11-11 | 2 | 'verified' | 78 | | 3 | 1 | 2017-11-11 | 2018-09-09 | 3 | 'verified' | 23 | | 4 | 1 | 2017-11-11 | 2018-11-11 | 1 | 'self_reported' | 11 | | 5 | 1 | 2017-09-09 | 2018-09-09 | 2 | 'self_reported' | 90 | | 5 | 1 | 2017-09-09 | 2018-09-09 | 3 | 'self_reported' | 34 | | 6 | 2 | 2017-11-11 | 2018-09-09 | 2 | 'self_reported' | 21 | +----+------------+------------+------------+----------+-----------------+-------+
For each patient_id
, I would like to get the row(s) that has/have the MAX(Date)
. In the case that there are still duplicated patient_id
, I would like to get the row(s) with the MIN(Priority)
. In the case that there are still duplicated rows I would like to get the row(s) with the MIN(Date2)
.
The way I’ve approached this problem is using a series of queries like this to de-duplicate on the columns one at a time.
SELECT * FROM #table t1 LEFT JOIN (SELECT patient_id, MIN(priority) AS min_priority FROM #table GROUP BY patient_id) t2 ON t2.patient_id = t1.patient_id WHERE t2.min_priority = t1.priority
Is there a way to do this that allows me to de-dup on multiple columns at once? Is there a more elegant way to do this?
I’m able to get my results, but my solution feels very inefficient, and I keep running into this. Thank you for any input.
Advertisement
Answer
You could use row_number()
, if your RDBMS supports it:
select ID, patient_id, Date, Date2, Priority, Source, Value from ( select t.*, row_number() over(partition by patient_id order by Date desc, Priority, Date2) rn from mytable t ) where rn = 1
Another option is to filter with a correlated subquery that sorts the record according to your criteria, like so:
select t.* from mytable t where id = ( select id from mytable t1 where t1.patient_id = t.patient_id order by t1.Date desc, t1.Priority, t1.Date2 limit 1 )
The actual syntax for limit
varies accross RDBMS.