I’m working with patient-level data in Azure Databricks and I’m trying to build out a cohort of patients that have at least 2 diagnoses from a list of specific diagnosis codes. This is essentially what the table looks like:
CLAIM_ID | PTNT_ID | ICD_CD | DATE ---------+---------+--------+------------ 1 101 2500 01_25_2020 2 101 3850 03_13_2018 3 222 2500 10_26_2018 4 222 8888 11_30_2018 5 222 9155 04_01_2019 6 871 2500 02_17_2020 7 871 3200 09_09_2019
The list of ICD_CD
codes of interest is something like [2500, 3850, 8888]. In this case, I would want to return TOTAL UNIQUE PTNT_ID = 2. These would be PTNT_ID
= (101, 222) as these are the only two patients that have at least 2 ICD_CD
codes of interest.
When I use something like this, I’m able to return all of the relevant PTNT_ID
values, but I’m not able to get the total count of these PTNT_ID
:
select mc.PTNT_ID from MEDICAL_CLAIMS mc where mc.PTNT_ID in ( # list of ICD_CD of interest ) group by mc.PTNT_ID having count(distinct mc.PTNT) >= 2
When I try to add a COUNT statement in, it returns an error
Advertisement
Answer
Just select from the query:
select count(*) from ( select mc.PTNT_ID from MEDICAL_CLAIMS mc where mc.PTNT_ID in ( # list of ICD_CD of interest ) group by mc.PTNT_ID having count(distinct mc.PTNT) >= 2 ) ptnts;