Skip to content
Advertisement

How to Select ID’s in SQL (Databricks) in which at least 2 items from a list are present

I’m working with patient-level data in Azure Databricks and I’m trying to build out a cohort of patients that have at least 2 diagnoses from a list of specific diagnosis codes. This is essentially what the table looks like:

CLAIM_ID | PTNT_ID | ICD_CD | DATE
---------+---------+--------+------------
1           101      2500     01_25_2020
2           101      3850     03_13_2018  
3           222      2500     10_26_2018
4           222      8888     11_30_2018
5           222      9155     04_01_2019
6           871      2500     02_17_2020
7           871      3200     09_09_2019

The list of ICD_CD codes of interest is something like [2500, 3850, 8888]. In this case, I would want to return TOTAL UNIQUE PTNT_ID = 2. These would be PTNT_ID = (101, 222) as these are the only two patients that have at least 2 ICD_CD codes of interest.

When I use something like this, I’m able to return all of the relevant PTNT_ID values, but I’m not able to get the total count of these PTNT_ID:

select mc.PTNT_ID
from MEDICAL_CLAIMS mc
where mc.PTNT_ID in ( # list of ICD_CD of interest
)
group by mc.PTNT_ID
having count(distinct mc.PTNT) >= 2

When I try to add a COUNT statement in, it returns an error

Advertisement

Answer

Just select from the query:

select count(*)
from
(
  select mc.PTNT_ID
  from MEDICAL_CLAIMS mc
  where mc.PTNT_ID in ( # list of ICD_CD of interest )
  group by mc.PTNT_ID
  having count(distinct mc.PTNT) >= 2
) ptnts;
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement