Impala: count(distinct) with multiple where statement criteria?

I have a health outcome database where I’d like to be able to query counts of subjects that meet a specific diagnosis string within a certain date range. Dummy data (my_table) would look like this (Date as a timestamp):

subjid   Diagnosis  Date
----------------------------------------
Subj001  Z12345     2019-02-05 00:00:00
Subj001  Z12345     2017-01-10 00:00:00
Subj002  Z12345     2018-08-14 00:00:00
Subj002  Z12345     2014-03-20 00:00:00
Subj002  Z12345     2013-07-23 00:00:00
Subj003  Y56789     2016-08-16 00:00:00

There can be multiple entries for each subject in subjid, each with a corresponding Diagnosis code. So far my query looks like:

select 
    subjid, Diagnosis, Date,
    count(subjid) over (partition by Diagnosis) as count 
from 
    my_table
where 
    Diagnosis in ('Z12345') 
    and diag_date >= '2014-01-01 00:00:00'

However, the issue is that I can’t include a distinct statement within the parens for count, as this returns an error. What I’m looking for is the count of unique subjects that satisfy the Diagnosis code string after a particular date; I’m not interested in how many times the Diagnosis code string occurs for each patient.

Question: is there a way to count the total number of unique occurrences of each given subject that match a particular diagnosis string after a particular date?

Solution needs to be in Impala. Thanks in advance for any suggestions.

Answer

One method is the sum of dense_rank()s:

select subjid, Diagnosis, Date,
       (dense_rank() over (partition by diagnosis order by subjid asc) +
        dense_rank() over (partition by diagnosis order by subjid desc)
       ) as num_subjids
from my_table
where Diagnosis in ('Z12345') and
      diag_date >= '2014-01-01 00:00:00';

Advertisement

Answer