Skip to content
Advertisement

Finding unique number of IDs in multiple groups

I have a dataset that has doctors and the various practices they work in. Each doctor in my dataset works in at least 1 practice but as many as 17 different practices. I would like to know the unique number of doctors each one works with. The current dataset is in SAS, but I am familiar with Python, Pandas, and SQL. I am fine converting the data into whichever format needed so the answer does not need to be in SAS code.

Example dataset below. This sample shows that Doctor A is in practices, P1, P3, and P5. Doctor E is in practices P1, P2, and P5,etc.

enter image description here

From this chart I would want a new column with the total number of unique doctors each one works with. In this case Doctor A works with 2 other doctors (E & D.) However, if I simply grouped by doctor and summed, I find that Doctor A works with 6 Doctors. However this is wrong because it would count Doctor A 3 times (once for each practice he is listed in) AND it would count Doctor E twice (he is in two group practices with Doctor A, P1 & P5)

I have ~ 800,000 doctors with ~400,000 group practices making manual methods unfeasible. Does any one have any suggestions on how to get this started?

The final output would look:

enter image description here

sample data set code (for SAS)

input doctor $ tot_in_group group_practices $;
datalines;
A 2 P1
E 2 P1
C 3 P2
B 3 P2
E 3 P2
A 2 P3
D 2 P3

E 2 P5
A 2 P5
;
run;

Advertisement

Answer

A self join within group excluding self-pairings will generate a table of all pairings for each group. Use that concept as the basis for counting the distinct ‘partner’s for each doctor over all groups.

For true uniqueness, be sure you are using a doctorId distinct to each individual. Attempting to prevent a ‘self-pair’ based on a name is asking for trouble. (Consider a fictitious group with doctors Dewey, Dewey, Dewey, Dewey and Dewey — yeah trouble)

data have;
input doctor $ group $;
datalines;
A P1
E P1
C P2
B P2
E P2
A P3
D P3
E P3
E P5
A P5
;
run;

proc sql;
  * demonstrate the combinatoric effect of who (P) paired with whom (Q) within group;
  * do not submit against the big data;

  create table works_with_each as
  select 
    P.doctor as P
  , Q.doctor as Q
  , P.group 
  from have as P
  join have as Q
    on P.group = Q.group
     & P.doctor ^= Q.doctor
  order by 
   P.doctor, Q.doctor, P.group
  ; 

  * count the distinct pairing, regardless of group;

  create table works_with_counts as
  select 
    P.doctor as P
  , count(distinct Q.doctor) as unique_work_with_count
  from have as P
  join have as Q
    on P.group = Q.group
     & P.doctor ^= Q.doctor
  group by 
   P.doctor
  order by 
   P.doctor
  ; 

Each
Each works with pair

Unique other in Pair (Works with) Counts
Works with Counts

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement