I have a dataset that has doctors and the various practices they work in. Each doctor in my dataset works in at least 1 practice but as many as 17 different practices. I would like to know the unique number of doctors each one works with. The current dataset is in SAS, but I am familiar with Python, Pandas, and SQL. I am fine converting the data into whichever format needed so the answer does not need to be in SAS code.
Example dataset below. This sample shows that Doctor A is in practices, P1, P3, and P5. Doctor E is in practices P1, P2, and P5,etc.
From this chart I would want a new column with the total number of unique doctors each one works with. In this case Doctor A works with 2 other doctors (E & D.) However, if I simply grouped by doctor and summed, I find that Doctor A works with 6 Doctors. However this is wrong because it would count Doctor A 3 times (once for each practice he is listed in) AND it would count Doctor E twice (he is in two group practices with Doctor A, P1 & P5)
I have ~ 800,000 doctors with ~400,000 group practices making manual methods unfeasible. Does any one have any suggestions on how to get this started?
The final output would look:
sample data set code (for SAS)
input doctor $ tot_in_group group_practices $; datalines; A 2 P1 E 2 P1 C 3 P2 B 3 P2 E 3 P2 A 2 P3 D 2 P3 E 2 P5 A 2 P5 ; run;
Advertisement
Answer
A self join within group excluding self-pairings will generate a table of all pairings for each group. Use that concept as the basis for counting the distinct ‘partner’s for each doctor over all groups.
For true uniqueness, be sure you are using a doctorId
distinct to each individual. Attempting to prevent a ‘self-pair’ based on a name is asking for trouble. (Consider a fictitious group with doctors Dewey, Dewey, Dewey, Dewey and Dewey — yeah trouble)
data have; input doctor $ group $; datalines; A P1 E P1 C P2 B P2 E P2 A P3 D P3 E P3 E P5 A P5 ; run; proc sql; * demonstrate the combinatoric effect of who (P) paired with whom (Q) within group; * do not submit against the big data; create table works_with_each as select P.doctor as P , Q.doctor as Q , P.group from have as P join have as Q on P.group = Q.group & P.doctor ^= Q.doctor order by P.doctor, Q.doctor, P.group ; * count the distinct pairing, regardless of group; create table works_with_counts as select P.doctor as P , count(distinct Q.doctor) as unique_work_with_count from have as P join have as Q on P.group = Q.group & P.doctor ^= Q.doctor group by P.doctor order by P.doctor ;