Skip to content
Advertisement

SQL Query: Constructing a Control Group

I have two data sets. The first data set contains two (uniquely) identifying characteristics – here ZIP and race – as well as a variable called count. The second data set contains information on individuals – ZIP, race, and some outcome variable. My goal is to construct a subset of the second data set where the number of observations of a certain ZIP/race combination is the count of the first data set. To make it clear:

Data Set #1:

ZIP | race | count |
--------------------
30218 | White  |  59   
30218 | Black  |  23  
30219 | White  | 78  
30219 | Black  | 14  
...

Data Set #2:

id | ZIP | race | outcome
-------------------  
001 | 30218 | White | 23.3  
002 | 30219 | Black | 46.1  
...

The goal is to have a output that returns a subset of data set #2 with 59 white individuals from ZIP 30218, 23 black individuals from ZIP 30218, etc.

Either sample SQL code to use or a general strategy would be helpful. Thank you

Advertisement

Answer

You can use the row_number window function to number the rows by some criteria and then join that to data set 1. Note that I renamed count to n here to avoid using a keyword:

SELECT id, 
       sub.zip, 
       sub.race, 
       sub.outcome 
FROM
  (
    SELECT id, 
           zip, 
           race, 
           outcome, 
           row_number() OVER (partition by zip, race ORDER BY id) -- You can order by whatever you want
    FROM data_set_2
  ) sub
JOIN data_set_1 ON data_set_1.zip = sub.zip 
               AND data_set_1.race = sub.race 
               AND data_set_1.n >= row_number -- this will limit the results
;
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement