I have two data sets. The first data set contains two (uniquely) identifying characteristics – here ZIP and race – as well as a variable called count. The second data set contains information on individuals – ZIP, race, and some outcome variable. My goal is to construct a subset of the second data set where the number of observations of a certain ZIP/race combination is the count of the first data set. To make it clear:
Data Set #1:
ZIP | race | count | -------------------- 30218 | White | 59 30218 | Black | 23 30219 | White | 78 30219 | Black | 14 ...
Data Set #2:
id | ZIP | race | outcome ------------------- 001 | 30218 | White | 23.3 002 | 30219 | Black | 46.1 ...
The goal is to have a output that returns a subset of data set #2 with 59 white individuals from ZIP 30218, 23 black individuals from ZIP 30218, etc.
Either sample SQL code to use or a general strategy would be helpful. Thank you
Advertisement
Answer
You can use the row_number window function to number the rows by some criteria and then join that to data set 1. Note that I renamed count to n here to avoid using a keyword:
SELECT id, sub.zip, sub.race, sub.outcome FROM ( SELECT id, zip, race, outcome, row_number() OVER (partition by zip, race ORDER BY id) -- You can order by whatever you want FROM data_set_2 ) sub JOIN data_set_1 ON data_set_1.zip = sub.zip AND data_set_1.race = sub.race AND data_set_1.n >= row_number -- this will limit the results ;