I have two data sets. The first data set contains two (uniquely) identifying characteristics – here ZIP and race – as well as a variable called count. The second data set contains information on individuals – ZIP, race, and some outcome variable. My goal is to construct a subset of the second data set where the number of observations of a certain ZIP/race combination is the count of the first data set. To make it clear:
Data Set #1:
ZIP | race | count |
--------------------
30218 | White | 59
30218 | Black | 23
30219 | White | 78
30219 | Black | 14
Data Set #2:
id | ZIP | race | outcome
-------------------
001 | 30218 | White | 23.3
002 | 30219 | Black | 46.1
The goal is to have a output that returns a subset of data set #2 with 59 white individuals from ZIP 30218, 23 black individuals from ZIP 30218, etc.
Either sample SQL code to use or a general strategy would be helpful. Thank you
Advertisement
Answer
You can use the row_number window function to number the rows by some criteria and then join that to data set 1. Note that I renamed count to n here to avoid using a keyword:
SELECT id,
sub.zip,
sub.race,
sub.outcome
FROM
(
SELECT id,
zip,
race,
outcome,
row_number() OVER (partition by zip, race ORDER BY id) -- You can order by whatever you want
FROM data_set_2
) sub
JOIN data_set_1 ON data_set_1.zip = sub.zip
AND data_set_1.race = sub.race
AND data_set_1.n >= row_number -- this will limit the results
;