I have data with the following structure:
CREATE TABLE if not EXISTS scores ( id int, class char, score float ); INSERT INTO scores VALUES (1, 'A', 0.5), (1, 'B', 0.2), (1, 'C', 0.1), (2, 'A', 0.1), (2, 'B', 0.2), (3, 'D', 0.01), (4, 'A', 0.5), (4, 'B', 0.5);
I want to randomly sample, for each id, a class. A possible sample obtained would be:
1,'A' 2,'B' 3,'D' 4,'A'
The logic I want to use for the sample is the following: each class is sampled proportionally to its score. For instance:
- In
id = 1
, sampling class ‘B’ should be twice as likely to be sampled than class ‘C’. - In
id = 2
, sampling class ‘B’ should be twice as likely to be sampled than class ‘A’. - In
id = 3
, we should only sample class ‘D’. - In
id = 4
, sampling class ‘B’ should be as likely as sampling class ‘A’.
I’m looking for ways of doing this in BigQuery/PostgreSQL. Also, are there solutions with a fixed random seed that can be reproduced?
Thanks!
Advertisement
Answer
Possible approach is to generate for each ‘id-class’ pair a number of rows equivalent to the score (50 ‘1-A’ rows, 20 ‘1-B’ rows, 10 ‘1-C’ rows, etc…) and then select 1 row randomly per id.
For BigQuery:
select id, array_agg(class order by rand() limit 1)[offset(0)] from scores, unnest(generate_array(1, score * 100)) group by id