Skip to content
Advertisement

Sample from groups proportional to score

I have data with the following structure:

CREATE TABLE if not EXISTS scores (
  id int,
  class char,
  score float
);

INSERT INTO scores VALUES
(1, 'A', 0.5),
(1, 'B', 0.2),
(1, 'C', 0.1),
(2, 'A', 0.1),
(2, 'B', 0.2),
(3, 'D', 0.01),
(4, 'A', 0.5),
(4, 'B', 0.5);

I want to randomly sample, for each id, a class. A possible sample obtained would be:

1,'A'
2,'B'
3,'D'
4,'A'

The logic I want to use for the sample is the following: each class is sampled proportionally to its score. For instance:

  • In id = 1, sampling class ‘B’ should be twice as likely to be sampled than class ‘C’.
  • In id = 2, sampling class ‘B’ should be twice as likely to be sampled than class ‘A’.
  • In id = 3, we should only sample class ‘D’.
  • In id = 4, sampling class ‘B’ should be as likely as sampling class ‘A’.

I’m looking for ways of doing this in BigQuery/PostgreSQL. Also, are there solutions with a fixed random seed that can be reproduced?

Thanks!

Advertisement

Answer

Possible approach is to generate for each ‘id-class’ pair a number of rows equivalent to the score (50 ‘1-A’ rows, 20 ‘1-B’ rows, 10 ‘1-C’ rows, etc…) and then select 1 row randomly per id.

For BigQuery:

select id, array_agg(class order by rand() limit 1)[offset(0)]
from scores, unnest(generate_array(1, score * 100))
group by id
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement