Sample from groups proportional to score

I have data with the following structure:

CREATE TABLE if not EXISTS scores (
  id int,
  class char,
  score float
);

INSERT INTO scores VALUES
(1, 'A', 0.5),
(1, 'B', 0.2),
(1, 'C', 0.1),
(2, 'A', 0.1),
(2, 'B', 0.2),
(3, 'D', 0.01),
(4, 'A', 0.5),
(4, 'B', 0.5);

​x
 
CREATE TABLE if not EXISTS scores (  id int,  class char,  score float);​INSERT INTO scores VALUES(1, 'A', 0.5),(1, 'B', 0.2),(1, 'C', 0.1),(2, 'A', 0.1),(2, 'B', 0.2),(3, 'D', 0.01),(4, 'A', 0.5),(4, 'B', 0.5);​

I want to randomly sample, for each id, a class. A possible sample obtained would be:

1,'A'
2,'B'
3,'D'
4,'A'

 
1,'A'2,'B'3,'D'4,'A'​

The logic I want to use for the sample is the following: each class is sampled proportionally to its score. For instance:

In id = 1, sampling class ‘B’ should be twice as likely to be sampled than class ‘C’.
In id = 2, sampling class ‘B’ should be twice as likely to be sampled than class ‘A’.
In id = 3, we should only sample class ‘D’.
In id = 4, sampling class ‘B’ should be as likely as sampling class ‘A’.

I’m looking for ways of doing this in BigQuery/PostgreSQL. Also, are there solutions with a fixed random seed that can be reproduced?

Thanks!

Answer

Possible approach is to generate for each ‘id-class’ pair a number of rows equivalent to the score (50 ‘1-A’ rows, 20 ‘1-B’ rows, 10 ‘1-C’ rows, etc…) and then select 1 row randomly per id.

For BigQuery:

select id, array_agg(class order by rand() limit 1)[offset(0)]
from scores, unnest(generate_array(1, score * 100))
group by id

 
select id, array_agg(class order by rand() limit 1)[offset(0)]from scores, unnest(generate_array(1, score * 100))group by id​

Advertisement

Answer