Suppose I have table 1 below, how can I select the values from table 1 with the specified probabilities, where each probability is the chance of the respective value getting selected?
Table 1: Group Value Probability A 1 5% A 10 5% A 50 20% A 30 70% B 5 5% B 25 70% B 100 25%
A possible outcome is (assuming 30 and 25 are selected simply because of their higher probabilities):
Table 2: Group Value A 30 B 25
I’m trying to solve this on Snowflake and have not been able to through various methods, including partitioning the values and comparing their ranks, as well as using the Uniform function to create random probabilities. Not sure if there’s a more elegant way to do a sampling and partition by Group. The end goal is to have the Value field in Table 1 deduplicated, so that each value is given a chance of getting selected based on their probabilities.
Advertisement
Answer
- Give each group a consecutive range. For example, for 15%, the range will be between 30 and 45.
- Pick a random number between 0 and 100.
- Find in which range that random number falls:
create or replace temp table probs as select 'a' id, 1 value, 20 prob union all select 'a', 2, 30 union all select 'a', 3, 40 union all select 'a', 4, 10 union all select 'b', 1, 5 union all select 'b', 2, 7 union all select 'b', 3, 8 union all select 'b', 4, 80; with calculated_ranges as ( select *, range_prob2-prob range_prob1 from ( select *, sum(prob) over(partition by id order by prob) range_prob2 from probs ) ) select id, random_draw, value, prob from ( select id, any_value(uniform(0, 100, random())) random_draw from probs group by id ) a join calculated_ranges b using (id) where range_prob1<=random_draw and range_prob2>random_draw ;