Skip to content
Advertisement

How to pick random sample while ensuring data is unique at primary key level

I have a table with data at user-date level i.e.

userID date        eventID
A      2021-06-01   123
B      2021-06-01   342
C      2021-06-01   23487
A      2021-06-01   234221
D      2021-06-01   34245
A      2021-06-01   09834

I want to randomly pick 20% of the rows and hence used

select *
from table
where rand() <= (1/5)

Output has 20% of rows from the input table but there are some users who are repeating. I want to ensure that there is only 1 row per user and for users with multiple entries on the same day, the pick is randomized.

Advertisement

Answer

You can use row_number() for this:

select t.* except (seqnum)
from (select t.*,
             row_number() over (partition by userid order by rand()) as seqnum
      from t
     ) t
where 1=1
qualify row_number() over (order by seqnum order by rand()) < 0.2 * count(*) over ();

What is this doing? This is randomizing the rows for each user, arbitrarily assigning a sequential number to them.

Then, it is assigning a new ordering by the sequential number. So, the “first” row for each user will be fetched first, and so on. This balances the number of rows for each user.

The qualify() then chooses 20% of the rows.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement