How to pick a random sample while ensuring data is unique at the primary key level

Question

I have a table with data at the user-date level, i.e.:

userID  date        eventID
A       2021-06-01  123
B       2021-06-01  342
C       2021-06-01  23487
A       2021-06-01  234221
D       2021-06-01  ...

Accepted Answer

You can use row_number() for this:

    select t.* except (seqnum)
    from (select t.*,
                 -- shuffle each user's rows and number them 1, 2, 3, ...
                 row_number() over (partition by userid order by rand()) as seqnum
          from t
         ) t
    where 1=1
    -- order all rows by seqnum (ties broken randomly) and keep the first ~20%
    qualify row_number() over (order by seqnum, rand()) < 0.2 * count(*) over ();

What is this doing? The subquery randomizes the rows for each user and assigns each row an arbitrary sequential number.

The outer query then assigns a new ordering by that sequential number, breaking ties randomly. So the "first" row for each user is fetched first, then each user's second row, and so on. This balances the number of rows per user.

The qualify clause then keeps about 20% of the rows.
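To make the two ranking steps concrete, here is a minimal Python sketch that mirrors the logic of the query in memory. It is an illustration rather than a replacement for the SQL: the function name balanced_sample, the tuple layout (userID, date, eventID), and the seed parameter are all assumptions introduced for this example.

```python
import random
from collections import defaultdict


def balanced_sample(rows, fraction=0.2, seed=None):
    """Hypothetical helper mirroring the SQL: rows are (user_id, date, event_id) tuples."""
    rng = random.Random(seed)

    # Step 1: group by user, like "partition by userid".
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    # Step 2: shuffle within each user and number the rows 1, 2, 3, ...
    # (the inner row_number() ... order by rand()).
    ranked = []
    for user_rows in by_user.values():
        rng.shuffle(user_rows)
        for seqnum, row in enumerate(user_rows, start=1):
            # The random value acts as the tie-breaker among equal seqnum values.
            ranked.append((seqnum, rng.random(), row))

    # Step 3: order globally by seqnum, then randomly (order by seqnum, rand()).
    ranked.sort(key=lambda item: (item[0], item[1]))

    # Step 4: keep positions strictly below fraction * total, like the qualify filter.
    limit = fraction * len(ranked)
    return [row for pos, (_, _, row) in enumerate(ranked, start=1) if pos < limit]


if __name__ == "__main__":
    demo = [(u, "2021-06-01", i) for i, u in enumerate("ABCAD" * 4)]
    for row in balanced_sample(demo, fraction=0.2, seed=42):
        print(row)
```

Because every user's seqnum 1 row sorts ahead of any seqnum 2 row, a user only appears twice in the sample once the cutoff exceeds the number of distinct users.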

Answer