I have a table with data at user-date level i.e.
userID date eventID A 2021-06-01 123 B 2021-06-01 342 C 2021-06-01 23487 A 2021-06-01 234221 D 2021-06-01 34245 A 2021-06-01 09834
I want to randomly pick 20% of the rows and hence used
select * from table where rand() <= (1/5)
Output has 20% of rows from the input table but there are some users who are repeating. I want to ensure that there is only 1 row per user and for users with multiple entries on the same day, the pick is randomized.
Advertisement
Answer
You can use row_number()
for this:
select t.* except (seqnum) from (select t.*, row_number() over (partition by userid order by rand()) as seqnum from t ) t where 1=1 qualify row_number() over (order by seqnum order by rand()) < 0.2 * count(*) over ();
What is this doing? This is randomizing the rows for each user, arbitrarily assigning a sequential number to them.
Then, it is assigning a new ordering by the sequential number. So, the “first” row for each user will be fetched first, and so on. This balances the number of rows for each user.
The qualify()
then chooses 20% of the rows.