
How to pick a random sample while ensuring data is unique at the primary key level

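I have a table with data at the user-date level: each row holds a user and a date, and a user can have several rows, including more than one on the same day.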

I want to randomly pick 20% of the rows and hence used

The output contains 20% of the rows from the input table, but some users appear more than once. I want to ensure that there is only one row per user, and for users with multiple entries on the same day, the row that is picked should be random.

Answer

You can use row_number() for this:

What is this doing? It randomizes the rows for each user and assigns them sequential numbers in that arbitrary order.

Then, it assigns a new ordering by that sequential number. So, the “first” row for each user is fetched first, then every “second” row, and so on. This balances the number of rows taken from each user.

The QUALIFY clause then keeps the first 20% of the rows in that ordering.
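The steps described above can be tried out with a small, self-contained sketch. This is not the query from the answer. It is an illustration that runs on SQLite through Python's `sqlite3` module, under several assumptions: the table name `user_dates`, the columns `user_id` and `event_date`, and the sample data are all hypothetical. SQLite has no QUALIFY clause, so the final filter is written as a WHERE over a CTE instead, and the `random()` tie-breaker in the second ordering is an addition so that the choice among users is also random.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_dates (user_id INTEGER, event_date TEXT)")

# Hypothetical data: users 1-5 have three rows each (two on the same day),
# users 6-10 have one row each, for 20 rows in total.
rows = [(u, f"2020-01-0{d}") for u in range(1, 6) for d in (1, 1, 2)]
rows += [(u, "2020-01-01") for u in range(6, 11)]
conn.executemany("INSERT INTO user_dates VALUES (?, ?)", rows)

query = """
WITH per_user AS (
    -- Step 1: shuffle each user's rows and number them 1, 2, 3, ...
    SELECT user_id, event_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY random()) AS seqnum
    FROM user_dates
),
ranked AS (
    -- Step 2: order all rows by that number, so every user's "first" row
    -- comes before any user's "second" row.
    SELECT user_id, event_date,
           ROW_NUMBER() OVER (ORDER BY seqnum, random()) AS overall_rank,
           COUNT(*) OVER () AS total_rows
    FROM per_user
)
-- Step 3: keep the first 20% of rows in that ordering
-- (the role QUALIFY plays in dialects that support it).
SELECT user_id, event_date
FROM ranked
WHERE overall_rank <= 0.2 * total_rows
"""

sample = conn.execute(query).fetchall()
sampled_users = [user_id for user_id, _ in sample]
print(len(sample), "rows sampled for users", sorted(sampled_users))
```

Note that this approach returns one row per user only while 20% of the row count does not exceed the number of distinct users. Once every user's “first” row has been taken, the remaining slots are filled with “second” rows, and users start to repeat.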

User contributions licensed under: CC BY-SA