I have a table in the form
ID | DOC ------------- id1 | d1 ------------- id1 | d2 ------------- id2 | d3 ------------- id2 | d4 ------------- id3 | d5 -------------
The goal is to group the table by ID and for each group, select a random number from the number of groups (in this case, select a random number from [1, 3]) and assign all rows of each group one number. One possible configuration is
ID | DOC | GROUP_NUM -------------------------- id1 | d1 | 2 -------------------------- id1 | d2 | 2 -------------------------- id2 | d3 | 1 -------------------------- id2 | d4 | 1 -------------------------- id3 | d5 | 3 --------------------------
I was thinking of using ROW_NUMBER() and PARTITION() functions. What is a better way to go about it considering the table in Bigquery is quite big?
Advertisement
Answer
If the random number can be sequential, you can use dense_rank()
:
select t.*, dense_rank() over (order by id) as group_num from t;
Or for a bit more randomness:
select t.*, dense_rank() over (order by farm_fingerprint(cast(id as string)), id) as group_num from t;
Alternatively, a separate calculation by id
might be simplest:
select * from t join (select id, dense_rank() over (order by rand()) as group_num from t group by id ) tt using (id)