Skip to content
Advertisement

group by id and random sample by id from two tables in big query sql

I have a 2 tables with the same structure:

table1

id text        var  
1  "bla bla"   100
1  "blabla1"    30
2  "qweweqty"    0
2    etc...
7
3
3
1
..
100

table2

id text        var  
101 "bla bla"   10
101  "bla1"      60
101  "bla"    5
103    etc...
102
103
102
110
..
200

I want to randomly sample data from table1 and table2 based on id. so basically, sample every observation for a random sample of ids from table1 and and every observation from random sample of ids from table2 so that 50 ids are from table 1 and 50 are from table 2 . any idea on how to do this on big query SQL?

en example would be the following where I want to get 3 ids from table1 and 3 from table2

randomly ids 1,2,3 are selected from table1, ids 101, 110 and 103 are selected from table2

the resulting table is then:

id. text var
 1.   ..  ..
 1
 2
 2
 3
 3
 1
 101
 101
 101
 103
 103
 110

so basically any observation from table1 with id 1,2,3 and any observation from table2 with id 101, 103, 110 are selected and put in the same table: so the passages are two: first randomly select a certain number of ids from table1, and a certain number of ids from table2, then I select any observation corresponding to those ids from both tables and I join them in the same table

Advertisement

Answer

If you want 50 ids from each table, then you can limit them using subqueries:

select t1.*
from t1 join
     (select distinct id
      from t1
      order by rand()
     ) ids
     on t1.id = ids.id
union all
select t2.* 
from t2 join
     (select distinct id
      from t2
      order by rand()
     ) ids
     on t2.id = ids.id

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement