Suppose I have a dataset like this I would like to randomly select, say, 3 values from the distinct ID values. One possibility is to get a table like this How shall I do that in Hive? Answer Here is one option using a join and rand(): The subquery randomly selects 3 ids, then the outer query brings all relate…

Hive: randomly select N values from distinct value…

Suppose I have a dataset like this

|-----------------|----------------|
|    ID           |     Values     |
|-----------------|----------------|
|     123         |     aaaa       |
|-----------------|----------------|
|    234          |    bbb         |
|-----------------|----------------|
|     123         |     ab3d       |
|-----------------|----------------|
|    264          |     34g3ff     |
|-----------------|----------------|
|     783         |     341g5h     |
|-----------------|----------------|
|    921          |     7jdfew     |
|-----------------|----------------|
|     264         |     53fj       |
|-----------------|----------------|

​x
 
|-----------------|----------------||    ID           |     Values     ||-----------------|----------------||     123         |     aaaa       ||-----------------|----------------||    234          |    bbb         ||-----------------|----------------||     123         |     ab3d       ||-----------------|----------------||    264          |     34g3ff     ||-----------------|----------------||     783         |     341g5h     ||-----------------|----------------||    921          |     7jdfew     ||-----------------|----------------||     264         |     53fj       ||-----------------|----------------|​

I would like to randomly select, say, 3 values from the distinct ID values. One possibility is to get a table like this

|-----------------|----------------|
|    ID           |     Values     |
|-----------------|----------------|
|     123         |     aaaa       |
|-----------------|----------------|
|     123         |     ab3d       |
|-----------------|----------------|
|     783         |     341g5h     |
|-----------------|----------------|
|    921          |     7jdfew     |
|-----------------|----------------|

 
|-----------------|----------------||    ID           |     Values     ||-----------------|----------------||     123         |     aaaa       ||-----------------|----------------||     123         |     ab3d       ||-----------------|----------------||     783         |     341g5h     ||-----------------|----------------||    921          |     7jdfew     ||-----------------|----------------|​

How shall I do that in Hive?

Answer

Here is one option using a join and rand():

select id, values
from mytable t
inner join (select distinct id from mytable order by rand() limit 3) i on i.id = t.id

 
select id, valuesfrom mytable tinner join (select distinct id from mytable order by rand() limit 3) i on i.id = t.id​

The subquery randomly selects 3 ids, then the outer query brings all related rows.

Hive: randomly select N values from distinct values of one column

Advertisement

Answer