Skip to content
Advertisement

Hive: randomly select N values from distinct values of one column

Suppose I have a dataset like this

|-----------------|----------------|
|    ID           |     Values     |
|-----------------|----------------|
|     123         |     aaaa       |
|-----------------|----------------|
|    234          |    bbb         |
|-----------------|----------------|
|     123         |     ab3d       |
|-----------------|----------------|
|    264          |     34g3ff     |
|-----------------|----------------|
|     783         |     341g5h     |
|-----------------|----------------|
|    921          |     7jdfew     |
|-----------------|----------------|
|     264         |     53fj       |
|-----------------|----------------|

I would like to randomly select, say, 3 values from the distinct ID values. One possibility is to get a table like this

|-----------------|----------------|
|    ID           |     Values     |
|-----------------|----------------|
|     123         |     aaaa       |
|-----------------|----------------|
|     123         |     ab3d       |
|-----------------|----------------|
|     783         |     341g5h     |
|-----------------|----------------|
|    921          |     7jdfew     |
|-----------------|----------------|

How shall I do that in Hive?

Advertisement

Answer

Here is one option using a join and rand():

select id, values
from mytable t
inner join (select distinct id from mytable order by rand() limit 3) i on i.id = t.id

The subquery randomly selects 3 ids, then the outer query brings all related rows.

User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement