Suppose I have a dataset like this
x
|-----------------|----------------|
| ID | Values |
|-----------------|----------------|
| 123 | aaaa |
|-----------------|----------------|
| 234 | bbb |
|-----------------|----------------|
| 123 | ab3d |
|-----------------|----------------|
| 264 | 34g3ff |
|-----------------|----------------|
| 783 | 341g5h |
|-----------------|----------------|
| 921 | 7jdfew |
|-----------------|----------------|
| 264 | 53fj |
|-----------------|----------------|
I would like to randomly select, say, 3 values from the distinct ID values. One possibility is to get a table like this
|-----------------|----------------|
| ID | Values |
|-----------------|----------------|
| 123 | aaaa |
|-----------------|----------------|
| 123 | ab3d |
|-----------------|----------------|
| 783 | 341g5h |
|-----------------|----------------|
| 921 | 7jdfew |
|-----------------|----------------|
How shall I do that in Hive?
Advertisement
Answer
Here is one option using a join and rand()
:
select id, values
from mytable t
inner join (select distinct id from mytable order by rand() limit 3) i on i.id = t.id
The subquery randomly selects 3 id
s, then the outer query brings all related rows.