I have two datatset, A and B.(A is a big dataset like personal info and B is like a small summarized demographic statistic data) I want to append summary info to A data with demographic info.
As one single row of B can matches with multiple row of A, is it right to send a hive ql like A left outer join B
?
For example, A is
x
|---------|---------|
|age_range| region |
|---------|---------|
|14-20 | 1 |
|---------|---------|
|14-20 | 1 |
|---------|---------|
|20-25 | 1 |
|---------|---------|
|20-25 | 2 |
|---------|---------|
and B is
|---------|--------|--------|
|age_range| region | ratio |
|---------|--------|--------|
|14-20 | 1 | 0.1 |
|---------|--------|--------|
|20-25 | 1 | 0.2 |
|---------|--------|--------|
Advertisement
Answer
From what you describe a join
is appropriate. If all rows in a
have a matching row in b
, then an inner join is sufficient:
select . . .
from a join
b
on <matching columns>;
You would use left join
when a
has rows that do not match in b
and you want them in the result set.