I have data like below
x
|-----------|-------|-------|
| grade |lecture| count |
|-----------|-------|-------|
| freshman | eng1 | 3 |
|-----------|-------|-------|
| freshman | eng2 | 4 |
|-----------|-------|-------|
| freshman | eng3 | 5 |
|-----------|-------|-------|
| senior | eng2 | 4 |
|-----------|-------|-------|
| senior | eng3 | 4 |
|-----------|-------|-------|
…and I want to create a map with lecture
as the key and count
as a value.
How can I get an output like below?
|-----------|----------------------------|
| grade | lecture per count |
|-----------|----------------------------|
| freshman | {eng1:3, eng2:4, eng3:5} |
|-----------|----------------------------|
| senior | {eng2:4, eng3:4} |
|-----------|----------------------------|
Advertisement
Answer
If you can live with count
being a string
, you probably be able to use Hive str_to_map()
function to get a desired map
. That will require a couple of preliminary steps, to reformat column values in a way accepted by it. Something like this:
select
grade,
str_to_map(course_list,',',':') lecture_count_map
from (
select
grade,
concat_ws(',',
collect_list(concat_ws(':', lecture, cast(count as string)))
) course_list
from courses
group by grade
) T;
Output:
grade lecture_count_map
1 freshman {"eng1":"3","eng2":"4","eng3":"5"}
2 senior {"eng2":"4","eng3":"5"}
Otherwise, you’re looking at writing your own UDAF or using one of the existing ones built by third-parties, at least until JIRA-4966 is resolved (although those chances are quite low after 7 years).