Skip to content
Advertisement

How to transform data into a map using group by in Hive SQL?

I have data like below

|-----------|-------|-------|
|   grade   |lecture| count |
|-----------|-------|-------|
|  freshman | eng1  |   3   |
|-----------|-------|-------|
|  freshman | eng2  |   4   |
|-----------|-------|-------|
|  freshman | eng3  |   5   |
|-----------|-------|-------|
|  senior   | eng2  |   4   |
|-----------|-------|-------|
|  senior   | eng3  |   4   |
|-----------|-------|-------|

…and I want to create a map with lecture as the key and count as a value. How can I get an output like below?

|-----------|----------------------------|
|   grade   |    lecture per count       |
|-----------|----------------------------|
|  freshman | {eng1:3, eng2:4, eng3:5}   |
|-----------|----------------------------|
|  senior   | {eng2:4, eng3:4}           |
|-----------|----------------------------|

Advertisement

Answer

If you can live with count being a string, you probably be able to use Hive str_to_map() function to get a desired map. That will require a couple of preliminary steps, to reformat column values in a way accepted by it. Something like this:

select 
  grade, 
  str_to_map(course_list,',',':') lecture_count_map
from (
  select 
    grade, 
    concat_ws(',',
      collect_list(concat_ws(':', lecture, cast(count as string)))
    ) course_list
  from courses
  group by grade
) T;

Output:

    grade       lecture_count_map
1   freshman    {"eng1":"3","eng2":"4","eng3":"5"}
2   senior      {"eng2":"4","eng3":"5"}

Otherwise, you’re looking at writing your own UDAF or using one of the existing ones built by third-parties, at least until JIRA-4966 is resolved (although those chances are quite low after 7 years).

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement