Skip to content
Advertisement

Hive Most Popular in each group

I have three table

BX-Books.csv
ISBN, Book-Title, Book-Author, Year-Of-Publication, Publisher
BX-Book-Ratings.csv
User-ID ISBN Book-Rating
BX-Users.csv
User-ID Location Age

I have to find most popular Author among each of the following age groups: less than 10 years 10 to 18 years 19 to 35 years 36 to 45 years 46 years and above where most popular author is one who got highest number of ratings >= 6

I have tried below code.

select *, rank() over (order by p.total_rating desc )
from (
    select book_author, sum(t.book_rating) as total_rating, age_range from (
        select
            case
               when bx_user.age <10 then 'Under 10'
               when bx_user.age between 10 and 18 then '10-18'
               when bx_user.age between 19 and 35 then '29-35'
               when bx_user.age between 36 and 45 then '36-45'
               when bx_user.age >45 then '46 and above'
            END as age_range, bx_books.book_author, bx_books_ratings.book_rating as book_rating
        from bx_books join bx_books_ratings on (bx_books.ISBN = bx_books_ratings.ISBN) join bx_user on (bx_user.user_id = bx_books_ratings.user_id) where bx_books_ratings.book_rating >=6)t 
        group by t.book_author,t.age_range limit 1)p limit 1;

but it is only giving me highest rating from all.

Can anyone tell me how to approach this problem as I have used three sub queries i know it is very ugly

I want highest from each age group.

Advertisement

Answer

The main issue with your query is the missing partition by in the row_number() and the limit in the subquery. In addition, you should be counting the books, not summing the ratings:

select aa.*
from (select author, age_range, count(*) as num_books
             row_number() over (partition by age_range order by count(*) desc) as seqnum
      from (select (case when u.age < 10 then 'Under 10'
                         when u.age between 10 and 18 then '10-18'
                         when u.age between 19 and 35 then '29-35'
                         when u.age between 36 and 45 then '36-45'
                         when u.age > 45 then '46 and above'
                    end) as age_range, 
                   b.book_author, b.book_rating
            from bx_books b join
                 bx_books_ratings br
                 on b.ISBN = br.ISBN join
                 bx_user u
                 on u.user_id = br.user_id
            where br.book_rating >= 6
           ) b 
      group by book_author, age_range
     ) aa
where seqnum = 1;

I also introduced table aliases so the query is easier to write and read.

I don’t remember if Hive allows column aliases in the GROUP BY clause. If it does, then one level of subquery can easily be removed.

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement