Hive Most Popular in each group

Question

I have three tables:

BX-Books.csv: ISBN, Book-Title, Book-Author, Year-Of-Publication, Publisher
BX-Book-Ratings.csv: User-ID, ISBN, Book-Rating
BX-Users.csv: User-ID, Location, Age

I have to find most …

Accepted Answer

The main issues with your query are the missing partition by in row_number() and the limit in the subquery. In addition, you should be counting the books, not summing the ratings:

    select aa.*
    from (select book_author, age_range, count(*) as num_books,
                 row_number() over (partition by age_range order by count(*) desc) as seqnum
          from (select (case when u.age < 10 then 'Under 10'
                             when u.age between 10 and 18 then '10-18'
                             when u.age between 19 and 35 then '19-35'
                             when u.age between 36 and 45 then '36-45'
                             when u.age > 45 then '46 and above'
                        end) as age_range,
                       b.book_author, br.book_rating
                from bx_books b join
                     bx_books_ratings br
                     on b.ISBN = br.ISBN join
                     bx_user u
                     on u.user_id = br.user_id
                where br.book_rating >= 6
               ) b
          group by book_author, age_range
         ) aa
    where seqnum = 1;

I also introduced table aliases so the query is easier to write and read.

I don't remember if Hive allows column aliases in the GROUP BY clause. If it does, then one level of subquery can easily be removed.
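If you want to check the partition-and-rank logic without a Hive cluster, here is a minimal sketch that runs the same pattern against toy data in SQLite through Python's sqlite3 module. It assumes a SQLite build with window-function support (3.25 or later). The rows are invented for illustration and are not from the BX dataset, the function name top_author_per_age_range is hypothetical, and the aggregation is moved into its own subquery with an author tie-breaker added so the output is deterministic. SQLite's dialect differs from HiveQL, so treat this as a check of the idea rather than of Hive syntax.

```python
import sqlite3

# Toy rows, invented purely for illustration (not from the BX dataset).
BOOKS = [("i1", "Author A"), ("i2", "Author A"), ("i3", "Author B")]
USERS = [(1, 9), (2, 25), (3, 30), (4, 50)]  # (user_id, age)
RATINGS = [  # (user_id, isbn, book_rating)
    (1, "i3", 8),
    (2, "i1", 7),
    (3, "i2", 9),
    (3, "i3", 6),
    (4, "i3", 10),
    (4, "i1", 3),  # below 6, so the where clause drops it
]

QUERY = """
select author, age_range, num_books
from (select author, age_range, num_books,
             row_number() over (partition by age_range
                                order by num_books desc, author) as seqnum
      from (select book_author as author, age_range, count(*) as num_books
            from (select (case when u.age < 10 then 'Under 10'
                               when u.age between 10 and 18 then '10-18'
                               when u.age between 19 and 35 then '19-35'
                               when u.age between 36 and 45 then '36-45'
                               when u.age > 45 then '46 and above'
                          end) as age_range,
                         b.book_author
                  from bx_books b
                  join bx_books_ratings br on b.isbn = br.isbn
                  join bx_user u on u.user_id = br.user_id
                  where br.book_rating >= 6) r
            group by book_author, age_range) g
     ) aa
where seqnum = 1
order by age_range
"""


def top_author_per_age_range():
    """Load the toy rows into an in-memory database and run the ranking query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table bx_books (isbn text, book_author text)")
    conn.execute("create table bx_user (user_id integer, age integer)")
    conn.execute(
        "create table bx_books_ratings "
        "(user_id integer, isbn text, book_rating integer)"
    )
    conn.executemany("insert into bx_books values (?, ?)", BOOKS)
    conn.executemany("insert into bx_user values (?, ?)", USERS)
    conn.executemany("insert into bx_books_ratings values (?, ?, ?)", RATINGS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for author, age_range, num_books in top_author_per_age_range():
        print(age_range, author, num_books)
```

Because seqnum restarts at 1 inside each age_range partition, filtering on seqnum = 1 keeps exactly one author per age group, which a single limit over the whole result cannot do.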


Answer