I have three table
BX-Books.csv ISBN, Book-Title, Book-Author, Year-Of-Publication, Publisher BX-Book-Ratings.csv User-ID ISBN Book-Rating BX-Users.csv User-ID Location Age
I have to find most popular Author among each of the following age groups: less than 10 years 10 to 18 years 19 to 35 years 36 to 45 years 46 years and above where most popular author is one who got highest number of ratings >= 6
I have tried below code.
select *, rank() over (order by p.total_rating desc ) from ( select book_author, sum(t.book_rating) as total_rating, age_range from ( select case when bx_user.age <10 then 'Under 10' when bx_user.age between 10 and 18 then '10-18' when bx_user.age between 19 and 35 then '29-35' when bx_user.age between 36 and 45 then '36-45' when bx_user.age >45 then '46 and above' END as age_range, bx_books.book_author, bx_books_ratings.book_rating as book_rating from bx_books join bx_books_ratings on (bx_books.ISBN = bx_books_ratings.ISBN) join bx_user on (bx_user.user_id = bx_books_ratings.user_id) where bx_books_ratings.book_rating >=6)t group by t.book_author,t.age_range limit 1)p limit 1;
but it is only giving me highest rating from all.
Can anyone tell me how to approach this problem as I have used three sub queries i know it is very ugly
I want highest from each age group.
Advertisement
Answer
The main issue with your query is the missing partition by
in the row_number()
and the limit
in the subquery. In addition, you should be counting the books, not summing the ratings:
select aa.* from (select author, age_range, count(*) as num_books row_number() over (partition by age_range order by count(*) desc) as seqnum from (select (case when u.age < 10 then 'Under 10' when u.age between 10 and 18 then '10-18' when u.age between 19 and 35 then '29-35' when u.age between 36 and 45 then '36-45' when u.age > 45 then '46 and above' end) as age_range, b.book_author, b.book_rating from bx_books b join bx_books_ratings br on b.ISBN = br.ISBN join bx_user u on u.user_id = br.user_id where br.book_rating >= 6 ) b group by book_author, age_range ) aa where seqnum = 1;
I also introduced table aliases so the query is easier to write and read.
I don’t remember if Hive allows column aliases in the GROUP BY
clause. If it does, then one level of subquery can easily be removed.