I have a dataset with booking hotels. date_in has format “yyyy-MM-dd”. I need select top 10 the most visited hotel by month.
SELECT top_visits.date_ci, top_visits.hotel_id, top_visits.count_visits FROM ( SELECT date_ci, hotel_id, COUNT(id) AS count_visits, RANK() OVER ( PARTITION BY date_ci, hotel_id ORDER BY COUNT(id) DESC) as rank FROM ( SELECT id, hotel_id, SUBSTRING(my_tab.date_in, 1, 7) as date_ci FROM my_database.my_tab) x ) top_visits GROUP BY date_ci, hotel_id HAVING rank <= 10;
I get the following error:
Error: Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 4:13 Expression not in GROUP BY key ‘hotel_id’
Advertisement
Answer
Move COUNT(id) aggregation into the subquery, add group by:
SELECT top_visits.date_ci, top_visits.hotel_id, top_visits.count_visits FROM ( SELECT date_ci, hotel_id, count_visits, RANK() OVER (PARTITION BY date_ci, hotel_id ORDER BY count_visits DESC) as rank FROM ( SELECT hotel_id, SUBSTRING(my_tab.date_in, 1, 7) as date_ci, COUNT(id) AS count_visits FROM my_database.my_tab GROUP BY hotel_id, SUBSTRING(my_tab.date_in, 1, 7) ) x ) top_visits WHERE rank <= 10;