I have a transaction table t1
in Hive that looks like this:
store_id cust_id zip_code transaction_count spend 1000 100 123 3 50 2000 200 456 2 20
I’m trying to compute some metrics using this table as follows:
SELECT store_id, zip_code, SUM(transaction_count) OVER (PARTITION BY store_id, zip_code) / COUNT(*) OVER(PARTITION BY store_id, zip_code) AS avg_transactions, SUM(spend) OVER(PARTITION BY store_id, zip_code)/SUM(transaction_count) OVER(PARTITION BY store_id, zip_code) AS avg_purchase FROM t1 GROUP BY 1, 2
Hive is throwing an error
SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Expression not in GROUP BY key ‘spend’
But the column spend
is being used with an aggregation function (sum), so I cannot group by this column. How can I fix this error?
Advertisement
Answer
Window functions do not need group by
. You either want:
SELECT store_id, zip_code, AVG(transaction_count) OVER(PARTITION BY store_id, zip_code) AS avg_transactions, SUM(spend) OVER(PARTITION BY store_id, zip_code) / SUM(transaction_count) OVER(PARTITION BY store_id, zip_code) AS avg_purchase FROM t1
Note that the first expression can be shortened with AVG()
.
This brings as many rows as in the original table. If, on the other hand, you want just one row per store and zip code, then use regular aggregate functions with GROUP BY
:
SELECT store_id, zip_code, AVG(transaction_count) AS avg_transactions, SUM(spend) / SUM(transaction_count) AS avg_purchase FROM t1 GROUP BY store_id, zip_code