I have data in Hive that has two columns of interest. The first is a column (int) that represents a date (YYYYMM) and the second is a column (int) that represents a number of people for that date.
date people 202107 10 202001 2 202002 4 202003 6 201901 3 201902 5 201903 7 201809 11
I’m trying to write a query that sums up the number of people for comparison across two quarters. For example, I want the total of people that are between 202001 and 202003 vs. the total number of people between 201901 and 201903.
The result I would want is:
"2020 Q1" 12 "2019 Q1" 15
I’ve tried a couple of solutions like using a SELECT CASE however when I’m trying to do a group by you can’t use an alias like MySQL in Hive. I’m a little stuck on this and would appreciate some help.
Advertisement
Answer
You can use below SQL. It calculates quarter on the fly and then group by it.
Select CONCAT(SUBSTR(DATE,1,4),' Q', CAST( quarter(from_unixtime(unix_timestamp(`DATE`, 'yyyyMM'))) AS STRING)), SUM(PEOPLE) GROUP BY CONCAT(SUBSTR(DATE,1,4),' Q', CAST( quarter(from_unixtime(unix_timestamp(`DATE`, 'yyyyMM'))) AS STRING)) ORDER BY 1