Skip to content
Advertisement

Grouping by a Range of Numbers with Aggregate Function

I have data in Hive that has two columns of interest. The first is a column (int) that represents a date (YYYYMM) and the second is a column (int) that represents a number of people for that date.

date    people
202107  10
202001  2
202002  4
202003  6
201901  3
201902  5
201903  7
201809  11

I’m trying to write a query that sums up the number of people for comparison across two quarters. For example, I want the total of people that are between 202001 and 202003 vs. the total number of people between 201901 and 201903.

The result I would want is:

"2020 Q1" 12
"2019 Q1" 15

I’ve tried a couple of solutions like using a SELECT CASE however when I’m trying to do a group by you can’t use an alias like MySQL in Hive. I’m a little stuck on this and would appreciate some help.

Advertisement

Answer

You can use below SQL. It calculates quarter on the fly and then group by it.

Select 
CONCAT(SUBSTR(DATE,1,4),' Q', CAST(
quarter(from_unixtime(unix_timestamp(`DATE`, 'yyyyMM'))) AS STRING)), SUM(PEOPLE)
GROUP BY 
CONCAT(SUBSTR(DATE,1,4),' Q', CAST(
quarter(from_unixtime(unix_timestamp(`DATE`, 'yyyyMM'))) AS STRING))
ORDER BY 1
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement