Skip to content
Advertisement

How do I find first value in every last 3 months in Hive

I have a table like below.

+-------+-------+--------------+---------------+
| Col_1 | Col_2 | Refresh_Date | Refresh_Value |
+-------+-------+--------------+---------------+
| AE    | A1    | 2019-12-01   |             1 |
| AE    | A1    | 2020-01-01   |             3 |
| AE    | A1    | 2020-02-01   |             5 |
| AE    | A1    | 2020-03-01   |             7 |
| AE    | A1    | 2020-04-01   |            12 |
| AE    | A1    | 2020-05-01   |            14 |
| AE    | A1    | 2020-06-01   |            11 |
| AE    | A1    | 2020-07-01   |            15 |
+-------+-------+--------------+---------------+

I need to get the first Refresh_value (based on Refresh_date) from last 3 months starting from the last date and there should be 2 additional columns (Group and Refresh_Value_Min) where 1st column will have the first value from every last 3 months and another column will have values which says in which group these dates fall into.

Expected output

+-------+-------+--------------+---------------+-------+-------------------+
| Col_1 | Col_2 | Refresh_Date | Refresh_Value | Group | Refresh_Value_Min |
+-------+-------+--------------+---------------+-------+-------------------+
| AE    | A1    | 2019-12-01   |             1 | Grp3  |                 1 |
| AE    | A1    | 2020-01-01   |             3 | Grp3  |                 1 |
| AE    | A1    | 2020-02-01   |             5 | Grp2  |                 5 |
| AE    | A1    | 2020-03-01   |             7 | Grp2  |                 5 |
| AE    | A1    | 2020-04-01   |            12 | Grp2  |                 5 |
| AE    | A1    | 2020-05-01   |            14 | Grp1  |                14 |
| AE    | A1    | 2020-06-01   |            11 | Grp1  |                14 |
| AE    | A1    | 2020-07-01   |            15 | Grp1  |                14 |
+-------+-------+--------------+---------------+-------+-------------------+

I tried the below code that will give the value of the 3rd last month in the current row, but I need the output as like above.

first_value(Refresh_Value) over (partition by col_1,col_2 order by Refresh_Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)

Can someone help in this.

Please let me know if there are any questions.

Advertisement

Answer

Let me explain the approach (tiny details might differ):

  1. Get last date in each row
with data_with_last_dt as (
  select Col_1, Col_2, Refresh_Date, Refresh_Value,
         max(Refresh_Date) over (partition by Col_1, Col_2) as Last_Date
    from target_table
),
  1. Get the difference in months and divide it by 3 (integer division) — you’ll get the group number
data_with_group as (
  select Col_1, Col_2, Refresh_Date, Refresh_Value,
         cast(months_between(Last_Date, Refresh_Date) as int) / 3 as Group_Id
    from data_with_last_dt 
)
  1. Find the first Refresh_Value within each group:
select Col_1, Col_2, Refresh_Date, Refresh_Value, Group_Id,
       min(Refresh_Value) over(partition by Col_1, Col_2, Group_Id order by Refresh_Date) as Refresh_Value_Min 
  from data_with_group 
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement