SQL Spark – Lag vs first row by Group

I’m SQL newbie and I’m trying to calculate difference between the averages. I want for each item and year calculate difference between months, but I want always substract current average – fist month of this year/item.
So for example, Item A, 2020, month =3 will have output = 0.6 because 3.7 – 3.1 (as this is the value for first month of this item) – is this correct approach? or maybe lag fuction is the wrong solution?

So far I was trying this one, but it calculates the difference vs previous row instead always using first month row.

SELECT item, 
       year, 
       month,  
       average,
       average - COALESCE(LAG(average,1) OVER(
           PARTITION BY item
           ORDER BY year,`month` ),`average`)
AS difference FROM xxx

​x
 
SELECT item,        year,        month,         average,       average - COALESCE(LAG(average,1) OVER(           PARTITION BY item           ORDER BY year,`month` ),`average`)AS difference FROM xxx​

TABLE:

+------+------+-------+---------+--------+
| item | year | month | average | wanted OUTPUT |
+------+------+-------+---------+--------+
| a    | 2020 | 1     | 3.1     | 0      |
+------+------+-------+---------+--------+
| a    | 2020 | 2     | 3.5     | 0.4    |
+------+------+-------+---------+--------+
| a    | 2020 | 3     | 3.7     | 0.6    |
+------+------+-------+---------+--------+
| a    | 2020 | 4     | 4.1     | 1.0    |
+------+------+-------+---------+--------+
| b    | 2020 | 1     | 2.0     | 0      |
+------+------+-------+---------+--------+
| b    | 2020 | 2     | 2.1     | 0.1    |
+------+------+-------+---------+--------+
| b    | 2020 | 3     | 2.5     | 0.5    |
+------+------+-------+---------+--------+
| b    | 2020 | 4     | 4.0     | 2.0    |
+------+------+-------+---------+--------+

 
+------+------+-------+---------+--------+| item | year | month | average | wanted OUTPUT |+------+------+-------+---------+--------+| a    | 2020 | 1     | 3.1     | 0      |+------+------+-------+---------+--------+| a    | 2020 | 2     | 3.5     | 0.4    |+------+------+-------+---------+--------+| a    | 2020 | 3     | 3.7     | 0.6    |+------+------+-------+---------+--------+| a    | 2020 | 4     | 4.1     | 1.0    |+------+------+-------+---------+--------+| b    | 2020 | 1     | 2.0     | 0      |+------+------+-------+---------+--------+| b    | 2020 | 2     | 2.1     | 0.1    |+------+------+-------+---------+--------+| b    | 2020 | 3     | 2.5     | 0.5    |+------+------+-------+---------+--------+| b    | 2020 | 4     | 4.0     | 2.0    |+------+------+-------+---------+--------+​

Answer

If I follow you correctly, you want window function FIRST_VALUE() instead of LAG():

SELECT item, 
       year, 
       month,  
       average,
       average - FIRST_VALUE(average) OVER(PARTITION BY item, year ORDER BY month) AS difference 
FROM xxx

 
SELECT item,        year,        month,         average,       average - FIRST_VALUE(average) OVER(PARTITION BY item, year ORDER BY month) AS difference FROM xxx​

This compares the average of each row against the value of the first month of the same item and year.

Advertisement

Answer