I have a table which is structured like the following:
ID    Day         Value1    Value2
10    20200601    ABC       100
10    20200602    ABC       100
10    20200603    CDE       200
10    20200604    CDE       100
20    20200601    ABC       50
20    20200602    ABC       100
20    20200603    ABC       100
20    20200604    ABC       100
Is there a way to build a SQL query which, for each ID, finds the Day on which Value1 or Value2 has changed?
The result I would like to achieve is this:
ID    Day         Value1      Value2
10    20200603    ABC, CDE    100, 200
10    20200604    CDE         200, 100
20    20200602    ABC         50, 100
This way I can keep track of those changes per ID per Day.
Edit: I’m accessing this data on a Hadoop cluster via PySpark-SQL
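For reference, a minimal sketch to reproduce this table as a DataFrame (assuming an active SparkSession named spark; the column types, Day as string and Value2 as integer, are my assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the table above
data = [
    (10, '20200601', 'ABC', 100),
    (10, '20200602', 'ABC', 100),
    (10, '20200603', 'CDE', 200),
    (10, '20200604', 'CDE', 100),
    (20, '20200601', 'ABC', 50),
    (20, '20200602', 'ABC', 100),
    (20, '20200603', 'ABC', 100),
    (20, '20200604', 'ABC', 100),
]
df = spark.createDataFrame(data, ['ID', 'Day', 'Value1', 'Value2'])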
Answer
IIUC, you can create a window partitioned by ID and ordered by Day to get the desired output:
from pyspark.sql import Window
import pyspark.sql.functions as f

# window spanning the previous row and the current row, per ID, ordered by Day
window = Window.partitionBy('ID').orderBy('Day').rowsBetween(Window.currentRow - 1, Window.currentRow)

df = df.select(
    'ID',
    'Day',
    f.collect_set('Value1').over(window).alias('value1'),
    f.collect_set('Value2').over(window).alias('value2'),
).filter((f.size('value1') > 1) | (f.size('value2') > 1))

df.show()

+---+--------+----------+----------+
| ID|     Day|    value1|    value2|
+---+--------+----------+----------+
| 20|20200602|     [ABC]| [100, 50]|
| 10|20200603|[CDE, ABC]|[100, 200]|
| 10|20200604|     [CDE]|[100, 200]|
+---+--------+----------+----------+
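Note that collect_set returns an unordered collection, so the old/new order inside value1 and value2 is not guaranteed. If you need exactly the "old, new" layout from your expected output, a lag()-based variant might work as well; this is a sketch assuming the same df as above and that Value2 can be cast to a string for display:

from pyspark.sql import Window
import pyspark.sql.functions as f

w = Window.partitionBy('ID').orderBy('Day')

changes = (
    df
    .withColumn('prev1', f.lag('Value1').over(w))
    .withColumn('prev2', f.lag('Value2').over(w))
    # keep only rows where at least one value differs from the previous Day
    .filter((f.col('prev1') != f.col('Value1')) | (f.col('prev2') != f.col('Value2')))
    .select(
        'ID', 'Day',
        # show "old, new" only for the column that actually changed
        f.when(f.col('prev1') != f.col('Value1'),
               f.concat_ws(', ', 'prev1', 'Value1'))
         .otherwise(f.col('Value1')).alias('Value1'),
        f.when(f.col('prev2') != f.col('Value2'),
               f.concat_ws(', ', f.col('prev2').cast('string'), f.col('Value2').cast('string')))
         .otherwise(f.col('Value2').cast('string')).alias('Value2'),
    )
)
changes.show()

The first Day of each ID drops out automatically, because lag() returns null there and the inequality comparison evaluates to null in the filter.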