Row comparison in table via SQL

I have a table which is structured like the following:

ID        Day             Value1    Value2
10        20200601        ABC       100
10        20200602        ABC       100
10        20200603        CDE       200
10        20200604        CDE       100
20        20200601        ABC       50
20        20200602        ABC       100
20        20200603        ABC       100
20        20200604        ABC       100

Is there a way to build a SQL query which, for each ID, finds the Day on which Value1 or Value2 changed? The result I would like to achieve is this:

ID    Day         Value1      Value2
10    20200603    ABC, CDE    100, 200   
10    20200604    CDE         200, 100
20    20200602    ABC         50, 100

This way I can keep track of those changes per ID and per Day.

Edit: I’m accessing this data on a Hadoop cluster via PySpark-SQL
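
For reference, here is a minimal sketch that recreates the sample data as a DataFrame (it assumes an active SparkSession named spark; the variable name df is only illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question as (ID, Day, Value1, Value2) rows
df = spark.createDataFrame(
    [(10, '20200601', 'ABC', 100), (10, '20200602', 'ABC', 100),
     (10, '20200603', 'CDE', 200), (10, '20200604', 'CDE', 100),
     (20, '20200601', 'ABC', 50),  (20, '20200602', 'ABC', 100),
     (20, '20200603', 'ABC', 100), (20, '20200604', 'ABC', 100)],
    ['ID', 'Day', 'Value1', 'Value2'])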


Answer

IIUC, you can create a window partitioned by ID and ordered by Day that covers the previous and the current row, collect the distinct values over it, and keep only the rows where either column has more than one distinct value:

from pyspark.sql import Window
import pyspark.sql.functions as f

# window spanning the previous and the current row per ID, ordered by Day
window = Window.partitionBy('ID').orderBy('Day').rowsBetween(Window.currentRow - 1, Window.currentRow)

# collect the distinct values seen in that window; a set of size > 1 means the value changed
df = df.select('ID', 'Day', f.collect_set('Value1').over(window).alias('value1'), f.collect_set('Value2').over(window).alias('value2')).filter((f.size('value1') > 1) | (f.size('value2') > 1))

df.show()

+---+--------+----------+----------+
| ID|     Day|    value1|    value2|
+---+--------+----------+----------+
| 20|20200602|     [ABC]| [100, 50]|
| 10|20200603|[CDE, ABC]|[100, 200]|
| 10|20200604|     [CDE]|[100, 200]|
+---+--------+----------+----------+
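
Since the question asks for SQL specifically, roughly the same change detection can be expressed in Spark SQL with the LAG window function. This is only a sketch, assuming the original table is registered as a temporary view named my_table (an illustrative name):

spark.sql("""
    SELECT ID, Day, Value1, Value2
    FROM (
        SELECT ID, Day, Value1, Value2,
               LAG(Value1) OVER (PARTITION BY ID ORDER BY Day) AS prev1,
               LAG(Value2) OVER (PARTITION BY ID ORDER BY Day) AS prev2
        FROM my_table
    ) t
    -- keep only rows where either value differs from the previous day's value
    WHERE prev1 IS NOT NULL AND (Value1 <> prev1 OR Value2 <> prev2)
""").show()

Unlike the collect_set version, this keeps only the new values per row; select prev1 and prev2 as well if the previous values are also needed in the output.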