Only show rows in a table if something changed in previous row

Question

I have a table with a lot of records (6+ million) but most of the rows per ID are all the same. Example: Row Date ID Col1 Col2 Col3 Col4 Col5 1 01-01-2021 1 a b c d e 2 02-01-2021 1 a b c d x 3 03-…

Accepted Answer

You can create a lagged array column of all columns of interest and compare it to the current row, then do a filter:

    from pyspark.sql import functions as F, Window

    cols = df.columns[3:]
    w = Window.partitionBy('ID').orderBy('Date')

    df2 = df.withColumn(
        'diff',
        F.coalesce(
            F.lag(F.array(*cols)).over(w) != F.array(*cols),
            F.lit(True)    # take care of first row where the lag is null
        )
    ).filter('diff').drop('diff')

    df2.show()
    +---+----------+---+----+----+----+----+----+
    |Row|      Date| ID|Col1|Col2|Col3|Col4|Col5|
    +---+----------+---+----+----+----+----+----+
    |  1|01-01-2021|  1|   a|   b|   c|   d|   e|
    |  2|02-01-2021|  1|   a|   b|   c|   d|   x|
    |  5|01-01-2021|  2|   a|   b|   c|   d|   e|
    |  6|02-01-2021|  2|   a|   b|   x|   d|   e|
    |  8|01-01-2021|  3|   a|   b|   c|   d|   e|
    +---+----------+---+----+----+----+----+----+
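To sanity-check the lag-and-compare logic on the sample data without a Spark session, here is a minimal plain-Python sketch of the same idea: order the rows per ID by date, then keep a row only if it is the first for its ID or its Col1..Col5 values differ from the previous row. The helper names `parse_date` and `changed_rows` are hypothetical, and the `dd-MM-yyyy` date format is an assumption (the sample dates do not disambiguate day from month).

```python
from datetime import datetime
from itertools import groupby

# Sample rows from the question: (Row, Date, ID, Col1, Col2, Col3, Col4, Col5)
rows = [
    (1, "01-01-2021", 1, "a", "b", "c", "d", "e"),
    (2, "02-01-2021", 1, "a", "b", "c", "d", "x"),
    (3, "03-01-2021", 1, "a", "b", "c", "d", "x"),
    (4, "04-01-2021", 1, "a", "b", "c", "d", "x"),
    (5, "01-01-2021", 2, "a", "b", "c", "d", "e"),
    (6, "02-01-2021", 2, "a", "b", "x", "d", "e"),
    (7, "03-01-2021", 2, "a", "b", "x", "d", "e"),
    (8, "01-01-2021", 3, "a", "b", "c", "d", "e"),
    (9, "02-01-2021", 3, "a", "b", "c", "d", "e"),
    (10, "03-01-2021", 3, "a", "b", "c", "d", "e"),
]


def parse_date(text):
    # Assumption: dates are formatted as dd-MM-yyyy.
    return datetime.strptime(text, "%d-%m-%Y")


def changed_rows(rows):
    """Keep the first row per ID and every row whose Col1..Col5 differ from the previous row."""
    # Equivalent of partitionBy('ID').orderBy('Date'), with dates parsed before sorting.
    ordered = sorted(rows, key=lambda r: (r[2], parse_date(r[1])))
    kept = []
    for _, group in groupby(ordered, key=lambda r: r[2]):
        previous = None  # plays the role of the null lag on the first row
        for row in group:
            values = row[3:]  # the columns of interest, like df.columns[3:]
            if previous is None or values != previous:
                kept.append(row)
            previous = values
    return kept


kept_ids = [r[0] for r in changed_rows(rows)]
print(kept_ids)  # [1, 2, 5, 6, 8]
```

The kept row numbers match the `df2.show()` output above. One thing the sketch makes explicit is the date parsing: if `Date` is stored as a string in this format, sorting it as text orders by day before month, so in Spark it may be worth converting it first, for example with `F.to_date('Date', 'dd-MM-yyyy')`, before using it in `orderBy`.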

Row	Date	ID	Col1	Col2	Col3	Col4	Col5
1	01-01-2021	1	a	b	c	d	e
2	02-01-2021	1	a	b	c	d	x
3	03-01-2021	1	a	b	c	d	x
4	04-01-2021	1	a	b	c	d	x
5	01-01-2021	2	a	b	c	d	e
6	02-01-2021	2	a	b	x	d	e
7	03-01-2021	2	a	b	x	d	e
8	01-01-2021	3	a	b	c	d	e
9	02-01-2021	3	a	b	c	d	e
10	03-01-2021	3	a	b	c	d	e

Answer