Groupby fill missing values in dataframe based on average of previous values available and next value available

Question

I have data frame which has some groups and I want to fill the missing values based on last previous available and next value available average of score column i.e. (previous value+next value)/2. I &#8230;

Accepted Answer

Perhaps this is helpful &#8211;Load the test datadf2.show(false)    df2.printSchema()    /**      * +-----+-----+      * |class|score|      * +-----+-----+      * |A    |null |      * |A    |46   |      * |A    |null |      * |A    |null |      * |A    |35   |      * |A    |null |      * |A    |null |      * |A    |null |      * |A    |46   |      * |A    |null |      * |A    |null |      * |B    |78   |      * |B    |null |      * |B    |null |      * |B    |null |      * |B    |null |      * |B    |null |      * |B    |56   |      * |B    |null |      * +-----+-----+      *      * root      * |-- class: string (nullable = true)      * |-- score: integer (nullable = true)      */Impute Null values from score columns(check new_score column)    val w1 = Window.partitionBy("class").rowsBetween(Window.unboundedPreceding, Window.currentRow)    val w2 = Window.partitionBy("class").rowsBetween(Window.currentRow, Window.unboundedFollowing)    df2.withColumn("previous", last("score", ignoreNulls = true).over(w1))      .withColumn("next", first("score", ignoreNulls = true).over(w2))      .withColumn("new_score", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)      .drop("next", "previous")      .show(false)    /**      * +-----+-----+---------+      * |class|score|new_score|      * +-----+-----+---------+      * |A    |null |46.0     |      * |A    |46   |46.0     |      * |A    |null |40.5     |      * |A    |null |40.5     |      * |A    |35   |35.0     |      * |A    |null |40.5     |      * |A    |null |40.5     |      * |A    |null |40.5     |      * |A    |46   |46.0     |      * |A    |null |46.0     |      * |A    |null |46.0     |      * |B    |78   |78.0     |      * |B    |null |67.0     |      * |B    |null |67.0     |      * |B    |null |67.0     |      * |B    |null |67.0     |      * |B    |null |67.0     |      * |B    |56   |56.0     |      * |B    |null |56.0     |      * +-----+-----+---------+      */

Advertisement

Answer

Load the test data

Impute Null values from score columns(check new_score column)