| date | Value1 | Value2 | Value3 |
|---|---|---|---|
| 16-08-2022 | a | b | e |
| 16-08-2022 | a | b | f |
| 16-08-2022 | c | d | f |
Expected output:

| date | Value1_a | Value1_c | Value2_b | Value2_d | Value3_e | Value3_f |
|---|---|---|---|---|---|---|
| 16-08-2022 | 2 | 1 | 2 | 1 | 1 | 2 |
The data continues like this for more columns, maybe 10. I want to aggregate on date and split each categorical column into counts per category. I am currently doing it like this:
```python
spark.sql("""
    select a.dateid, a.id,
           count(case when Value1 = 'a' then 1 end) as value1_a,
           count(case when Value1 = 'b' then 1 end) as value1_b
    from prep_data a
    group by a.dateid, a.id
""").show()
```
I need a better way to do this; any Python/SQL solution is fine.
Answer
You can create an array of structs from all fields except the date field, use `inline` to explode it into name/value columns, concatenate the field name with the field value, and then pivot on that concatenation with a count.
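For reference, here is a minimal setup sketch so the snippet below runs as-is. The DataFrame name `data_sdf` and the date column `dt` are the names the snippet uses; the value column names are lowercased here so the pivoted headers match the output shown, and the extra columns referenced in the question's query (`dateid`, `id`) are left out of this simplified sample.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample rows from the question; 'dt' and the lowercase value column names
# are assumptions chosen to match the snippet and output below
data_sdf = spark.createDataFrame(
    [
        ('16-08-2022', 'a', 'b', 'e'),
        ('16-08-2022', 'a', 'b', 'f'),
        ('16-08-2022', 'c', 'd', 'f'),
    ],
    ['dt', 'value1', 'value2', 'value3'],
)

# optional: register a temp view so SQL queries (like the one in the
# question) can run against the same data
data_sdf.createOrReplaceTempView('prep_data')
```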
```python
from pyspark.sql import functions as func

(
    data_sdf
    # build an array of (name, val) structs, one per non-date column
    .withColumn(
        'val_struct_arr',
        func.array(*[
            func.struct(func.lit(k).alias('name'), func.col(k).alias('val'))
            for k in data_sdf.columns
            if k != 'dt'
        ])
    )
    # explode the array into one row per (name, val) pair
    .selectExpr('dt', 'inline(val_struct_arr)')
    # e.g. name='value1', val='a' -> 'value1_a'
    .withColumn('name_val_concat', func.concat_ws('_', 'name', 'val'))
    # pivot the concatenated names into columns and count occurrences
    .groupBy('dt')
    .pivot('name_val_concat')
    .count()
    .fillna(0)
    .show()
)

# +----------+--------+--------+--------+--------+--------+--------+
# |        dt|value1_a|value1_c|value2_b|value2_d|value3_e|value3_f|
# +----------+--------+--------+--------+--------+--------+--------+
# |16-08-2022|       2|       1|       2|       1|       1|       2|
# +----------+--------+--------+--------+--------+--------+--------+
```
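As a rough alternative sketch (not part of the original answer), the same unpivot can be done with Spark SQL's `stack` in `selectExpr`, generating the stack arguments from the column list so it scales to the roughly 10 value columns mentioned in the question. It reuses the `data_sdf` and `dt` names assumed above.

```python
from pyspark.sql import functions as func

# build "stack(3, 'value1', value1, 'value2', value2, ...)" from the
# non-date columns so nothing has to be typed by hand
value_cols = [c for c in data_sdf.columns if c != 'dt']
stack_expr = 'stack({}, {}) as (name, val)'.format(
    len(value_cols),
    ', '.join("'{0}', {0}".format(c) for c in value_cols),
)

(
    data_sdf
    # unpivot into one (name, val) row per column value
    .selectExpr('dt', stack_expr)
    .withColumn('name_val_concat', func.concat_ws('_', 'name', 'val'))
    # same pivot-and-count as the main snippet
    .groupBy('dt')
    .pivot('name_val_concat')
    .count()
    .fillna(0)
    .show()
)
```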