
Pyspark: How to flatten nested arrays by merging values in spark

I have 10000 JSON files, each with a different id, and each contains 10000 names. How can I flatten nested arrays by merging values by int or str in PySpark?

EDIT: I have added the column name_10000_xvz to better explain the data structure. I have also updated the Notes, input df, required output df, and input JSON files.

Notes:

  • The input dataframe has more than 10000 columns (name_1_a, name_1000_xx, …), so the column (array) names cannot be hardcoded, as that would require writing out 10000 names
  • id, date, and val always follow the same naming convention across all columns and all JSON files
  • The array size can vary, but date and val are always present, so they can be hardcoded
  • date can differ in each array; for example, name_1_a starts with 2001, but name_10000_xvz for id == 1 starts with 2000 and finishes with 2004, while for id == 2 it starts with 1990 and finishes with 2004

Input df:

Required output df:

To reproduce input df:

Useful links:


Answer

UPDATE

As @werner mentioned, it's necessary to transform all the structs to append the column name into them.

OLD

Assuming:

  • the date value is always the same across all columns
  • name_1_a, name_1_b, and name_2_a are equal in size
User contributions licensed under: CC BY-SA