Spark: How to transpose and explode columns with dynamic nested arrays

Question

I applied the algorithm from the question "Spark: How to transpose and explode columns with nested arrays" to transpose and explode a nested Spark dataframe with dynamic arrays. I have added the row """{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}}""" to the dataframe, with a new column c whose array elements have a new val_dynamic field that can appear on a random basis. I'm looking for required output 2 (Transpose and

Accepted Answer

stack requires that all stacked columns have the same type. The problem here is that the structs inside the arrays have different members. One approach is to add the missing members to all structs so that the approach of my previous answer works again.

    cols = ['a', 'b', 'c']

    # create a map containing all struct fields per column
    existing_fields = {c: list(map(lambda field: field.name, df.schema.fields[i].dataType.elementType.fields))
                       for i, c in enumerate(df.columns) if c in cols}

    # get a (unique) set of all fields that exist in any of the columns
    all_fields = set(sum(existing_fields.values(), []))

    # create a list of transform expressions to fill up the structs with null fields
    transform_exprs = [f"transform({c}, e -> named_struct(" +
        ",".join([f"'{f}', {('e.'+f) if f in existing_fields[c] else 'cast(null as long)'}" for f in all_fields])
        + f")) as {c}" for c in cols]

    # create a df where all columns contain arrays with the same struct
    full_struct_df = df.selectExpr("id", *transform_exprs)

full_struct_df now has the schema

    root
     |-- id: long (nullable = true)
     |-- a: array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- val: long (nullable = true)
     |    |    |-- val_dynamic: long (nullable = true)
     |    |    |-- date: long (nullable = true)
     |-- b: array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- val: long (nullable = true)
     |    |    |-- val_dynamic: long (nullable = true)
     |    |    |-- date: long (nullable = true)
     |-- c: array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- val: long (nullable = true)
     |    |    |-- val_dynamic: long (nullable = true)
     |    |    |-- date: long (nullable = true)

From here the logic works as before:

    stack_expr = (f"stack({len(cols)}," +
        ",".join([f"'{c}',{c}" for c in cols]) +
        ")")

    transpose_df = (full_struct_df.selectExpr("id", stack_expr)
        .withColumnRenamed("col0", "cols")
        .withColumnRenamed("col1", "arrays")
        .filter("not arrays is null"))

    explode_df = transpose_df.selectExpr('id', 'cols', 'inline(arrays)')

The first part of this answer requires that each column mentioned in cols is an array of structs, and that all members of all structs are longs. The reason for the second restriction is the cast(null as long) used when creating the transform expression.
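The restriction to longs could probably be relaxed by choosing the null cast per field instead of hard-coding cast(null as long). The following is only a sketch of that idea in plain Python, without Spark: build_transform_exprs and the field_types mapping are hypothetical names that are not part of the original answer, and the mapping is assumed to hold the SQL type each field already has in the columns where it exists. stack would still fail if the same field had different types in different columns.

```python
from typing import Dict, List


def build_transform_exprs(
    existing_fields: Dict[str, List[str]],
    field_types: Dict[str, str],
) -> List[str]:
    """Build one Spark SQL transform() expression per array column.

    existing_fields maps each column to the struct fields it really has.
    field_types (hypothetical) maps each field name to the SQL type that
    is used when the field has to be filled in as null.
    """
    # Sort the union of all field names so every struct gets the same order.
    all_fields = sorted({f for fields in existing_fields.values() for f in fields})
    exprs = []
    for col, fields in existing_fields.items():
        members = ",".join(
            f"'{f}', " + (f"e.{f}" if f in fields else f"cast(null as {field_types[f]})")
            for f in all_fields
        )
        exprs.append(f"transform({col}, e -> named_struct({members})) as {col}")
    return exprs


# Example input mirroring the question: only column c has val_dynamic.
existing_fields = {
    "a": ["date", "val"],
    "c": ["date", "val", "val_dynamic"],
}
field_types = {"date": "long", "val": "long", "val_dynamic": "long"}

for expr in build_transform_exprs(existing_fields, field_types):
    print(expr)
```

The resulting strings can be passed to df.selectExpr("id", *exprs) in the same way as transform_exprs in the answer. Sorting the field names is a design choice of this sketch: it makes the struct layout reproducible across runs, whereas iterating over a set gives an order that is consistent only within one run.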

Answer