Skip to content
Advertisement

Pivot on Spark dataframe returns unexpected nulls on only one of several columns

I’ve pivoted a Spark dataframe, which works correctly for all columns except one, even though they’re all almost exactly the same.

I have a dataframe which looks like this:

+-------+------------+---------+
|     id|       cf_id|    value|
+-------+------------+---------+
|3688634|300016666252|telephone|
|3688634|300019829932|    false|
|3688307|300016666252|    email|
|3688307|300019829932|    false|
|3688635|300016666252|telephone|
|3688635|300019829932|    false|

(there are 29 distinct cf_id values, but in this example only two)

when I run:

df.groupBy("id").pivot("cf_id").agg(first("value")).show

I’d expect to see:

+-------+------------+------------+
|     id|300016666252|300019829932|
+-------+------------+------------+
|3634121|       email|       false|
|3634192|       email|        true|
|3634216|       email|        true|
|3634319|       email|        true|
|3634423|       email|       false|
|3634749|       email|        true|
|3635192|   telephone|       false|

All columns work correctly except the final one displayed here (300019829932), which converts all values to null.

I’ve extracted the single problem column as a separate dataframe and ran the pivot function on it, which transforms all the values to null.
I ran the same function on another column, which works as expected.
Both columns have a single string for the data, and their names are strings of 12 digits (which are aliased later).
I presume it can’t be due to the whole column name being digits as it works for all except one. I’ve also tried collect_set as the pivot aggregate function, which returns an empty set.

Spark version 2.2.0.2.6.4.0-91
Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)

Am I missing something here? The titles & values are very similar across columns, and the functions applied to them are exactly the same, yet the column always returns null values.

Advertisement

Answer

It looks as though the problem was due to there being true/false/null values in the column.

Somewhere in the pivot function it wasn’t handling the three values for a seemingly boolean type, and nulling everything.

So, (given a table with only boolean cf_id values), when casting the value as boolean it works.

val castdf = spark.sql("""select id, cf_id, cast(value as boolean) as value from df""")

castdf.groupBy($"id").pivot("cf_id").agg(first($"value")).show
+-------+------------+
|     id|360019829932|
+-------+------------+
|3663762|       false|
|3619941|        null|
|3667500|       false|
|3631088|        null|
|3668712|       false|
|3661298|        true|

I’m fairly new to spark and SQL, so I couldn’t explain why.
But in conclusion:
If you’re pivoting to a Spark dataframe which will have a string-type column containing true/false/null values, the column the values come from should be cast as boolean.

Thank you @rbcvl for your help

6 People found this is helpful
Advertisement