Sum of column returning all null values in PySpark SQL

I am new to Spark and this might be a straightforward problem.

I have a DataFrame named sql_left, which is in the following format:

Here is sample data generated using sql_left.take(1):

Note: the Age column contains 'XXX', 'NUll', and integer values such as 023, 034, etc.
printSchema shows Age and Total Cas as integers.

I tried the code below to first join the two tables:

And this code to compute Total Cas:

The output I'm getting is below, with the sum showing 0.

I've tried multiple options, but nothing has worked. My main problem is that Total_casualities is 0.0 for every row when I use COALESCE(sum("Total Cas"), 0); if I don't use COALESCE, the values display as NULL.

Help is much appreciated.


Answer

Instead of wrapping Total Cas in double quotes ("Total Cas"), wrap it in backticks (`Total Cas`).

Note: a column name containing a space must be quoted with backticks. When you put it in double quotes, Spark SQL treats it as a string literal, which is why you are not getting the sum. The same applies to the other columns (like Licence Type and Unit Type): they display the literal string instead of the column's values. Hope this helps.

User contributions licensed under: CC BY-SA