I am new to Spark and this might be a straightforward problem.
I have a DataFrame named sql_left, produced by the SQL join shown below.
Here is a sample row generated using sql_left.take(1):
[Row(REPORT_ID='2016-30-15/08/2019', Stats Area='2 Metropolitan', Suburb='GREENACRES', Postcode=5086, LGA Name='CITY OF PORT ADELAIDE ENFIELD', Total Units=3, Total Cas=0, Total Fats=0, Total SI=0, Total MI=0, Year=2016, Month='November', Day='Wednesday', Time='01:20 am', Area Speed=50, Position Type='Not Divided', Horizontal Align='Straight road', Vertical Align='Level', Other Feat='Not Applicable', Road Surface='Sealed', Moisture Cond='Dry', Weather Cond='Not Raining', DayNight='Night', Crash Type='Hit Parked Vehicle', Unit Resp=1, Entity Code='Driver Rider', CSEF Severity='1: PDO', Traffic Ctrls='No Control', DUI Involved=None, Drugs Involved=None, ACCLOC_X=1331135.04, ACCLOC_Y=1677256.22, UNIQUE_LOC=13311351677256, REPORT_ID='2016-30-15/08/2019', Unit No=2, No Of Cas=0, Veh Reg State='UNKNOWN', Unit Type='Motor Vehicle - Type Unknown', Veh Year='XXXX', Direction Of Travel='East', Sex=None, Age=None, Lic State=None, Licence Class=None, Licence Type=None, Towing='Unknown', Unit Movement='Parked', Number Occupants='000', Postcode=None, Rollover=None, Fire=None)]
Note: the Age column contains 'XXX', 'NULL' and integer-like values such as 023, 034, etc.
printSchema() shows Age and Total Cas as integers.
I've used the code below to first join the two tables:
sql_left = spark.sql('''SELECT * FROM sql_crash c LEFT JOIN sql_units u ON c.REPORT_ID = u.REPORT_ID''')
sql_left.createOrReplaceTempView("mytable")
And the code below to compute the Total Cas sum:
sql_result = spark.sql('''
    select concat_ws(' ', Day, Month, Year, Time) as Date_Time,
           Age, "Licence Type", "Unit Type", Sex,
           COALESCE(sum("Total Cas"), 0) as Total_casualities
    from mytable
    where Suburb in ('ADELAIDE', 'ADELAIDE AIRPORT', 'NORTH ADELAIDE', 'PORT ADELAIDE')
    group by Date_Time, Age, "Licence Type", "Unit Type", Sex
    order by Total_casualities desc
''')
sql_result.show(20, truncate=False)
The output I'm getting is shown below, with the sum coming back as 0:
+--------------------------------+---+------------+---------+-------+-----------------+
|Date_Time                       |Age|Licence Type|Unit Type|Sex    |Total_casualities|
+--------------------------------+---+------------+---------+-------+-----------------+
|Friday December 2016 02:45 pm   |XXX|Licence Type|Unit Type|Unknown|0.0              |
|Saturday September 2017 06:35 pm|023|Licence Type|Unit Type|Male   |0.0              |
+--------------------------------+---+------------+---------+-------+-----------------+
I've tried multiple options, but nothing has worked. My main problem is that Total_casualities returns 0.0 for every row when I use COALESCE(sum("Total Cas"), 0); if I drop the COALESCE, the values are displayed as NULL.
Help is much appreciated.
Answer
Instead of specifying Total Cas in double quotes ("Total Cas"), reference it with backticks, i.e. `Total Cas`.
Note: a column name containing a space must be wrapped in backticks. Because you put it in quotation marks, Spark SQL treats it as a string literal, which is why you are not getting the sum. The same applies to the other quoted columns (Licence Type, Unit Type): they are displayed as literal strings instead of their values. Hope that clears it up.
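As a minimal illustration of the difference (a toy DataFrame invented for this example, not the asker's data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data with a space in the column name
df = spark.createDataFrame([(1,), (2,), (3,)], ["Total Cas"])
df.createOrReplaceTempView("toy")

# Double quotes produce a string literal; with default settings it casts to NULL,
# so the sum is NULL (and 0 after COALESCE). Under ANSI mode this may error instead.
spark.sql('SELECT COALESCE(sum("Total Cas"), 0) AS s FROM toy').show()

# Backticks reference the column, so the aggregate is computed over its values
spark.sql('SELECT sum(`Total Cas`) AS s FROM toy').show()   # s = 6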
sql_result = spark.sql('''
    select concat_ws(' ', Day, Month, Year, Time) as Date_Time,
           Age, `Licence Type`, `Unit Type`, Sex,
           sum(`Total Cas`) as Total_casualities
    from mytable
    where Suburb in ('ADELAIDE', 'ADELAIDE AIRPORT', 'NORTH ADELAIDE', 'PORT ADELAIDE')
    group by Date_Time, Age, `Licence Type`, `Unit Type`, Sex
    order by Total_casualities desc
''')
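If you would rather stay in the DataFrame API, the quoting question goes away, since functions like col() take the raw column name. A sketch under the assumption that the temp view mytable exists as in the question:

from pyspark.sql import functions as F

df = spark.table("mytable")
sql_result = (
    df.filter(F.col("Suburb").isin("ADELAIDE", "ADELAIDE AIRPORT",
                                   "NORTH ADELAIDE", "PORT ADELAIDE"))
      .withColumn("Date_Time", F.concat_ws(" ", "Day", "Month", "Year", "Time"))
      .groupBy("Date_Time", "Age", "Licence Type", "Unit Type", "Sex")
      .agg(F.sum("Total Cas").alias("Total_casualities"))
      .orderBy(F.desc("Total_casualities"))
)
sql_result.show(20, truncate=False)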