Below is part of a source file which you could imagine being much bigger:
date,code1,postcode,cityname,total
2020-03-27,2011,X700,Curepipe,44
2020-03-29,2011,X700,Curepipe,44
2020-03-26,2011,X700,Curepipe,22
2020-03-27,2035,X920,vacoas,3
2020-03-25,2011,X920,vacoas,1
2020-03-24,2122,X760,souillac,22
2020-03-23,2122,X760,souillac,11
2020-03-22,2257,X760,souillac,10
2020-03-27,2480,X510,rosehill,21
2020-03-22,2035,X510,rosehill,7
2020-03-20,2035,X510,rosehill,3
After the following code:
#Load data
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("source").getOrCreate()
dfcases = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("sourcefile.csv")  # inferSchema so that total is read as a number, not a string
dfcases.createOrReplaceTempView("tablecases")
spark.sql(XXXXXXXXXXXXX).show()
#Mysql code to insert
I would like to obtain this result:
Curepipe,X700,2020-03-27,44
Curepipe,X700,2020-03-29,44
souillac,X760,2020-03-24,22
rosehill,X510,2020-03-27,21
vacoas,X920,2020-03-27,3
The aim is to:
- Select, for each cityname, the date(s) on which it has the MAX total (note: a city can appear twice if it has the MAX total on two different dates),
- Sort by total descending, then date ascending, then cityname ascending.
Thanks!
Answer
You can get your result using a SQL window function in your query, as follows:
SELECT cityname, postcode, date, total
FROM (
    SELECT cityname, postcode, date,
           CAST(total AS INT) AS total,
           MAX(CAST(total AS INT)) OVER (PARTITION BY cityname) AS max_total
    FROM tablecases
) t
WHERE max_total = total
ORDER BY max_total DESC, date, cityname
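As a sanity check on the window semantics (keep every row whose total equals its city's maximum, ties included, then sort), the same logic can be reproduced in plain Python over the sample rows from the question, with no Spark required:

```python
from collections import defaultdict

# Sample rows from the question: (date, code1, postcode, cityname, total)
rows = [
    ("2020-03-27", "2011", "X700", "Curepipe", 44),
    ("2020-03-29", "2011", "X700", "Curepipe", 44),
    ("2020-03-26", "2011", "X700", "Curepipe", 22),
    ("2020-03-27", "2035", "X920", "vacoas", 3),
    ("2020-03-25", "2011", "X920", "vacoas", 1),
    ("2020-03-24", "2122", "X760", "souillac", 22),
    ("2020-03-23", "2122", "X760", "souillac", 11),
    ("2020-03-22", "2257", "X760", "souillac", 10),
    ("2020-03-27", "2480", "X510", "rosehill", 21),
    ("2020-03-22", "2035", "X510", "rosehill", 7),
    ("2020-03-20", "2035", "X510", "rosehill", 3),
]

# Equivalent of MAX(total) OVER (PARTITION BY cityname):
# the maximum total per city
max_total = defaultdict(int)
for date, code1, postcode, city, total in rows:
    max_total[city] = max(max_total[city], total)

# Keep rows whose total equals their city's max (ties kept),
# then sort by total descending, date ascending, cityname ascending
result = [
    (city, postcode, date, total)
    for date, code1, postcode, city, total in rows
    if total == max_total[city]
]
result.sort(key=lambda r: (-r[3], r[2], r[0]))

for r in result:
    print(",".join(map(str, r)))
```

This prints exactly the five rows in the expected result, including Curepipe twice because it hits its maximum of 44 on two different dates.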