Below is part of a source file which you could imagine being much bigger:
date,code1,postcode,cityname,total
2020-03-27,2011,X700,Curepipe,44
2020-03-29,2011,X700,Curepipe,44
2020-03-26,2011,X700,Curepipe,22
2020-03-27,2035,X920,vacoas,3
2020-03-25,2011,X920,vacoas,1
2020-03-24,2122,X760,souillac,22
2020-03-23,2122,X760,souillac,11
2020-03-22,2257,X760,souillac,10
2020-03-27,2480,X510,rosehill,21
2020-03-22,2035,X510,rosehill,7
2020-03-20,2035,X510,rosehill,3
After the following code:
#Load data
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("source").getOrCreate()
dfcases = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("sourcefile.csv")  # inferSchema so that total is read as a number, not a string
dfcases.createOrReplaceTempView("tablecases")
spark.sql(XXXXXXXXXXXXX).show()
#Mysql code to insert
I would like to obtain this result:
Curepipe,X700,2020-03-27,44
Curepipe,X700,2020-03-29,44
souillac,X760,2020-03-24,22
rosehill,X510,2020-03-27,21
vacoas,X920,2020-03-27,3
The aim is to:
- Select, for each cityname, the date(s) on which it has the MAX total (note: a city can appear twice if it has the MAX total on two different dates),
- Sort by total descending, then date ascending, then cityname ascending.
Thanks!
Answer
You can get your result using a SQL window function in your query, as follows:
SELECT cityname, postcode, date, total
FROM (
    SELECT cityname, postcode, date,
           CAST(total AS INT) AS total,
           MAX(CAST(total AS INT)) OVER (PARTITION BY cityname) AS max_total
    FROM tablecases
) t
WHERE max_total = total
ORDER BY max_total DESC, date, cityname
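As a sanity check on the window semantics (keep every row whose total equals its city's maximum, ties included, then sort), the same logic can be reproduced in plain Python over the sample rows from the question, with no Spark required:

```python
from collections import defaultdict

# Sample rows from the question: (date, code1, postcode, cityname, total)
rows = [
    ("2020-03-27", "2011", "X700", "Curepipe", 44),
    ("2020-03-29", "2011", "X700", "Curepipe", 44),
    ("2020-03-26", "2011", "X700", "Curepipe", 22),
    ("2020-03-27", "2035", "X920", "vacoas", 3),
    ("2020-03-25", "2011", "X920", "vacoas", 1),
    ("2020-03-24", "2122", "X760", "souillac", 22),
    ("2020-03-23", "2122", "X760", "souillac", 11),
    ("2020-03-22", "2257", "X760", "souillac", 10),
    ("2020-03-27", "2480", "X510", "rosehill", 21),
    ("2020-03-22", "2035", "X510", "rosehill", 7),
    ("2020-03-20", "2035", "X510", "rosehill", 3),
]

# Equivalent of MAX(total) OVER (PARTITION BY cityname):
# the maximum total per city
max_total = defaultdict(int)
for date, code1, postcode, city, total in rows:
    max_total[city] = max(max_total[city], total)

# Keep rows whose total equals their city's max (ties kept),
# then sort by total descending, date ascending, cityname ascending
result = [
    (city, postcode, date, total)
    for date, code1, postcode, city, total in rows
    if total == max_total[city]
]
result.sort(key=lambda r: (-r[3], r[2], r[0]))

for r in result:
    print(",".join(map(str, r)))
```

This prints exactly the five rows in the expected result, including Curepipe twice because it hits its maximum of 44 on two different dates.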