Finding largest number of location IDs per hour from each zone

Question

I am using scala with spark and having a hard time understanding how to calculate the maximum count of pickups from a location corresponding to each hour. Currently I have a df with three columns (Location,hour,Zone) where Location is an integer, hour is an integer 0-23 signifying the hour of the day and Zone is a string. Something like this

Accepted Answer

You can use spark window functions for this task.At first you can group by the data to get a count of number of zones.val df = read_df.groupBy("hour", "zone").agg(count("*").as("count_order"))Then create a window to partition the data by hour and order it by total count. Then you have to calculate the rank over this partition of data.val byZoneName = Window.partitionBy($"hour").orderBy($"count_order".desc)val rankZone = rank().over(byZoneName)This will perform the operation and list out the rank of all the zones grouped by hour.val result_df = df.select($"*", rankZone as "rank")The output will be something like this:+----+----+-----------+----+|hour|zone|count_order|rank|+----+----+-----------+----+|   0|   A|          3|   1||   0|   C|          2|   2||   0|   B|          1|   3||   3|   A|          1|   1||   5|   B|          2|   1||   6|   D|          1|   1|+----+----+-----------+----+You can then filter out the data with rank 1.result_df.filter($"rank" === 1).orderBy("hour").show()You can check my code here:https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5114666914683617/1792645088721850/4927717998130263/latest.html

Advertisement

Answer