
Finding largest number of location IDs per hour from each zone

I am using Scala with Spark and am having a hard time understanding how to calculate the maximum count of pickups from a location for each hour. Currently I have a DataFrame with three columns (Location, hour, Zone), where Location is an integer, hour is an integer from 0-23 signifying the hour of the day, and Zone is a string. Something like this below:
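For illustration, a small DataFrame with this schema could be built like so (the zone names and location IDs below are just made-up placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pickups").getOrCreate()
    import spark.implicits._

    // Made-up sample rows with the same schema: Location (Int), hour (Int, 0-23), Zone (String)
    val df = Seq(
      (237, 0, "Upper East Side"),
      (237, 0, "Upper East Side"),
      (161, 0, "Midtown"),
      (132, 1, "JFK Airport"),
      (132, 1, "JFK Airport"),
      (161, 1, "Midtown")
    ).toDF("Location", "hour", "Zone")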

What I need to do is find out, for each hour of the day (0-23), which zone has the largest number of pickups from a particular location.

So the answer should look something like this:
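With the made-up rows above, the desired result would have one row per hour, in this shape:

    +----+---------------+-----+
    |hour|           Zone|count|
    +----+---------------+-----+
    |   0|Upper East Side|    2|
    |   1|    JFK Airport|    2|
    +----+---------------+-----+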

What I first tried was to use an intermediate step to figure out the counts per zone and hour:
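Roughly like this (building on the sample df above):

    import org.apache.spark.sql.functions._

    // Count pickups for every (hour, Zone) pair
    val counts = df.groupBy("hour", "Zone").count()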

This gives me a dataframe that looks like this:
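On the made-up rows, that is one row per (hour, Zone) pair:

    +----+---------------+-----+
    |hour|           Zone|count|
    +----+---------------+-----+
    |   0|Upper East Side|    2|
    |   0|        Midtown|    1|
    |   1|    JFK Airport|    2|
    |   1|        Midtown|    1|
    +----+---------------+-----+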

I then tried doing the following:
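Something along these lines, re-grouping the counts by the same keys:

    // There is already one row per (hour, Zone), so this changes nothing
    val attempt1 = counts.groupBy("hour", "Zone").agg(max("count"))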

This doesn't do anything except group by hour and Zone, so I still have thousands of rows. I also tried:
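Roughly this, grouping only by hour:

    // 24 rows, one per hour, but the Zone column is lost in the aggregation
    val attempt2 = counts.groupBy("hour").agg(max("count"))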

The above gives me the max count and 24 rows for hours 0-23, but there is no Zone column, so the result looks like this:
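On the made-up data:

    +----+----------+
    |hour|max(count)|
    +----+----------+
    |   0|         2|
    |   1|         2|
    +----+----------+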

I would like the Zone column included so I know which zone had the max count for each of those hours. I was also looking into window functions to compute a rank, but I wasn't sure how to use them.


Answer

You can use Spark window functions for this task.

First, you can group the data to get the pickup count for each zone in each hour.
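A minimal sketch of this step, assuming the column names from the question (hour, Zone):

    import org.apache.spark.sql.functions._

    // Pickup count for every (hour, Zone) pair
    val zoneCounts = df.groupBy("hour", "Zone").count()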

Then create a window that partitions the data by hour and orders it by the count in descending order, and calculate the rank over this window.
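A sketch of the window and rank, with descending order so that rank 1 is the zone with the most pickups:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, rank}

    // Rank zones within each hour, highest pickup count first
    val byHour = Window.partitionBy("hour").orderBy(col("count").desc)
    val ranked = zoneCounts.withColumn("rank", rank().over(byHour))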

This computes the rank of every zone within each hour.

The output will be something like this:
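With the made-up data from the question, for example:

    +----+---------------+-----+----+
    |hour|           Zone|count|rank|
    +----+---------------+-----+----+
    |   0|Upper East Side|    2|   1|
    |   0|        Midtown|    1|   2|
    |   1|    JFK Airport|    2|   1|
    |   1|        Midtown|    1|   2|
    +----+---------------+-----+----+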

You can then filter the data to keep only the rows with rank 1.
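A sketch of that final filter:

    // Keep only the top-ranked zone for each hour
    val topZonePerHour = ranked.filter(col("rank") === 1).orderBy("hour")
    topZonePerHour.show(24, truncate = false)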

You can check my code here: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5114666914683617/1792645088721850/4927717998130263/latest.html
