How to add a ranking to a pyspark dataframe

I have a pyspark dataframe with 2 columns – id and count. I want to add a ranking to this by reverse count. So the highest count has rank 1, second highest rank 2, etc.

testDF = spark.createDataFrame([("DJS232", 437232)], ["id", "count"])

I first tried using

from pyspark.sql import functions as F
testDF.withColumn('rank', F.monotonically_increasing_id())

and this worked, ish. It had monotonically increasing id numbers but the jump from the first to the second was quite large.

|     id|count|       rank|
|ABDSDS | 1401|          0|
|FJKSDF2|  691| 8589934592|
|DJSKJ  |  436|17179869184|
|FKLDFKL|  368|25769803776|
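The large jumps are expected. The Spark documentation for monotonically_increasing_id says the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits, so the IDs are unique and increasing but not consecutive. The sketch below is plain Python (no Spark needed) that decodes the values shown above under that assumption; the helper name split_id is made up for illustration.

```python
# Values copied from the 'rank' column in the output above.
observed_ids = [0, 8589934592, 17179869184, 25769803776]

# Assumption taken from the Spark docs: the lower 33 bits hold the record
# number within a partition, the bits above hold the partition ID.
LOW_BITS = 33

def split_id(generated_id):
    """Return (partition_id, record_number) for a generated ID."""
    partition_id = generated_id >> LOW_BITS
    record_number = generated_id & ((1 << LOW_BITS) - 1)
    return partition_id, record_number

decoded = [split_id(i) for i in observed_ids]
for generated_id, (partition_id, record_number) in zip(observed_ids, decoded):
    print(generated_id, "-> partition", partition_id, "record", record_number)
```

Under that layout, each of the four rows is the first record of a different partition, which would explain why the IDs step by 2**33 rather than by 1.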

Then I tried getting the max count from the count column and creating another column that was max-count. I thought this would be OK because the counts are not too variable and I don’t care about ties.

maxCount = testDF.agg({"count": "max"}).collect()[0]
outputDF = testDF.withColumn('rank', maxCount[0]-testDF['count'])

This worked, almost. But at least one row came out with a negative value, meaning that max didn’t get the max. (Also, I can hear my boss saying ‘that is rather hacky’.)

I also tried row_count() but this caused a Java error.

Any ideas for a clean solution? The dataset is rather small, and will have max 6000 records and will eventually be inserted into an SQL database.



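Try using a window function to create a row_number column ordered by the count column, descending so that the highest count gets rank 1.
In this case, the window won't be partitioned by any column, because the rank is computed across the whole DataFrame, but it does need to be ordered. Spark moves all rows to a single partition for an unpartitioned window, which is fine for around 6000 records.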

Maybe this will help

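from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Order by count descending so the highest count gets rank 1
window = Window.orderBy(F.desc('count'))
outputDF = testDF.withColumn('rank', F.row_number().over(window))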
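
To see what numbering to expect, here is a plain-Python stand-in (not Spark code) for row_number() over a window ordered by descending count, using the four sample rows from the question. The variable names are only illustrative, and the input order is shuffled on purpose to show that the result depends only on the ordering.

```python
# Sample rows from the question, deliberately out of order.
rows = [
    ("DJSKJ", 436),
    ("ABDSDS", 1401),
    ("FKLDFKL", 368),
    ("FJKSDF2", 691),
]

# Sort by count, highest first, then number the rows starting at 1.
# This mirrors what row_number() does over a window ordered by count descending.
ordered = sorted(rows, key=lambda row: row[1], reverse=True)
ranked = [(row_id, count, rank)
          for rank, (row_id, count) in enumerate(ordered, start=1)]

for row_id, count, rank in ranked:
    print(row_id, count, rank)
```

Like row_number(), this gives tied counts different consecutive numbers, which matches the "I don't care about ties" requirement. If ties should share a rank, pyspark.sql.functions also provides rank() and dense_rank(), which can be used over the same window.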
User contributions licensed under: CC BY-SA