I have a PySpark DataFrame with two columns, id and count. I want to add a ranking by descending count, so that the highest count gets rank 1, the second highest rank 2, and so on.
testDF = spark.createDataFrame([("DJS232", 437232)], ["id", "count"])
I first tried using
from pyspark.sql import functions as F

testDF.withColumn('rank', F.monotonically_increasing_id())
and this worked, ish. It had monotonically increasing id numbers but the jump from the first to the second was quite large.
+-------+-----+-----------+
|     id|count|       rank|
+-------+-----+-----------+
| ABDSDS| 1401|          0|
|FJKSDF2|  691| 8589934592|
|  DJSKJ|  436|17179869184|
|FKLDFKL|  368|25769803776|
+-------+-----+-----------+
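As a side note on the size of the jump: monotonically_increasing_id() packs the partition index into the upper bits of the id, so ids in successive partitions differ by 2^33 = 8589934592, which is exactly the gap above. If all you wanted were consecutive ids, a minimal sketch would be to force a single partition first, though the ids would still follow row order rather than count order:

from pyspark.sql import functions as F

# One partition means the ids come out as 0, 1, 2, ... but they still
# follow physical row order, not count order
testDF.coalesce(1).withColumn('rank', F.monotonically_increasing_id())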
Then I tried getting the max of the count column and creating another column containing max - count. I thought this would be OK because the counts are not too variable and I don't care about ties.
maxCount = testDF.agg({"count": "max"}).collect()[0]
outputDF = testDF.withColumn('rank', maxCount[0] - testDF['count'])
This almost worked, but I found at least one negative value in the new column, meaning that max didn't actually return the maximum. (Also, I can hear my boss saying 'that is rather hacky'.)
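One plausible explanation for the negative values, assuming the count column was read in as strings (the question doesn't say), is that max then compares lexicographically, so e.g. '999' beats '1401'. A minimal sketch that guards against this by casting first:

from pyspark.sql import functions as F

# Hypothetical fix: cast count to a numeric type before taking the max,
# so agg(max) compares numbers rather than strings
numericDF = testDF.withColumn('count', F.col('count').cast('long'))
maxCount = numericDF.agg(F.max('count')).collect()[0][0]
outputDF = numericDF.withColumn('rank', F.lit(maxCount) - F.col('count'))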
I also tried row_number(), but this caused a Java error.
Any ideas for a clean solution? The dataset is rather small (at most 6000 records) and will eventually be inserted into an SQL database.
Answer
Try using a window function to create a row_number column ordered by the count column.
In this case the window won't be partitioned by any column, since the ranking runs over the whole dataset, but it does need to be ordered, and it should be ordered descending so that the highest count gets rank 1.
Maybe this will help
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order the window by count descending so the highest count gets rank 1
window = Window.orderBy(F.col('count').desc())
testDF.withColumn('rank', F.row_number().over(window))
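A quick end-to-end check, using made-up rows based on the output shown in the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

testDF = spark.createDataFrame(
    [('ABDSDS', 1401), ('FJKSDF2', 691), ('DJSKJ', 436), ('FKLDFKL', 368)],
    ['id', 'count'],
)

window = Window.orderBy(F.col('count').desc())
testDF.withColumn('rank', F.row_number().over(window)).show()
# +-------+-----+----+
# |     id|count|rank|
# +-------+-----+----+
# | ABDSDS| 1401|   1|
# |FJKSDF2|  691|   2|
# |  DJSKJ|  436|   3|
# |FKLDFKL|  368|   4|
# +-------+-----+----+

One caveat: a window with no partitioning pulls every row into a single partition, which is fine at 6000 records but would not scale to large data. And if ties should share a rank, swap row_number() for rank() or dense_rank().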