
Spark SQL Partition By, Window, Order By, Count

Say I have a dataframe containing magazine subscription information:

Now I want to add a column that states how many previous subscriptions a user had that expired before this current subscription began. In other words, how many expiration dates associated with a given user were before this subscription’s start date. Here is the full desired output:

Attempts:

EDIT: I tried a variety of lag/lead/etc. approaches in Python, and I’m now thinking this is a SQL problem

EDIT, EDIT: Never mind, the attempt above doesn’t work

I think I exhausted the lag/lead/shift method and found it doesn’t work. I’m now thinking it would be best to do this using Spark SQL, perhaps with a CASE WHEN to produce the new column, combined with a HAVING COUNT grouped by user ID?


Answer

Figured it out using PySpark:

I first created another column with an array of all expiration dates for each user:

Then joined that array back to the original dataframe:

Then created a function to iterate through the array and add 1 to the count whenever the subscription’s created (start) date is greater than an expiration date:

Then applied that function to create a new column with the correct count:

Voilà. Done. Thanks everyone (nobody helped but thanks anyway). Hope someone finds this useful in 2022

User contributions licensed under: CC BY-SA