Spark SQL Partition By, Window, Order By, Count

Question

Say I have a dataframe containing magazine subscription information:

    subscription_id  user_id  created_at  expiration_date
    12384            1        2018-08-10  2018-12-10

…

Accepted Answer

Figured it out using PySpark.

I first created another column with an array of all expiration dates for each user:

    from pyspark.sql.functions import collect_set, udf
    from pyspark.sql.types import IntegerType

    joined_array = df.groupBy('user_id').agg(collect_set('expiration_date'))

Then joined that array back to the original dataframe:

    joined_array = joined_array.toDF('user_idDROP', 'expiration_date_array')
    df = df.join(joined_array, df.user_id == joined_array.user_idDROP, how='left').drop('user_idDROP')

Then created a function to iterate through the array and add 1 to the count if the created date is greater than the expiration date:

    def check_expiration_count(created_at, expiration_array):
        # No expiration dates were collected for this user
        if not expiration_array:
            return 0
        count = 0
        for i in expiration_array:
            if created_at > i:
                count += 1
        return count

    check_expiration_count = udf(check_expiration_count, IntegerType())

Then applied that function to create a new column with the correct count (note that the joined column is named expiration_date_array):

    df = df.withColumn('count_of_subs_ending_before_creation', check_expiration_count(df.created_at, df.expiration_date_array))

Voilà. Done. Thanks everyone (nobody helped but thanks anyway). Hope someone finds this useful in 2022.
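The counting rule the UDF applies can be checked without a Spark session. Here is a minimal sketch in plain Python; the function name count_expired_before and the sample dates are made up for illustration and are not from the original question:

```python
from datetime import date


def count_expired_before(created_at, expiration_dates):
    """Count the expiration dates that fall strictly before created_at."""
    # A missing or empty array means nothing could have expired yet.
    if not expiration_dates:
        return 0
    return sum(1 for expiration in expiration_dates if created_at > expiration)


# Hypothetical expiration dates for a single user (illustrative only).
expirations = [date(2018, 12, 10), date(2019, 3, 1), date(2019, 9, 15)]

print(count_expired_before(date(2018, 8, 10), expirations))  # prints 0
print(count_expired_before(date(2019, 1, 5), expirations))   # prints 1
```

One thing to keep in mind: collect_set drops duplicate values, so two subscriptions for the same user that expire on the same date are counted once. If each subscription should count separately, collect_list keeps the duplicates.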

Answer