So, I have a DataFrame of this type:
+-------------------+
|               date|
+-------------------+
|2020-05-10 22:40:51|
|2020-05-10 23:05:25|
|2020-05-10 22:49:42|
|2020-05-10 23:16:06|
|2020-05-10 22:33:25|
+-------------------+
And I want to create multiple columns containing, for each row, the day, week, month and year elapsed since a certain starting date (given simply as a year, like 2020 for 2020-01-01). At first I thought of using something like this:
import pyspark.sql.functions as F

dataframe = (
    df.withColumn('year', F.year('date') - initial_date)
      .withColumn('month', F.month('date') + F.col('year') * 12)
      .withColumn('week', F.weekofyear('date') + F.col('year') * 52)
      .withColumn('day', F.dayofyear('date') + F.col('year') * 365)
)
Unfortunately, this doesn't work correctly (except for year and month), since my dataset spans several years and some of them have more weeks (53 instead of 52) or more days (366 instead of 365). I know I could almost certainly do this with a UDF, but I want to use that only as a last resort, since my dataset will be quite big and I'd rather not sacrifice performance.
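For example, the end of 2020 already shows the problem (a minimal check; the `spark` session name is an assumption, the rest is standard PySpark):

from pyspark.sql import functions as F

# 2020 is a leap year that ends in ISO week 53, so any formula that
# multiplies the year offset by a fixed 52 weeks / 365 days drifts
# as soon as the data crosses such a year.
demo = spark.createDataFrame([('2020-12-31',)], ['date'])
demo.select(
    F.weekofyear('date').alias('week'),
    F.dayofyear('date').alias('day'),
).show()
# +----+---+
# |week|day|
# +----+---+
# |  53|366|
# +----+---+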
Answer
To calculate the day difference, you can use datediff, and from that you can get the week difference by dividing the number of days by 7 and rounding down to an integer:
import pyspark.sql.functions as F

initial_date = 2020

df2 = df.withColumn(
    'year', F.year('date') - initial_date
).withColumn(
    'month', F.month('date') + F.col('year')*12
    # or you can use (note this is zero-based, i.e. one less than the line above):
    # F.months_between('date', F.lit('%s-01-01' % initial_date)).cast('int')
).withColumn(
    'day', F.datediff('date', F.lit('%s-01-01' % initial_date))
).withColumn(
    'week', (F.col('day') / 7).cast('int')
)

df2.show()
+-------------------+----+-----+---+----+
|               date|year|month|day|week|
+-------------------+----+-----+---+----+
|2020-05-10 22:40:51|   0|    5|130|  18|
|2020-05-10 23:05:25|   0|    5|130|  18|
|2020-05-10 22:49:42|   0|    5|130|  18|
|2020-05-10 23:16:06|   0|    5|130|  18|
|2020-05-10 22:33:25|   0|    5|130|  18|
+-------------------+----+-----+---+----+
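If you want to run the snippet end to end, one way to build the input frame is the following (a sketch; the timestamps are the ones from the question):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample frame from the question as proper timestamps.
df = spark.createDataFrame(
    [('2020-05-10 22:40:51',),
     ('2020-05-10 23:05:25',),
     ('2020-05-10 22:49:42',),
     ('2020-05-10 23:16:06',),
     ('2020-05-10 22:33:25',)],
    ['date'],
).withColumn('date', F.to_timestamp('date'))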