So, I have a DataFrame of this type:
+-------------------+
| date|
+-------------------+
|2020-05-10 22:40:51|
|2020-05-10 23:05:25|
|2020-05-10 22:49:42|
|2020-05-10 23:16:06|
|2020-05-10 22:33:25|
+-------------------+
And I want to add columns containing, for each row, the day, week, month and year counted from a given starting date (specified simply as a year, e.g. 2020 for 2020-01-01). At first I thought of using something like this:
import pyspark.sql.functions as F

dataframe = df.withColumn('year', F.year('date') - initial_date) \
              .withColumn('month', F.month('date') + F.col('year')*12) \
              .withColumn('week', F.weekofyear('date') + F.col('year')*52) \
              .withColumn('day', F.dayofyear('date') + F.col('year')*365)
Unfortunately this doesn't work correctly (except for year and month), since my dataset spans several years and years differ in length: some have 53 weeks rather than 52, and some have 366 days rather than 365. I know I could almost certainly do it with a UDF, but I'd like to keep that as a last resort, since my dataset will be quite big and I'd rather not sacrifice performance.
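To see the issue concretely, here is a minimal sketch in plain Python (the helper name naive_day_index is just for illustration): with 2020 as the start, the fixed 365-day multiplier maps two different dates to the same index, because 2020 actually has 366 days.

from datetime import date

initial_date = 2020

def naive_day_index(d):
    # day-of-year plus 365 per elapsed year, as in the formula above
    return d.timetuple().tm_yday + (d.year - initial_date) * 365

print(naive_day_index(date(2020, 12, 31)))  # 366
print(naive_day_index(date(2021, 1, 1)))    # also 366 -- a collision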
Answer
To calculate the day difference, you can use datediff, and from that you can get the week difference by dividing the number of days by 7 and rounding down to an integer:
import pyspark.sql.functions as F

initial_date = 2020

df2 = df.withColumn(
    'year',
    F.year('date') - initial_date
).withColumn(
    'month',
    F.month('date') + F.col('year')*12
    # or you can use
    # F.months_between('date', F.lit('%s-01-01' % initial_date)).cast('int')
).withColumn(
    'day',
    # days elapsed since the start date; handles leap years correctly
    F.datediff('date', F.lit('%s-01-01' % initial_date))
).withColumn(
    'week',
    # whole weeks elapsed, i.e. floor(day / 7)
    (F.col('day') / 7).cast('int')
)
df2.show()
+-------------------+----+-----+---+----+
| date|year|month|day|week|
+-------------------+----+-----+---+----+
|2020-05-10 22:40:51| 0| 5|130| 18|
|2020-05-10 23:05:25| 0| 5|130| 18|
|2020-05-10 22:49:42| 0| 5|130| 18|
|2020-05-10 23:16:06| 0| 5|130| 18|
|2020-05-10 22:33:25| 0| 5|130| 18|
+-------------------+----+-----+---+----+
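As a quick sanity check (a sketch, assuming a SparkSession named spark is available), the datediff-based index keeps counting correctly across the 2020-to-2021 leap-year boundary, exactly where the fixed-multiplier approach collides:

check = spark.createDataFrame(
    [('2020-12-31 12:00:00',), ('2021-01-01 12:00:00',)], ['date']
).withColumn('day', F.datediff('date', F.lit('2020-01-01')))
check.show()
+-------------------+---+
|               date|day|
+-------------------+---+
|2020-12-31 12:00:00|365|
|2021-01-01 12:00:00|366|
+-------------------+---+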