I have a table with the following structure:
place_id       date_start  date_end
2826088480774  2017-09-19  2017-09-20
1898375544837  2017-08-01  2017-08-03
1425929142277  2017-09-23  2017-10-03
1013612281863  2016-10-12  2016-10-14
1795296329731  2016-10-13  2016-10-13
695784701956   2017-09-11  2017-11-02
I want to count how many events (each row is an event) took place at each place in each month. If an event's dates span several months, it should be counted in all affected months. place_id values can repeat, so I wrote the following query:
SELECT place_id,
       EXTRACT(MONTH FROM date_start) AS month,
       EXTRACT(YEAR FROM date_start) AS year,
       COUNT(*) AS events
FROM Table
GROUP BY place_id, year, month
ORDER BY month, year, events DESC
So I get the following grouped table:
place_id       month  year  events
2826088480774  8      2017  345
1898375544837  8      2017  343
1425929142277  8      2017  344
1013612281863  8      2017  355
1795296329731  8      2017  348
695784701956   8      2017  363
The problem is that the data is grouped only by date_start, and it's not clear to me how to distribute the events across all affected months in the range from date_start to date_end.
Answer
You can use the sequence function to generate the dates between date_start and date_end, then explode the generated array column and group by and count as you already did:
df.createOrReplaceTempView("EventsTable") spark.sql(""" WITH events AS ( SELECT place_id, explode(event_dates) as event_date FROM ( SELECT place_id, sequence(date_start, date_end, interval 1 day) as event_dates FROM EventsTable ) ) SELECT place_id, month(event_date) as month, year(event_date) as year, count(*) as events FROM events GROUP BY 1, 2, 3 ORDER BY month, year, events desc """).show() //+-------------+-----+----+------+ //| place_id|month|year|events| //+-------------+-----+----+------+ //|1898375544837| 8|2017| 3| //|695784701956 | 9|2017| 20| //|1425929142277| 9|2017| 8| //|2826088480774| 9|2017| 2| //|1013612281863| 10|2016| 3| //|1795296329731| 10|2016| 1| //|695784701956 | 10|2017| 31| //|1425929142277| 10|2017| 3| //|695784701956 | 11|2017| 2| //+-------------+-----+----+------+