My PostgreSQL database stores school vacations, public holidays and weekend dates so that parents can plan their vacations. School vacations are often adjacent to weekends or public holidays. I want to display the total number of non-school days for a school vacation, including any adjacent weekend or public holiday.
Example Data
locations
SELECT id, name, is_federal_state FROM locations WHERE is_federal_state = true;
| id | name              | is_federal_state |
|----|-------------------|------------------|
| 2  | Baden-Württemberg | true             |
| 3  | Bayern            | true             |
holiday_or_vacation_types
SELECT id, name FROM holiday_or_vacation_types;
| id | name       |
|----|------------|
| 1  | Herbst     |
| 8  | Wochenende |
“Herbst” is German for “autumn” and “Wochenende” is German for “weekend”.
periods
SELECT id, starts_on, ends_on, holiday_or_vacation_type_id FROM periods WHERE location_id = 2 ORDER BY starts_on;
| id  | starts_on    | ends_on      | holiday_or_vacation_type_id |
|-----|--------------|--------------|-----------------------------|
| 670 | "2019-10-26" | "2019-10-27" | 8                           |
| 532 | "2019-10-28" | "2019-10-30" | 1                           |
| 533 | "2019-10-31" | "2019-10-31" | 1                           |
| 671 | "2019-11-02" | "2019-11-03" | 8                           |
| 672 | "2019-11-09" | "2019-11-10" | 8                           |
| 673 | "2019-11-16" | "2019-11-17" | 8                           |
Task
I want to select all periods where location_id equals 2, and I want to calculate the duration of each period in days. That can be done with this SQL query:
SELECT id, starts_on, ends_on, (ends_on - starts_on + 1) AS duration, holiday_or_vacation_type_id FROM periods WHERE location_id = 2 ORDER BY starts_on;
| id  | starts_on    | ends_on      | duration | holiday_or_vacation_type_id |
|-----|--------------|--------------|----------|-----------------------------|
| 670 | "2019-10-26" | "2019-10-27" | 2        | 8                           |
| 532 | "2019-10-28" | "2019-10-30" | 3        | 1                           |
| 533 | "2019-10-31" | "2019-10-31" | 1        | 1                           |
| 671 | "2019-11-02" | "2019-11-03" | 2        | 8                           |
| 672 | "2019-11-09" | "2019-11-10" | 2        | 8                           |
| 673 | "2019-11-16" | "2019-11-17" | 2        | 8                           |

Any human looking at the calendar would see that the ids 670 (weekend), 532 (fall vacation) and 533 (fall vacation) are adjacent, so they add up to a 6-day vacation period. So far I compute this in a separate program, but that takes quite a lot of resources (the actual table contains some 500,000 rows).
Problem 1
Which SQL query would produce the following output (it adds a real_duration column)? Is that even possible with SQL?
| id  | starts_on    | ends_on      | duration | real_duration | holiday_or_vacation_type_id |
|-----|--------------|--------------|----------|---------------|-----------------------------|
| 670 | "2019-10-26" | "2019-10-27" | 2        | 6             | 8                           |
| 532 | "2019-10-28" | "2019-10-30" | 3        | 6             | 1                           |
| 533 | "2019-10-31" | "2019-10-31" | 1        | 6             | 1                           |
| 671 | "2019-11-02" | "2019-11-03" | 2        | 2             | 8                           |
| 672 | "2019-11-09" | "2019-11-10" | 2        | 2             | 8                           |
| 673 | "2019-11-16" | "2019-11-17" | 2        | 2             | 8                           |
Problem 2
Is it possible to list the adjacent periods in a part_of_range field? This would be the result. Can that be done with SQL?
| id  | starts_on    | ends_on      | duration | part_of_range | holiday_or_vacation_type_id |
|-----|--------------|--------------|----------|---------------|-----------------------------|
| 670 | "2019-10-26" | "2019-10-27" | 2        | 670,532,533   | 8                           |
| 532 | "2019-10-28" | "2019-10-30" | 3        | 670,532,533   | 1                           |
| 533 | "2019-10-31" | "2019-10-31" | 1        | 670,532,533   | 1                           |
| 671 | "2019-11-02" | "2019-11-03" | 2        |               | 8                           |
| 672 | "2019-11-09" | "2019-11-10" | 2        |               | 8                           |
| 673 | "2019-11-16" | "2019-11-17" | 2        |               | 8                           |
Answer
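This is a gaps-and-islands problem. In this case you can use lag() to see where an island starts (a period whose predecessor ends more than one day before it begins), and then a cumulative sum of those starts to give every island its own group number.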
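To make the two steps visible, here is a minimal sketch that runs them on the six sample rows from the question. The inline `periods` CTE and the `starts_island` column are only there for illustration and are not part of the real schema. The sketch subtracts the integer `1` from the date instead of an interval, and it assumes that periods within a location never overlap or nest.

```sql
-- Sample rows copied from the question (location_id = 2).
WITH periods (id, starts_on, ends_on, location_id) AS (
    VALUES (670, DATE '2019-10-26', DATE '2019-10-27', 2),
           (532, DATE '2019-10-28', DATE '2019-10-30', 2),
           (533, DATE '2019-10-31', DATE '2019-10-31', 2),
           (671, DATE '2019-11-02', DATE '2019-11-03', 2),
           (672, DATE '2019-11-09', DATE '2019-11-10', 2),
           (673, DATE '2019-11-16', DATE '2019-11-17', 2)
),
flagged AS (
    -- Step 1: fetch the end date of the preceding period.
    SELECT id, starts_on, ends_on, location_id,
           lag(ends_on) OVER (PARTITION BY location_id
                              ORDER BY starts_on) AS prev_eo
    FROM periods
)
SELECT id, starts_on, ends_on, prev_eo,
       -- True when there is a gap of at least one day before this period.
       -- NULL for the first row, which has no predecessor.
       (prev_eo < starts_on - 1) AS starts_island,
       -- Step 2: running count of island starts = island number.
       count(*) FILTER (WHERE prev_eo < starts_on - 1)
           OVER (PARTITION BY location_id ORDER BY starts_on) AS grp
FROM flagged
ORDER BY starts_on;
```

On the sample data, `grp` comes out as 0 for ids 670, 532 and 533, and as 1, 2 and 3 for ids 671, 672 and 673, so the three adjacent periods share one group.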
The final operation is some aggregation (using window functions):
    SELECT p.*,
           (max(ends_on) OVER (PARTITION BY location_id, grp)
            - min(starts_on) OVER (PARTITION BY location_id, grp)) + 1 AS real_duration,
           -- partition by grp as well, so only ids of the same island are collected
           array_agg(p.id) OVER (PARTITION BY location_id, grp) AS part_of_range
    FROM (SELECT p.*,
                 -- running count of island starts = island number
                 count(*) FILTER (WHERE prev_eo < starts_on - INTERVAL '1 day')
                     OVER (PARTITION BY location_id ORDER BY starts_on) AS grp
          FROM (SELECT id, starts_on, ends_on, location_id, holiday_or_vacation_type_id,
                       lag(ends_on) OVER (PARTITION BY location_id ORDER BY starts_on) AS prev_eo
                FROM periods
               ) p
         ) p;
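The table in Problem 2 shows part_of_range as a comma-separated string that stays empty for periods without neighbours. One possible variation of the query above that produces this shape is sketched below. It is an illustration rather than the only way to do it: the inline `periods` CTE holds the sample rows from the question, and the named window `w` is a choice made here. The window is ordered with a full frame so that `string_agg` sees the whole island in `starts_on` order.

```sql
WITH periods (id, starts_on, ends_on, location_id) AS (
    VALUES (670, DATE '2019-10-26', DATE '2019-10-27', 2),
           (532, DATE '2019-10-28', DATE '2019-10-30', 2),
           (533, DATE '2019-10-31', DATE '2019-10-31', 2),
           (671, DATE '2019-11-02', DATE '2019-11-03', 2),
           (672, DATE '2019-11-09', DATE '2019-11-10', 2),
           (673, DATE '2019-11-16', DATE '2019-11-17', 2)
),
grouped AS (
    -- Same island numbering as in the answer: lag() plus a running count.
    SELECT p.*,
           count(*) FILTER (WHERE prev_eo < starts_on - 1)
               OVER (PARTITION BY location_id ORDER BY starts_on) AS grp
    FROM (SELECT periods.*,
                 lag(ends_on) OVER (PARTITION BY location_id
                                    ORDER BY starts_on) AS prev_eo
          FROM periods) p
)
SELECT id, starts_on, ends_on,
       ends_on - starts_on + 1 AS duration,
       (max(ends_on) OVER w) - (min(starts_on) OVER w) + 1 AS real_duration,
       -- Only list the ids when the island has more than one member;
       -- otherwise the CASE yields NULL.
       CASE WHEN count(*) OVER w > 1
            THEN string_agg(id::text, ',') OVER w
       END AS part_of_range
FROM grouped
WINDOW w AS (PARTITION BY location_id, grp
             ORDER BY starts_on
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY starts_on;
```

For the sample rows this gives a real_duration of 6 and a part_of_range of `670,532,533` for the first three periods, and a real_duration of 2 with a NULL part_of_range for the three standalone weekends.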