I have to create a query to find the gaps and islands between dates. This seems to be a standard gaps and island problem. To show my issue I will use sample of data. The queries are executed in Snowflake.
CREATE TABLE TEST (StartDate date, EndDate date); INSERT INTO TEST SELECT '8/20/2017', '8/21/2017' UNION ALL SELECT '8/22/2017', '9/22/2017' UNION ALL SELECT '8/23/2017', '9/23/2017' UNION ALL SELECT '8/24/2017', '8/26/2017' UNION ALL SELECT '8/28/2017', '9/19/2017' UNION ALL SELECT '9/23/2017', '9/27/2017' UNION ALL SELECT '9/25/2017', '10/10/2017' UNION ALL SELECT '10/17/2017','10/18/2017' UNION ALL SELECT '10/25/2017','11/3/2017' UNION ALL SELECT '11/3/2017', '11/15/2017';
This code gives me a sample of table.
Then I have the code to find gaps and islands:
SELECT MIN(StartDate) AS IslandStartDate, MAX(EndDate) AS IslandEndDate FROM ( SELECT *, CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd, SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId FROM ( SELECT ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN, StartDate, EndDate, LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate FROM TEST ) Groups ) Islands GROUP BY IslandId ORDER BY IslandStartDate
The results are:
As you see the problem is for period 8/28/2017 – 9/19/2017. This period should not be a separate island, because it should be included in the period: 8/23/2017 – 9/23/2017.
Do you have any idea how I can modify my query to get the correct results (so instead 6 I should have 5 islands as 8/28/2017 – 9/19/2017 should not be island). This just example of data, so I am looking for unversal solution, but so far I have not figure out the correct approach.
Advertisement
Answer
You can express the gaps-and-islands logic like this:
select min(startdate), max(enddate) from (select t.*, sum(case when prev_enddate >= startdate then 0 else 1 end) over (order by startdate) as grp from (select t.*, max(enddate) over (order by startdate rows between unbounded preceding and 1 preceding) as prev_enddate from test t ) t ) t group by grp order by min(startdate);
Here is a db<>fiddle.
The idea is to look for the maximum enddate on all the “earlier” rows. This value is used to check if there is an overlap.
So, the innermost subquery calculates the previous enddate. The middle subquery does a cumulative sum of the beginnings of groups to assign a group identifier.
The outer query just aggregates by the group identifier.