Gap and Island problem – query not working for all periods

I have to create a query to find the gaps and islands between dates. This seems to be a standard gaps and island problem. To show my issue I will use sample of data. The queries are executed in Snowflake.

CREATE TABLE TEST (StartDate date, EndDate date);
INSERT INTO TEST
SELECT '8/20/2017', '8/21/2017'  UNION ALL
SELECT '8/22/2017', '9/22/2017'  UNION ALL
SELECT '8/23/2017', '9/23/2017'  UNION ALL 
SELECT '8/24/2017', '8/26/2017'  UNION ALL 
SELECT '8/28/2017', '9/19/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL 
SELECT '10/25/2017','11/3/2017'  UNION ALL 
SELECT '11/3/2017', '11/15/2017';

​x
 
CREATE TABLE TEST (StartDate date, EndDate date);INSERT INTO TESTSELECT '8/20/2017', '8/21/2017'  UNION ALLSELECT '8/22/2017', '9/22/2017'  UNION ALLSELECT '8/23/2017', '9/23/2017'  UNION ALL SELECT '8/24/2017', '8/26/2017'  UNION ALL SELECT '8/28/2017', '9/19/2017'  UNION ALL SELECT '9/23/2017', '9/27/2017'  UNION ALL SELECT '9/25/2017', '10/10/2017' UNION ALLSELECT '10/17/2017','10/18/2017' UNION ALL SELECT '10/25/2017','11/3/2017'  UNION ALL SELECT '11/3/2017', '11/15/2017';​

This code gives me a sample of table.

Then I have the code to find gaps and islands:

SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
    (
    SELECT
        *,
        CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
            TEST
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY 
    IslandStartDate

 
SELECT    MIN(StartDate) AS IslandStartDate,    MAX(EndDate) AS IslandEndDateFROM    (    SELECT        *,        CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,        SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId    FROM    (        SELECT            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,            StartDate,            EndDate,            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate        FROM            TEST    ) Groups) IslandsGROUP BY    IslandIdORDER BY     IslandStartDate​

The results are:

As you see the problem is for period 8/28/2017 – 9/19/2017. This period should not be a separate island, because it should be included in the period: 8/23/2017 – 9/23/2017.

Do you have any idea how I can modify my query to get the correct results (so instead 6 I should have 5 islands as 8/28/2017 – 9/19/2017 should not be island). This just example of data, so I am looking for unversal solution, but so far I have not figure out the correct approach.

Answer

You can express the gaps-and-islands logic like this:

select min(startdate), max(enddate)
from (select t.*,
             sum(case when prev_enddate >= startdate then 0 else 1 end) over (order by startdate) as grp
      from (select t.*,
                   max(enddate) over (order by startdate rows between unbounded preceding and 1 preceding) as prev_enddate
            from test t
           ) t
     ) t
group by grp
order by min(startdate);

 
select min(startdate), max(enddate)from (select t.*,             sum(case when prev_enddate >= startdate then 0 else 1 end) over (order by startdate) as grp      from (select t.*,                   max(enddate) over (order by startdate rows between unbounded preceding and 1 preceding) as prev_enddate            from test t           ) t     ) tgroup by grporder by min(startdate);​

Here is a db<>fiddle.

The idea is to look for the maximum enddate on all the “earlier” rows. This value is used to check if there is an overlap.

So, the innermost subquery calculates the previous enddate. The middle subquery does a cumulative sum of the beginnings of groups to assign a group identifier.

The outer query just aggregates by the group identifier.

Advertisement

Answer