Skip to content
Advertisement

Gap and Island problem – query not working for all periods

I have to create a query to find the gaps and islands between dates. This seems to be a standard gaps and island problem. To show my issue I will use sample of data. The queries are executed in Snowflake.

CREATE TABLE TEST (StartDate date, EndDate date);
INSERT INTO TEST
SELECT '8/20/2017', '8/21/2017'  UNION ALL
SELECT '8/22/2017', '9/22/2017'  UNION ALL
SELECT '8/23/2017', '9/23/2017'  UNION ALL 
SELECT '8/24/2017', '8/26/2017'  UNION ALL 
SELECT '8/28/2017', '9/19/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL 
SELECT '10/25/2017','11/3/2017'  UNION ALL 
SELECT '11/3/2017', '11/15/2017';

This code gives me a sample of table.

Then I have the code to find gaps and islands:

SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
    (
    SELECT
        *,
        CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
            TEST
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY 
    IslandStartDate

The results are:

enter image description here

As you see the problem is for period 8/28/2017 – 9/19/2017. This period should not be a separate island, because it should be included in the period: 8/23/2017 – 9/23/2017.

Do you have any idea how I can modify my query to get the correct results (so instead 6 I should have 5 islands as 8/28/2017 – 9/19/2017 should not be island). This just example of data, so I am looking for unversal solution, but so far I have not figure out the correct approach.

Advertisement

Answer

You can express the gaps-and-islands logic like this:

select min(startdate), max(enddate)
from (select t.*,
             sum(case when prev_enddate >= startdate then 0 else 1 end) over (order by startdate) as grp
      from (select t.*,
                   max(enddate) over (order by startdate rows between unbounded preceding and 1 preceding) as prev_enddate
            from test t
           ) t
     ) t
group by grp
order by min(startdate);

Here is a db<>fiddle.

The idea is to look for the maximum enddate on all the “earlier” rows. This value is used to check if there is an overlap.

So, the innermost subquery calculates the previous enddate. The middle subquery does a cumulative sum of the beginnings of groups to assign a group identifier.

The outer query just aggregates by the group identifier.

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement