How to assign lineage of dates from 2 columns correctly to each other?

Question

I have one table which looks like this: When visualizing the date overlaps, it looks like this: As you can see, contract_id=2 is not really considered a continuous contract to contract_id=1 but &#8230;

Accepted Answer

One approach is to recursively iterate over the data, starting with the origin row of the lineage, processing only those following rows which begin no sooner than one month before the end of the last found lineage row (your allowed overlap).  Assign a sequence value to each of these.  Once that is complete, we can union the rest of the rows not in lineage, marking them with a sequence value of 0, ordering the entire result by start_date.Here&#8217;s an attempt that uses fairly standard SQL:-- Updated to handle a one month contract in the valid lineage-- If more than one can exist for the same month,-- we'll need to order processing more carefully,-- make assumptions about id ordering or add our own guaranteed-- sequence before beginning the recursive part of the search.WITH RECURSIVE cte1 (id, start_date, end_date, seq) AS (       (SELECT t.id, t.start_date, t.end_date, 1 FROM contracts AS t ORDER BY t.start_date LIMIT 1) UNION ALL       (SELECT t.id, t.start_date, t.end_date, seq + 1          FROM cte1      AS t0          JOIN contracts AS t            ON t.start_date = (SELECT MIN(start_date) FROM contracts                                WHERE start_date >= (t0.end_date - INTERVAL '1' MONTH)                                  AND t.id <> t0.id                              )       )     )   , allrows (id, start_date, end_date, seq) AS (        SELECT t.id, t.start_date, t.end_date, 0          FROM contracts AS t         WHERE t.id NOT IN (SELECT id FROM cte1)         UNION        SELECT * FROM cte1     )SELECT *  FROM allrows ORDER BY start_date;Result:+------+------------+------------+------+| id   | start_date | end_date   | seq  |+------+------------+------------+------+|    1 | 2020-01-01 | 2021-01-01 |    1 ||    2 | 2020-03-01 | 2020-10-01 |    0 ||    3 | 2021-01-01 | 2021-03-01 |    2 ||    4 | 2021-03-02 | 2021-03-02 |    3 |  -- to handle this problem case.+------+------------+------------+------+If the set of contracts contains limited overlap, there could be more direct ways to generate the result without recursion or iteration.  But without making any assumptions about your data beyond what has been specified, the above seems reasonable.There&#8217;s more analysis that can be discussed.Update: The LIMIT clause could be replaced with FETCH FIRST ROW, from standard SQL.I don&#8217;t have access to snowflake, but here&#8217;s a working fiddle using PostgreSQL:Full working test case with PostgreSQLI don&#8217;t know if your data will contain the following problem contracts, but this will handle more than one one month contract (with valid lineage) starting on the same month.  This would have caused problems with the first solution (infinite recursion).  Added a baseseq to control the processing order and guarantee the search ignores prior searched contracts.WITH RECURSIVE contract_seq (id, start_date, end_date, baseseq) AS (        SELECT t.*, ROW_NUMBER() OVER (ORDER BY start_date, id) FROM contracts AS t     )   , cte1 (id, start_date, end_date, baseseq, seq) AS (       (SELECT t.id, t.start_date, t.end_date, t.baseseq, 1 FROM contract_seq AS t WHERE t.baseseq = 1) UNION ALL       (SELECT t.id, t.start_date, t.end_date, t.baseseq, seq + 1          FROM cte1         AS t0          JOIN contract_seq AS t            ON t.baseseq    = (SELECT MIN(baseseq) FROM contract_seq                                WHERE start_date >= (t0.end_date - INTERVAL '1' MONTH)                                  AND baseseq > t0.baseseq                              )       )     )   , allrows (id, start_date, end_date, baseseq, seq) AS (        SELECT t.id, t.start_date, t.end_date, t.baseseq, 0          FROM contract_seq AS t         WHERE t.id NOT IN (SELECT id FROM cte1)         UNION        SELECT * FROM cte1     )SELECT *  FROM allrows ORDER BY start_date, baseseq;Result:+------+------------+------------+---------+------+| id   | start_date | end_date   | baseseq | seq  |+------+------------+------------+---------+------+|    1 | 2020-01-01 | 2021-01-01 |       1 |    1 ||    2 | 2020-03-01 | 2020-10-01 |       2 |    0 ||    3 | 2021-01-01 | 2021-03-01 |       3 |    2 ||    4 | 2021-03-02 | 2021-03-02 |       4 |    3 | -- Problem case 1.|    5 | 2021-03-02 | 2021-03-02 |       5 |    4 | -- Problem case 2.+------+------------+------------+---------+------+This intentionally does not assume id is in increasing date order, and instead creates a separate baseseq that guarantees that order.Here&#8217;s the updated test case:Full test case with added problem casesThere&#8217;s another problem edge case, where two or more contracts start on the same date, but only one is considered in the valid lineage, we would adjust the ordering (baseseq) to process them in the required order.  Since I&#8217;m already beyond any explicit detail provided in the question, I&#8217;ll leave that one alone, for now.  There&#8217;s a simple change to address this issue (by adjusting baseseq), but needs more detail to be provided in the question.

Advertisement

Answer