I have one table which looks like this:
When visualizing the date overlaps, it looks like this:
As you can see, contract_id=2
is not really considered a continuous contract to contract_id=1
but contract_id=3
is. This is due to the 1-month overlap of end_date
of id=1
and start_date
of id=3
which indicates the true “lineage”. In general, this logic applies:
Any id which dates that are set between two other id dates and which has >1 month of overlap with the previous one (by date sequence) is NOT considered true lineage. So, for example, if a new entry id=4
is continuous with id=2
(by date), it should be eliminated because id=2
is not a valid lineage in the first place. So, there are sort of parent-child relationships to be considered.
How can I bring this information into a SQL query which will output something like this:
Also, I have thousands of such instances that can have random dates, so I am looking to incorporate a solution to would apply to all these cases, not just the one I show-cased.
I hope someone can guide what direction to go. Thanks!
Advertisement
Answer
One approach is to recursively iterate over the data, starting with the origin row of the lineage, processing only those following rows which begin no sooner than one month before the end of the last found lineage row (your allowed overlap). Assign a sequence value to each of these. Once that is complete, we can union the rest of the rows not in lineage, marking them with a sequence value of 0, ordering the entire result by start_date.
Here’s an attempt that uses fairly standard SQL:
-- Updated to handle a one month contract in the valid lineage -- If more than one can exist for the same month, -- we'll need to order processing more carefully, -- make assumptions about id ordering or add our own guaranteed -- sequence before beginning the recursive part of the search. WITH RECURSIVE cte1 (id, start_date, end_date, seq) AS ( (SELECT t.id, t.start_date, t.end_date, 1 FROM contracts AS t ORDER BY t.start_date LIMIT 1) UNION ALL (SELECT t.id, t.start_date, t.end_date, seq + 1 FROM cte1 AS t0 JOIN contracts AS t ON t.start_date = (SELECT MIN(start_date) FROM contracts WHERE start_date >= (t0.end_date - INTERVAL '1' MONTH) AND t.id <> t0.id ) ) ) , allrows (id, start_date, end_date, seq) AS ( SELECT t.id, t.start_date, t.end_date, 0 FROM contracts AS t WHERE t.id NOT IN (SELECT id FROM cte1) UNION SELECT * FROM cte1 ) SELECT * FROM allrows ORDER BY start_date ; Result: +------+------------+------------+------+ | id | start_date | end_date | seq | +------+------------+------------+------+ | 1 | 2020-01-01 | 2021-01-01 | 1 | | 2 | 2020-03-01 | 2020-10-01 | 0 | | 3 | 2021-01-01 | 2021-03-01 | 2 | | 4 | 2021-03-02 | 2021-03-02 | 3 | -- to handle this problem case. +------+------------+------------+------+
If the set of contracts contains limited overlap, there could be more direct ways to generate the result without recursion or iteration. But without making any assumptions about your data beyond what has been specified, the above seems reasonable.
There’s more analysis that can be discussed.
Update: The LIMIT clause could be replaced with FETCH FIRST ROW
, from standard SQL.
I don’t have access to snowflake, but here’s a working fiddle using PostgreSQL:
Full working test case with PostgreSQL
I don’t know if your data will contain the following problem contracts, but this will handle more than one one month contract (with valid lineage) starting on the same month. This would have caused problems with the first solution (infinite recursion). Added a baseseq to control the processing order and guarantee the search ignores prior searched contracts.
WITH RECURSIVE contract_seq (id, start_date, end_date, baseseq) AS ( SELECT t.*, ROW_NUMBER() OVER (ORDER BY start_date, id) FROM contracts AS t ) , cte1 (id, start_date, end_date, baseseq, seq) AS ( (SELECT t.id, t.start_date, t.end_date, t.baseseq, 1 FROM contract_seq AS t WHERE t.baseseq = 1) UNION ALL (SELECT t.id, t.start_date, t.end_date, t.baseseq, seq + 1 FROM cte1 AS t0 JOIN contract_seq AS t ON t.baseseq = (SELECT MIN(baseseq) FROM contract_seq WHERE start_date >= (t0.end_date - INTERVAL '1' MONTH) AND baseseq > t0.baseseq ) ) ) , allrows (id, start_date, end_date, baseseq, seq) AS ( SELECT t.id, t.start_date, t.end_date, t.baseseq, 0 FROM contract_seq AS t WHERE t.id NOT IN (SELECT id FROM cte1) UNION SELECT * FROM cte1 ) SELECT * FROM allrows ORDER BY start_date, baseseq ; Result: +------+------------+------------+---------+------+ | id | start_date | end_date | baseseq | seq | +------+------------+------------+---------+------+ | 1 | 2020-01-01 | 2021-01-01 | 1 | 1 | | 2 | 2020-03-01 | 2020-10-01 | 2 | 0 | | 3 | 2021-01-01 | 2021-03-01 | 3 | 2 | | 4 | 2021-03-02 | 2021-03-02 | 4 | 3 | -- Problem case 1. | 5 | 2021-03-02 | 2021-03-02 | 5 | 4 | -- Problem case 2. +------+------------+------------+---------+------+
This intentionally does not assume id is in increasing date order, and instead creates a separate baseseq that guarantees that order.
Here’s the updated test case:
Full test case with added problem cases
There’s another problem edge case, where two or more contracts start on the same date, but only one is considered in the valid lineage, we would adjust the ordering (baseseq) to process them in the required order. Since I’m already beyond any explicit detail provided in the question, I’ll leave that one alone, for now. There’s a simple change to address this issue (by adjusting baseseq), but needs more detail to be provided in the question.