I want to join various nodes of a tree, making sure the the returned root-to-leaf path is temporally valid. The tricky part is that the data source is dated with validity from-to dates.
ID | NVALUE | VFROM | VTO |
---|---|---|---|
1 | A | 2021-01-01 | 2021-01-31 |
1 | B | 2021-02-01 | 2021-02-28 |
2 | C | 2021-01-01 | 2021-02-28 |
3 | D | 2021-01-01 | 2021-01-31 |
3 | E | 2021-02-01 | 2021-02-28 |
the links are trivially pointing to the node ids (but not their dates!)
LINK_CHILD | LINK_PARENT |
---|---|
1 | 2 |
2 | 3 |
from this I would like to return the valid paths and their validity dates:
A-C-D
valid from2021-01-01
to2021-01-31
B-C-E
valid from2021-02-01
to2021-02-28
invalid paths (e.g. A-C-E
should not be returned, since there is no moment in time in which all the three nodes are valid).
The issue I have with this is that the “overlap” check is not transitive (so A overlaps with B and B overlaps with C does not imply that A overlaps with C). So when writing the connect by
query each level overlaps with the next, but the resulting global path is invalid.
the basic query set up I have is
with src_nodes (id, nvalue, vfrom, vto) as ( select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all select 3, 'E', date '2021-02-01', date '2021-02-28' from dual ), src_links(link_child, link_parent) as ( select 1, 2 from dual union all select 2, 3 from dual ), full_links as ( select c.* from src_links c union select null, link_child from src_links a where not exists(select null from src_links b where b.link_parent = a.link_child) ), nodes_and_links as ( select * from full_links a join src_nodes n on n.id = a.link_parent) select * from nodes_and_links nl start with nl.link_child is null connect by prior nl.link_parent = nl.link_child and greatest(prior nl.vfrom, nl.vfrom) < least(prior nl.vto, nl.vto)
Advertisement
Answer
I believe this is the most efficient way. One recursive with
query and another trivial query to get the leaves.
here’s an example with a more complex data source:
with src_nodes as ( select 1 id, 'A' nvalue, date '2021-01-01' vfrom, date '2021-02-10' vto from dual union all select 1, 'B', date '2021-02-15', date '2021-02-28' from dual union all select 2, 'C', date '2021-01-01', date '2021-01-31' from dual union all select 2, 'D', date '2021-02-01', date '2021-02-28' from dual union all select 3, 'E', date '2021-01-01', date '2021-02-28' from dual union all select 4, 'F', date '2021-01-01', date '2021-01-31' from dual union all select 4, 'G', date '2021-02-01', date '2021-02-28' from dual union all select 5, 'H', date '2021-02-01', date '2021-02-28' from dual union all select 6, 'I', date '2021-02-10', date '2021-02-28' from dual ), src_links as ( select 1 link_child, 2 link_parent from dual union all select 2, 3 from dual union all select 3, 4 from dual union all select 5, 6 from dual ), -- use "recursive with" method instead of "connect by" to be able to -- refine the validity dates as we walk the tree hier (id, vfrom, vto, nvalue, lvl, root_id, tpath) as ( select sn.id, sn.vfrom, sn.vto, sn.nvalue, 1 lvl, sn.id, sn.nvalue || '' from src_nodes sn where -- start with nodes that have no incoming parent link exists(select null from src_links a where a.link_child = sn.id) and not exists(select null from src_links a where a.link_parent = sn.id) union all select sn.id, greatest(sn.vfrom, hier.vfrom), least(sn.vto, hier.vto), sn.nvalue, hier.lvl + 1 lvl, hier.root_id, hier.tpath || '-' || sn.nvalue from hier join src_links ln on ln.link_child = hier.id join src_nodes sn on sn.id = ln.link_parent -- and greatest(sn.vfrom, hier.vfrom) < least(sn.vto, hier.vto) ) -- use "depth first" to be able to detect leaf nodes search depth first by id set seq, hier_leaves as ( select * from ( select a.*, -- a difference of one means it's a normal 'depth first' step. otherwise it's a leaf (case lead(a.lvl) over (order by a.seq) - a.lvl when 1 then 'inner' else 'leaf' end) path_type from hier a) where path_type = 'leaf') select hl.tpath, hl.vfrom, hl.vto from hier_leaves hl;
I have now tested this approach against our data which have 300K nodes and 240K links, and the trees (plus some additional pivoting) are parsed in 6 seconds. Similar work was done in 10 minutes by the ETLs.