Skip to content
Advertisement

temporal joins in hierarchical query

I want to join various nodes of a tree, making sure the the returned root-to-leaf path is temporally valid. The tricky part is that the data source is dated with validity from-to dates.

ID NVALUE VFROM VTO
1 A 2021-01-01 2021-01-31
1 B 2021-02-01 2021-02-28
2 C 2021-01-01 2021-02-28
3 D 2021-01-01 2021-01-31
3 E 2021-02-01 2021-02-28

the links are trivially pointing to the node ids (but not their dates!)

LINK_CHILD LINK_PARENT
1 2
2 3

from this I would like to return the valid paths and their validity dates:

  1. A-C-D valid from 2021-01-01 to 2021-01-31
  2. B-C-E valid from 2021-02-01 to 2021-02-28

invalid paths (e.g. A-C-E should not be returned, since there is no moment in time in which all the three nodes are valid).

The issue I have with this is that the “overlap” check is not transitive (so A overlaps with B and B overlaps with C does not imply that A overlaps with C). So when writing the connect by query each level overlaps with the next, but the resulting global path is invalid.

the basic query set up I have is

with src_nodes (id, nvalue, vfrom, vto) as (
    select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all
    select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all
    select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all
    select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all
    select 3, 'E', date '2021-02-01', date '2021-02-28'
    from dual
),
     src_links(link_child, link_parent) as (
         select 1, 2 from dual union all
         select 2, 3 from dual
     ),
     full_links as (
         select c.*
         from src_links c
         union
         select null, link_child
         from src_links a
         where not exists(select null from src_links b where b.link_parent = a.link_child)
     ),
     nodes_and_links as (
         select *
         from full_links a
                  join src_nodes n on n.id = a.link_parent)
select *
from nodes_and_links nl
start with nl.link_child is null
connect by prior nl.link_parent = nl.link_child and
           greatest(prior nl.vfrom, nl.vfrom) < 
           least(prior nl.vto, nl.vto)

Advertisement

Answer

I believe this is the most efficient way. One recursive with query and another trivial query to get the leaves.

here’s an example with a more complex data source:

dbfiddle

with src_nodes as (
    select 1 id, 'A' nvalue, date '2021-01-01' vfrom, date '2021-02-10' vto
    from dual
    union all
    select 1, 'B', date '2021-02-15', date '2021-02-28'
    from dual
    union all
    select 2, 'C', date '2021-01-01', date '2021-01-31'
    from dual
    union all
    select 2, 'D', date '2021-02-01', date '2021-02-28'
    from dual
    union all
    select 3, 'E', date '2021-01-01', date '2021-02-28'
    from dual
    union all
    select 4, 'F', date '2021-01-01', date '2021-01-31'
    from dual
    union all
    select 4, 'G', date '2021-02-01', date '2021-02-28'
    from dual
    union all
    select 5, 'H', date '2021-02-01', date '2021-02-28'
    from dual
    union all
    select 6, 'I', date '2021-02-10', date '2021-02-28'
    from dual

),
     src_links as (
         select 1 link_child, 2 link_parent
         from dual
         union all
         select 2, 3
         from dual
         union all
         select 3, 4
         from dual
         union all
         select 5, 6
         from dual
     ),
     -- use "recursive with" method instead of "connect by" to be able to
     -- refine the validity dates as we walk the tree
     hier (id, vfrom, vto, nvalue, lvl, root_id, tpath) as (
         select sn.id, sn.vfrom, sn.vto, sn.nvalue, 1 lvl, sn.id, sn.nvalue || ''
         from src_nodes sn
         where -- start with nodes that have no incoming parent link
             exists(select null from src_links a where a.link_child = sn.id)
           and not exists(select null from src_links a where a.link_parent = sn.id)
         union all
         select sn.id,
                greatest(sn.vfrom, hier.vfrom),
                least(sn.vto, hier.vto),
                sn.nvalue,
                hier.lvl + 1 lvl,
                hier.root_id,
                hier.tpath || '-' || sn.nvalue
         from hier
                  join src_links ln on ln.link_child = hier.id
                  join src_nodes sn on sn.id = ln.link_parent --
             and greatest(sn.vfrom, hier.vfrom) < least(sn.vto, hier.vto)
     ) -- use "depth first" to be able to detect leaf nodes
         search depth first by id set seq,
     hier_leaves as (
         select *
         from (
                  select a.*,
                         -- a difference of one means it's a normal 'depth first' step. otherwise it's a leaf
                         (case lead(a.lvl) over (order by a.seq) - a.lvl
                              when 1 then 'inner'
                              else 'leaf' end) path_type
                  from hier a)
         where path_type = 'leaf')
select hl.tpath, hl.vfrom, hl.vto
from hier_leaves hl;

I have now tested this approach against our data which have 300K nodes and 240K links, and the trees (plus some additional pivoting) are parsed in 6 seconds. Similar work was done in 10 minutes by the ETLs.

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement