Skip to content
Advertisement

Create hierarchical summary of XML nodes

Is there a way of using SQL or Python (or another language) to return the node-tree for an XML document?

In the XML document example below, the employee node appears twice but the second time it has more nodes within it… same with the address node. Is there a way to return the structure (without values contained within the tags) so that you can see all nodes that exist within another node? Sort of like creating a pivot table of all the field names in an excel document?

<dept>
   <dept_id>1</dept_id>
   <dept_name>Marketing</dept_name>
    <employee>
        <emp_id>1</emp_id>
        <emp_fname>Chandler</emp_fname>
        <address>
            <street_1>12345 Central Park Ave.</street_1>
            <street_2></street_2>
            <city>Hazel Crest</city>
            <state>IL</state>
            <zipcode>60429</zipcode>
        </address>
    </employee>
    <employee>
        <emp_id>2</emp_id>
        <emp_fname>Joey</emp_fname>
        <emp_lname>Tribbiani</emp_lname>
        <emp_title>Manager</emp_title>
        <address>
            <street_1>2312 28th Rd. S.</street_1>
            <street_2></street_2>
            <city>Arlington</city>
            <state>VA</state>
            <zipcode>22206</zipcode>
            <start_date>2000-03-01</start_date>
            <end_date>2012-12-31</end_date>
        </address>
    </employee>
</dept>

A useful output would be something like this, but could come in whatever form is easiest to generate:

<dept>
   <dept_id>
   <dept_name>
      <employee>
         <emp_id>
         <emp_fname>
         <emp_lname>
         <emp_title>
         <address>
            <street_1>
            <street_2>
            <city>
            <state>
            <zipcode>
            <start_date>
            <end_date>

Thank you!

Advertisement

Answer

select 
    replace(trim(regexp_replace(f.path, '['\[\]@]*' ,''), '$'), '$', ',') as _order
    ,'<'|| f.value::text || '>' as name
from data, table(flatten(input=>xml, recursive=>true)) f
where f.key = '@'
order by _order;

gives:

_ORDER NAME
null <dept>
0 <dept_id>
1 <dept_name>
2 <employee>
2,0 <emp_id>
2,1 <emp_fname>
2,2 <address>
2,2,0 <street_1>
2,2,1 <street_2>
2,2,2 <city>
2,2,3 <state>
2,2,4 <zipcode>
3 <employee>
3,0 <emp_id>
3,1 <emp_fname>
3,2 <emp_lname>
3,3 <emp_title>
3,4 <address>
3,4,0 <street_1>
3,4,1 <street_2>
3,4,2 <city>
3,4,3 <state>
3,4,4 <zipcode>
3,4,5 <start_date>
3,4,6 <end_date>

and

select 
    REPEAT(' ', regexp_count(_order, '($[0-9]+)*')-1) || name as tree
from (
    select 
        regexp_replace(f.path, '['\[\]@]*' ,'') as _order
        ,'<'|| f.value::text || '>' as name
    from data, table(flatten(input=>xml, recursive=>true)) f
    where f.key = '@'
)
order by _order
;

gives:

TREE
<dept>
      <dept_id>
      <dept_name>
      <employee>
            <emp_id>
            <emp_fname>
            <address>
                  <street_1>
                  <street_2>
                  <city>
                  <state>
                  <zipcode>
      <employee>
            <emp_id>
            <emp_fname>
            <emp_lname>
            <emp_title>
            <address>
                  <street_1>
                  <street_2>
                  <city>
                  <state>
                  <zipcode>
                  <start_date>
                  <end_date>
Advertisement