Skip to content
Advertisement

Athena/Presto : Unnest 2 arrays with left join

So I have 2 Json arrays that need unnesting, and joining based on a key within the json structure. In theory is easy, but without having a ‘left join unnest’ functionality, it all becomes messy.

I have achieved what I want, by grouping the results; but I also have concerns that it is doing 2 cross joins, effectively generating many thousands of superfluous rows (in a live environment) before filtering them back out again.

Hence, my question here, is really looking for a much more efficient strategy to do the same logic. I’m well aware that my Presto experience & knowledge is n its infancy !

Thanks for any guidance !

Workings:

Basic logic : Each item in the ‘left’ array has an $.id value. For some of the ‘left’ items, there will be a matching right item with $.a.id value

Examples :

  1. The first SQL & results below show the setup, if not the desired result.
  2. The second set, shows my current solution.

(1) Raw results of Cross Join

with cte as (
    Select 
        123 as record_id,
        '[ {"id":"01","key1":["val1"]}, {"id":"02","key1":["val2"]}, {"id":"03","key1":["val3"]} ]' as "left",
        '[ {"a":{"id":"02","key1":["apples"]}, "b":{"lala":"bananas"}},{"a":{"id":"01","key1":["one"]}, "b":{"lala":"oneone"}} ]' as "right"
)
select 
    record_id,
    l.i as "left",
    r.i as "right",
    json_extract(l.i, '$.id') as left_id,
    json_extract(r.i, '$.a.id') as right_id
from 
    cte,
    unnest(cast (json_parse("left") as array(json))) as l(i),    -- left array
    unnest(cast (json_parse("right") as array(json))) as r(i)    -- right array

Output:

record_id left right left_id right_id
123 {“id”:”01″,”key1″:[“val1”]} {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}} “01” “02”
123 {“id”:”01″,”key1″:[“val1”]} {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}} “01” “01”
123 {“id”:”02″,”key1″:[“val2”]} {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}} “02” “02”
123 {“id”:”02″,”key1″:[“val2”]} {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}} “02” “01”
123 {“id”:”03″,”key1″:[“val3”]} {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}} “03” “02”
123 {“id”:”03″,”key1″:[“val3”]} {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}} “03” “01”

(2) Current Solution

select 
  record_id,
  l.i as "left",
  max( if(json_extract(l.i, '$.id') = json_extract(r.i, '$.a.id'),json_format(r.i),null) )as match
from 
  cte,
   unnest(cast (json_parse("left") as array(json))) as l(i),    -- left array
   unnest(cast (json_parse("right") as array(json))) as r(i)    -- right array
group by 
  record_id,
  l.i 
record_id left match
123 {“id”:”01″,”key1″:[“val1”]} {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}}
123 {“id”:”02″,”key1″:[“val2”]} {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}}
123 {“id”:”03″,”key1″:[“val3”]}

Advertisement

Answer

Unnest both arrays in CTEs and left join CTEs, in this case you will eliminate cross join, but the code is a bit longer:

with cte as (
    Select 
        123 as record_id,
        '[ {"id":"01","key1":["val1"]}, {"id":"02","key1":["val2"]}, {"id":"03","key1":["val3"]} ]' as "left",
        '[ {"a":{"id":"02","key1":["apples"]}, "b":{"lala":"bananas"}},{"a":{"id":"01","key1":["one"]}, "b":{"lala":"oneone"}} ]' as "right"
), 

"left" as (
select 
    record_id,
    l.i as "left",
    json_extract(l.i, '$.id') as left_id
from 
    cte,
    unnest(cast (json_parse("left") as array(json))) as l(i)    -- left array
), 

"right" as (
  select 
    record_id,
    r.i as "right",
    json_extract(r.i, '$.a.id') as right_id
from 
    cte,
    unnest(cast (json_parse("right") as array(json))) as r(i)    -- right array
)

select 
    l.record_id,
    l."left",
    r."right",
    l.left_id,
    r.right_id
from 
    "left" l left join "right" r on l.record_id=r.record_id and l.left_id=r.right_id

Result:

record_id left right left_id right_id
123 {“id”:”01″,”key1″:[“val1”]} {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}} “01” “01”
123 {“id”:”02″,”key1″:[“val2”]} {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}} “02” “02”
123 {“id”:”03″,”key1″:[“val3”]} N “03” N
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement