So I have 2 Json arrays that need unnesting, and joining based on a key within the json structure. In theory is easy, but without having a ‘left join unnest’ functionality, it all becomes messy.
I have achieved what I want, by grouping the results; but I also have concerns that it is doing 2 cross joins, effectively generating many thousands of superfluous rows (in a live environment) before filtering them back out again.
Hence, my question here, is really looking for a much more efficient strategy to do the same logic. I’m well aware that my Presto experience & knowledge is n its infancy !
Thanks for any guidance !
Workings:
Basic logic : Each item in the ‘left’ array has an $.id value. For some of the ‘left’ items, there will be a matching right item with $.a.id value
Examples :
- The first SQL & results below show the setup, if not the desired result.
- The second set, shows my current solution.
(1) Raw results of Cross Join
with cte as ( Select 123 as record_id, '[ {"id":"01","key1":["val1"]}, {"id":"02","key1":["val2"]}, {"id":"03","key1":["val3"]} ]' as "left", '[ {"a":{"id":"02","key1":["apples"]}, "b":{"lala":"bananas"}},{"a":{"id":"01","key1":["one"]}, "b":{"lala":"oneone"}} ]' as "right" ) select record_id, l.i as "left", r.i as "right", json_extract(l.i, '$.id') as left_id, json_extract(r.i, '$.a.id') as right_id from cte, unnest(cast (json_parse("left") as array(json))) as l(i), -- left array unnest(cast (json_parse("right") as array(json))) as r(i) -- right array
Output:
record_id | left | right | left_id | right_id |
---|---|---|---|---|
123 | {“id”:”01″,”key1″:[“val1”]} | {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}} | “01” | “02” |
123 | {“id”:”01″,”key1″:[“val1”]} | {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}} | “01” | “01” |
123 | {“id”:”02″,”key1″:[“val2”]} | {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}} | “02” | “02” |
123 | {“id”:”02″,”key1″:[“val2”]} | {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}} | “02” | “01” |
123 | {“id”:”03″,”key1″:[“val3”]} | {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}} | “03” | “02” |
123 | {“id”:”03″,”key1″:[“val3”]} | {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}} | “03” | “01” |
(2) Current Solution
select record_id, l.i as "left", max( if(json_extract(l.i, '$.id') = json_extract(r.i, '$.a.id'),json_format(r.i),null) )as match from cte, unnest(cast (json_parse("left") as array(json))) as l(i), -- left array unnest(cast (json_parse("right") as array(json))) as r(i) -- right array group by record_id, l.i
record_id | left | match |
---|---|---|
123 | {“id”:”01″,”key1″:[“val1”]} | {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}} |
123 | {“id”:”02″,”key1″:[“val2”]} | {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}} |
123 | {“id”:”03″,”key1″:[“val3”]} |
Advertisement
Answer
Unnest both arrays in CTEs and left join CTEs, in this case you will eliminate cross join, but the code is a bit longer:
with cte as ( Select 123 as record_id, '[ {"id":"01","key1":["val1"]}, {"id":"02","key1":["val2"]}, {"id":"03","key1":["val3"]} ]' as "left", '[ {"a":{"id":"02","key1":["apples"]}, "b":{"lala":"bananas"}},{"a":{"id":"01","key1":["one"]}, "b":{"lala":"oneone"}} ]' as "right" ), "left" as ( select record_id, l.i as "left", json_extract(l.i, '$.id') as left_id from cte, unnest(cast (json_parse("left") as array(json))) as l(i) -- left array ), "right" as ( select record_id, r.i as "right", json_extract(r.i, '$.a.id') as right_id from cte, unnest(cast (json_parse("right") as array(json))) as r(i) -- right array ) select l.record_id, l."left", r."right", l.left_id, r.right_id from "left" l left join "right" r on l.record_id=r.record_id and l.left_id=r.right_id
Result:
record_id | left | right | left_id | right_id |
---|---|---|---|---|
123 | {“id”:”01″,”key1″:[“val1”]} | {“a”:{“id”:”01″,”key1″:[“one”]},”b”:{“lala”:”oneone”}} | “01” | “01” |
123 | {“id”:”02″,”key1″:[“val2”]} | {“a”:{“id”:”02″,”key1″:[“apples”]},”b”:{“lala”:”bananas”}} | “02” | “02” |
123 | {“id”:”03″,”key1″:[“val3”]} | N | “03” | N |