I have a query that looks like this
SELECT u.id, COUNT(*) FROM users u, posts p WHERE u.id = p.owneruserid AND EXISTS (SELECT COUNT(*) as num FROM postlinks pl WHERE pl.postid = p.id GROUP BY pl.id HAVING num > 1) --correlated subquery 1 AND EXISTS (SELECT * FROM comments c WHERE c.postid = p.id); --correlated subquery 2 GROUP BY u.id
I researched and read that in Hive IN
or EXIST
are not supported statements. I read that a workaround for this would be to use a LEFT JOIN
. I have tried this but I am having trouble with the GROUP BY u.id
. I read that this needs to be paired always with an aggregation function like COUNT()
however I’m not sure how I can rewrite this query so as to get it to work. All the other examples I have seen online do not seem to be as complicated as this one.
Advertisement
Answer
Like you said, you can convert them to left join
or may be left join
since they uses exists
in both subquery. Simply convert your subqueries to inline view and join them with original tables.
SELECT u.id, COUNT(*) FROM users u inner join posts p on u.id = p.owneruserid left outer join (SELECT COUNT(*) as num, pl.postid postid FROM postlinks pl GROUP BY pl.postid HAVING num > 1) pl ON pl.postid = p.id --correlated subquery 1 with left join left outer join (SELECT postid FROM comments c GROUP BY postid)c ON c.postid = p.id --correlated subquery 2 with left join WHERE ( c.postid is not null AND pl.postid is not null) -- this ensure data exists in both subquery GROUP BY u.id
With left join, there may be chance of duplicates, you can use group by in subqry2 to avoid it.