How to replace exist in Hive with two correlated subqueries

Question

I have a query that looks like this I researched and read that in Hive IN or EXIST are not supported statements. I read that a workaround for this would be to use a LEFT JOIN. I have tried this but I am having trouble with the GROUP BY u.id. I read that this needs to be paired always with

Accepted Answer

Like you said, you can convert them to left join or may be left join since they uses exists in both subquery.  Simply convert your subqueries to inline view and join them with original tables.SELECT u.id, COUNT(*)FROM users uinner join  posts p on u.id = p.owneruserid left outer join (SELECT COUNT(*) as num, pl.postid postid              FROM postlinks pl               GROUP BY pl.postid              HAVING num > 1) pl ON pl.postid = p.id --correlated subquery 1 with left join left outer join (SELECT postid FROM comments c GROUP BY postid)c ON c.postid = p.id  --correlated subquery 2  with left join WHERE ( c.postid is not null AND pl.postid is not null)  -- this ensure data exists in both subquery GROUP BY u.idWith left join, there may be chance of duplicates, you can use group by in subqry2 to avoid it.

Advertisement

Answer