Skip to content
Advertisement

How to replace exist in Hive with two correlated subqueries

I have a query that looks like this

SELECT u.id, COUNT(*)
FROM users u, posts p
WHERE u.id = p.owneruserid 
  AND EXISTS (SELECT COUNT(*) as num
              FROM postlinks pl
              WHERE pl.postid = p.id
              GROUP BY pl.id
              HAVING num > 1) --correlated subquery 1
  AND EXISTS (SELECT *
              FROM comments c
              WHERE c.postid = p.id); --correlated subquery 2
GROUP BY u.id

I researched and read that in Hive IN or EXIST are not supported statements. I read that a workaround for this would be to use a LEFT JOIN. I have tried this but I am having trouble with the GROUP BY u.id. I read that this needs to be paired always with an aggregation function like COUNT() however I’m not sure how I can rewrite this query so as to get it to work. All the other examples I have seen online do not seem to be as complicated as this one.

Advertisement

Answer

Like you said, you can convert them to left join or may be left join since they uses exists in both subquery. Simply convert your subqueries to inline view and join them with original tables.

SELECT u.id, COUNT(*)
FROM users u
inner join  posts p on u.id = p.owneruserid 
left outer join (SELECT COUNT(*) as num, pl.postid postid
              FROM postlinks pl 
              GROUP BY pl.postid
              HAVING num > 1) pl ON pl.postid = p.id --correlated subquery 1 with left join 
left outer join (SELECT postid FROM comments c GROUP BY postid)c ON c.postid = p.id  --correlated subquery 2  with left join 
WHERE ( c.postid is not null AND pl.postid is not null)  -- this ensure data exists in both subquery 
GROUP BY u.id

With left join, there may be chance of duplicates, you can use group by in subqry2 to avoid it.

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement