Skip to content
Advertisement

Impala SQL LEFT ANTI JOIN

Goal is to find the empid’s for a given timerange that are present in LEFT table but not in RIGHT table. I have the following two Impala queries which I ran and got different results?

QUERY 1: select count(dbonetable.empid), COUNT(DISTINCT dbtwotable.empid) from 
(select distinct dbonetable.empid 
from dbonedbtable dbonetable
WHERE (dbonetable.expiration_dt >= '2009-01-01' OR dbonetable.expiration_dt IS NULL) AND dbonetable.effective_dt <= '2019-01-01' AND dbonetable.empid IS NOT NULL) dbonetable
LEFT join dbtwodbtable dbtwotable ON dbonetable.empid = dbtwotable.empid

--43324489  43270569


QUERY 2: select count(*) from (
select distinct dbonetable.empid from dbonedbtable dbonetable
LEFT ANTI join dbtwodbtable dbtwotable ON dbonetable.empid = dbtwotable.empid
AND (dbonetable.expiration_dt >= '2009-01-01' OR dbonetable.expiration_dt IS NULL) AND dbonetable.effective_dt <= '2019-01-01' AND dbonetable.empid IS NOT NULL) tab
--19088973

--For LEFT ANTI JOIN, this clause returns those values from the left-hand table that have no matching value in the right-hand table.

To explain the Context, Query 2: Trying to find all the empid’s that are in dbonetable and are not in dbtwotable using LEFT ANTI JOIN which I learned from here: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_joins.html –For LEFT ANTI JOIN, this clause returns those values from the left-hand table that have no matching value in the right-hand table.

And in Query 1: The dbOnetable calculated based on where clause and results from it are LEFT OUTER joined with dbtwotable, And on top of that result, I am doing a count(dbonetable.empid) and COUNT(DISTINCT dbtwotable.empid) which gave me a result as –43324489 43270569, which means 53,920. My question either my Query 1 result should be 43324489 -43270569 = 53,920 or my Query 2 Result should be 19088973.

what could be missing here, is my Query 1 is incorrect? Or is my LEFT ANTI JOIN is misleading? Thank you all in Advance.

Advertisement

Answer

It’s different because you forgot specifying “where dbtwotable.empid is null” in the query 1

Additionally, your query 2 is logically different from query 1 because in query 1, you join only on equivalence of empid1 and empid2, while in query 2 your join has much more conditions, so the tables have much fewer common entries compared to query 1, and as a result, the final count is much larger.
If you make a join condition in query 2 the same as in query 1 and put everything else into where clause, you will get the same count that you got in query 1 (updated) which is 53920. That’s the count you need

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement