I often encounter the advice that, when checking for the existence of any rows from a (sub)query, one should use EXISTS
instead of COUNT(*) > 0
, for reasons of performance. Specifically, the former can short-circuit and return TRUE
(or FALSE
in the case of NOT EXISTS
) after finding a single row, while COUNT
needs to actually evaluate each row in order to return a number, only to be compared to zero.
This all makes perfect sense to me in simple cases. However, I recently ran into a problem where I needed to filter groups in the HAVING
clause of a GROUP BY
, based on whether all values in a certain column of the group were NULL
.
For the sake of clarity, let’s see an example. Let’s say I have the following schema:
CREATE TABLE profile( id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, google_account_id INTEGER NULL, facebook_account_id INTEGER NULL, FOREIGN KEY (user_id) REFERENCES user(id), CHECK( (google_account_id IS NOT NULL) + (facebook_account_id IS NOT NULL) = 1 ) )
I.e. each user (table not shown for brevity) has 0 or more profiles. Each profile is either a Google or a Facebook account. (This is the translation of subclasses or a sum type with some associated data — in my real schema, the account IDs are also foreign keys to different tables holding that associated data, but this is not relevant to my question.)
Now, say I wanted to count the Facebook profiles for all users who do NOT have any Google profiles.
At first, I wrote the following query using COUNT() = 0
:
SELECT user_id, COUNT(facebook_account_id) FROM profile GROUP BY user_id HAVING COUNT(google_account_id) = 0;
But then it occurred to me that the condition in the HAVING
clause is actually just an existence check. So I then re-wrote the query using a subquery and NOT EXISTS
:
SELECT user_id, COUNT(facebook_account_id) FROM profile AS p GROUP BY user_id HAVING NOT EXISTS ( SELECT 1 FROM profile AS q WHERE p.user_id = q.user_id AND q.google_id IS NOT NULL )
My question is two-fold:
Should I keep the second, re-formulated query, and use
NOT EXISTS
with a subquery instead ofCOUNT() = 0
? Is this really more efficient? I reckon that the index lookup due to theWHERE p.user_id = q.user_id
condition has some additional cost. Whether this additional cost is absorbed by the short-circuiting behavior ofEXISTS
could as well depend on the average cardinality of the groups, could it not?Or could the DBMS perhaps be smart enough to recognize the fact that the grouping key is being compared against, and optimize this subquery away completely, by replacing it with the current group (instead of actually performing an index lookup for each group)? I seriously doubt that a DBMS could optimize away this subquery, while failing to optimize
COUNT() = 0
intoNOT EXISTS
.Efficiency aside, the second query seems significantly more convoluted and less obviously correct to me, so I’d be reluctant to use it even if it happened to be faster. What do you think, is there a better way? Could I have my cake and eat it too, by using
NOT EXISTS
in a simpler manner, for instance by directly referencing the current group from within the HAVING clause?
Advertisement
Answer
You should prefer EXISTS
/NOT EXISTS
over COUNT()
in a subquery. So instead of:
select t.* from t where (select count(*) from z where z.x = t.x) > 0
You should instead use:
select t.* from t where exists (select 1 from z where z.x = t.x)
The reasoning for this is that the subquery can stop processing at the first match.
This reasoning doesn’t apply in a HAVING
clause after an aggregation — all the rows have to be generated anyway so there is little value in stopping at the first match.
However, aggregation might not be necessary if you have a users
table and don’t really need the facebook count. You could use:
select u.* from users u where not exists (select 1 from profiles p where p.user_id = u.user_id and p.google_id is not null );
Also, the aggregation might be faster if you filter before the aggregation:
SELECT user_id, COUNT(facebook_account_id) FROM profile AS p WHERE NOT EXISTS ( SELECT 1 FROM profile p2 WHERE p2.user_id = p.user_id AND p2.google_id IS NOT NULL ) GROUP BY user_id;
Whether it actually is faster depends on a number of factors, including the number of rows that are actually filtered out.