Skip to content
Advertisement

Should I always prefer EXISTS over COUNT() > 0 in SQL?

I often encounter the advice that, when checking for the existence of any rows from a (sub)query, one should use EXISTS instead of COUNT(*) > 0, for reasons of performance. Specifically, the former can short-circuit and return TRUE (or FALSE in the case of NOT EXISTS) after finding a single row, while COUNT needs to actually evaluate each row in order to return a number, only to be compared to zero.

This all makes perfect sense to me in simple cases. However, I recently ran into a problem where I needed to filter groups in the HAVING clause of a GROUP BY, based on whether all values in a certain column of the group were NULL.

For the sake of clarity, let’s see an example. Let’s say I have the following schema:

I.e. each user (table not shown for brevity) has 0 or more profiles. Each profile is either a Google or a Facebook account. (This is the translation of subclasses or a sum type with some associated data — in my real schema, the account IDs are also foreign keys to different tables holding that associated data, but this is not relevant to my question.)

Now, say I wanted to count the Facebook profiles for all users who do NOT have any Google profiles.

At first, I wrote the following query using COUNT() = 0:

But then it occurred to me that the condition in the HAVING clause is actually just an existence check. So I then re-wrote the query using a subquery and NOT EXISTS:

My question is two-fold:

  1. Should I keep the second, re-formulated query, and use NOT EXISTS with a subquery instead of COUNT() = 0? Is this really more efficient? I reckon that the index lookup due to the WHERE p.user_id = q.user_id condition has some additional cost. Whether this additional cost is absorbed by the short-circuiting behavior of EXISTS could as well depend on the average cardinality of the groups, could it not?

    Or could the DBMS perhaps be smart enough to recognize the fact that the grouping key is being compared against, and optimize this subquery away completely, by replacing it with the current group (instead of actually performing an index lookup for each group)? I seriously doubt that a DBMS could optimize away this subquery, while failing to optimize COUNT() = 0 into NOT EXISTS.

  2. Efficiency aside, the second query seems significantly more convoluted and less obviously correct to me, so I’d be reluctant to use it even if it happened to be faster. What do you think, is there a better way? Could I have my cake and eat it too, by using NOT EXISTS in a simpler manner, for instance by directly referencing the current group from within the HAVING clause?

Advertisement

Answer

You should prefer EXISTS/NOT EXISTS over COUNT() in a subquery. So instead of:

You should instead use:

The reasoning for this is that the subquery can stop processing at the first match.

This reasoning doesn’t apply in a HAVING clause after an aggregation — all the rows have to be generated anyway so there is little value in stopping at the first match.

However, aggregation might not be necessary if you have a users table and don’t really need the facebook count. You could use:

Also, the aggregation might be faster if you filter before the aggregation:

Whether it actually is faster depends on a number of factors, including the number of rows that are actually filtered out.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement