Note: the SQL backend does not matter; any mainstream relational DB is fine (Postgres, MySQL, Oracle, SQL Server).
There is an interesting article from Looker that describes the technique they use to produce correct totals when a JOIN results in a fanout, along the lines of:

In other words, using a hash to remove any potential duplicates (assuming a primary key): SUM(DISTINCT big_unique_number + total) - SUM(DISTINCT big_unique_number)
A good way to simulate the fanout is just doing something like this:
WITH Orders AS (
    SELECT 10293 AS id, 2.5 AS rate
    UNION ALL
    SELECT 210293 AS id, 3.5
),
Other AS (
    SELECT 1
    UNION ALL
    SELECT 2
)
SELECT SUM(rate)
FROM Orders
CROSS JOIN Other
-- Returns 12.0 instead of 6.0
Their example does something like this, which I think is just a long-form way of grabbing md5(PK), with all the fancy footwork to get around the 8-byte integer limitation (so they do a LEFT(...) and then a RIGHT(...)):
(COALESCE(CAST((
    SUM(DISTINCT
        CAST(FLOOR(COALESCE(users.age, 0) * (1000000 * 1.0)) AS DECIMAL(38,0))
        + CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR, users.id)), 15), 16) AS DECIMAL(38,0)) * 1.0e8
        + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR, users.id)), 15), 16) AS DECIMAL(38,0))
    )
    - SUM(DISTINCT
        CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR, users.id)), 15), 16) AS DECIMAL(38,0)) * 1.0e8
        + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR, users.id)), 15), 16) AS DECIMAL(38,0))
    )
) AS DOUBLE PRECISION) / CAST((1000000 * 1.0) AS DOUBLE PRECISION), 0))
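To make the expression easier to follow, here is a sketch of what it computes, written in Python rather than SQL (sqlite lacks MD5/STRTOL). The function names and the in-memory row layout are my own; the hash-splitting, scaling, and subtraction mirror the Looker expression above:

```python
# Hedged sketch: replicate the Looker symmetric-aggregate math in Python.
# md5(pk) is split into two 15-hex-digit halves (to dodge 8-byte integer
# limits), combined into one big per-row number, added to the scaled
# measure, de-duplicated, and subtracted back out.
import hashlib

def big_unique_number(pk):
    h = hashlib.md5(str(pk).encode()).hexdigest()  # 32 hex chars
    left = int(h[:15], 16)    # ~ STRTOL(LEFT(MD5(...), 15), 16)
    right = int(h[-15:], 16)  # ~ STRTOL(RIGHT(MD5(...), 15), 16)
    return left * 10**8 + right

def fanout_safe_sum(rows):
    # rows: iterable of (pk, measure); duplicate rows from a fanout are fine,
    # because the sets below collapse them before summing.
    scaled = {big_unique_number(pk) + int(measure * 1_000_000) for pk, measure in rows}
    hashes = {big_unique_number(pk) for pk, _ in rows}
    return (sum(scaled) - sum(hashes)) / 1_000_000

rows = [(10293, 2.5), (210293, 3.5)] * 2  # simulated 2x fanout
print(fanout_safe_sum(rows))  # 6.0
```

The `* 1_000_000` scaling matches the `1000000*1.0` factor in the SQL: it turns the measure into an integer so it can ride along inside the big hash-derived number without floating-point loss.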
Is there another general-purpose way to do this? Perhaps using a correlated subquery or something else? Or is the above way the best known way to do this?
Two related answers:
Without worrying about a general-purpose hashing function (for example, one that can handle strings), the following works:
WITH Orders AS (
    SELECT 10293 AS id, 2.5 AS rate
    UNION ALL
    SELECT 210293 AS id, 3.5
),
Other AS (
    SELECT 1
    UNION ALL
    SELECT 2
)
SELECT SUM(DISTINCT id + rate) - SUM(DISTINCT id)
FROM Orders
CROSS JOIN Other
-- 6.0
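Both the fanout and the workaround are easy to sanity-check; here is a sketch using Python's sqlite3 purely as a convenient stand-in backend (any mainstream DB would behave the same):

```python
# Reproduce the fanout and the SUM(DISTINCT ...) workaround in-memory.
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

ctes = """
WITH Orders AS (
    SELECT 10293 AS id, 2.5 AS rate
    UNION ALL
    SELECT 210293 AS id, 3.5
),
Other AS (SELECT 1 UNION ALL SELECT 2)
"""

# The fanout: every Orders row appears twice after the CROSS JOIN.
fanned = cur.execute(ctes + "SELECT SUM(rate) FROM Orders CROSS JOIN Other").fetchone()[0]
print(fanned)  # 12.0

# The workaround: DISTINCT collapses the duplicated (id + rate) values
# before summing, then the duplicated ids are subtracted back out.
fixed = cur.execute(
    ctes + "SELECT SUM(DISTINCT id + rate) - SUM(DISTINCT id) "
           "FROM Orders CROSS JOIN Other"
).fetchone()[0]
print(fixed)  # 6.0
```

Note the caveat this simplified form carries: if two different rows happened to produce the same `id + rate` value, DISTINCT would merge them incorrectly, which is presumably why Looker scales the measure and hash apart before adding them.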
But this still raises the question: is there another / better way to do this in a very general-purpose manner?
Answer
A typical example of joins mangling an aggregation is this:
select posts.id,
       count(likes.id) as likes_total,
       count(dislikes.id) as dislikes_total
from posts
left join likes on likes.post_id = posts.id
left join dislikes on dislikes.post_id = posts.id
group by posts.id;
where both counts result in the same number, because each gets multiplied by the other. With 2 likes and 3 dislikes, both counts are 6.
The simple solution is: Aggregate before joining. If you want to know the likes and dislikes counts per post, join the likes and dislikes counts to the posts.
select posts.id, l.likes_total, d.dislikes_total
from posts
left join (
    select post_id, count(*) as likes_total
    from likes
    group by post_id
) l on l.post_id = posts.id
left join (
    select post_id, count(*) as dislikes_total
    from dislikes
    group by post_id
) d on d.post_id = posts.id;
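The difference between the two queries can be demonstrated end to end; this sketch again uses sqlite3 as a stand-in, with a minimal assumed schema (one post, 2 likes, 3 dislikes):

```python
# Show the join fanout doubling/tripling counts, then the aggregate-before-join fix.
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.executescript("""
CREATE TABLE posts    (id INTEGER PRIMARY KEY);
CREATE TABLE likes    (id INTEGER PRIMARY KEY, post_id INTEGER);
CREATE TABLE dislikes (id INTEGER PRIMARY KEY, post_id INTEGER);
INSERT INTO posts VALUES (1);
INSERT INTO likes    (post_id) VALUES (1), (1);
INSERT INTO dislikes (post_id) VALUES (1), (1), (1);
""")

# Broken: 2 likes x 3 dislikes = 6 joined rows, so both counts come out as 6.
broken = cur.execute("""
    SELECT posts.id, COUNT(likes.id), COUNT(dislikes.id)
    FROM posts
    LEFT JOIN likes    ON likes.post_id    = posts.id
    LEFT JOIN dislikes ON dislikes.post_id = posts.id
    GROUP BY posts.id
""").fetchone()
print(broken)  # (1, 6, 6)

# Fixed: aggregate each side first, then join one row per post_id.
fixed = cur.execute("""
    SELECT posts.id, l.likes_total, d.dislikes_total
    FROM posts
    LEFT JOIN (SELECT post_id, COUNT(*) AS likes_total
               FROM likes GROUP BY post_id) l ON l.post_id = posts.id
    LEFT JOIN (SELECT post_id, COUNT(*) AS dislikes_total
               FROM dislikes GROUP BY post_id) d ON d.post_id = posts.id
""").fetchone()
print(fixed)  # (1, 2, 3)
```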
Use COALESCE, if you want to see zeros instead of nulls.
Don’t try to muddle through with tricks. Just aggregate, then join. You can of course replace the joins with lateral joins (which are correlated subqueries), if the DBMS supports them. Or for single aggregates as in the example even move the correlated subqueries to the select clause. That’s mainly personal preference, but depending on the DBMS’s optimizer one solution may be faster than the other. (Ideally the optimizer would come up with the same execution plan for all those queries of course.)
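For completeness, here is the "correlated subqueries in the select clause" variant mentioned above, again sketched against the assumed posts/likes/dislikes schema via sqlite3 (a convenient bonus of this form is that COUNT(*) in a subquery yields 0 rather than null when there are no matches):

```python
# Same result via correlated scalar subqueries in the SELECT list.
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.executescript("""
CREATE TABLE posts    (id INTEGER PRIMARY KEY);
CREATE TABLE likes    (id INTEGER PRIMARY KEY, post_id INTEGER);
CREATE TABLE dislikes (id INTEGER PRIMARY KEY, post_id INTEGER);
INSERT INTO posts VALUES (1);
INSERT INTO likes    (post_id) VALUES (1), (1);
INSERT INTO dislikes (post_id) VALUES (1), (1), (1);
""")

row = cur.execute("""
    SELECT p.id,
           (SELECT COUNT(*) FROM likes    WHERE likes.post_id    = p.id) AS likes_total,
           (SELECT COUNT(*) FROM dislikes WHERE dislikes.post_id = p.id) AS dislikes_total
    FROM posts p
""").fetchone()
print(row)  # (1, 2, 3)
```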