Ensuring no dupe ids in query return

Question

For the following schema: You can see that there are 3 activities and each contact has two. What I am searching for is the number of activities per account in the previous two months. So I have the following query: This returns: However, this is incorrect. There are only 3 activities; it's just that each contact sees activity 1.

Accepted Answer

Use count(DISTINCT activity_id) to fold duplicates in the count, like Edouard suggested. But there is more:

    SELECT con.account_id AS accountid
         , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '1 mon')
                                                   AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP)) AS last_month
         , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                                                   AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP - interval '1 mon')) AS prev_month
    FROM   activity act
    JOIN   activity_contact aco ON aco.activity_id = act.id
                               AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                               AND act.start_date <  date_trunc('month', LOCALTIMESTAMP)
    RIGHT  JOIN contact con ON con.id = aco.contact_id
    -- JOIN account acc ON con.account_id = acc.id  -- noise
    GROUP  BY 1;

Most importantly, add an outer WHERE clause to the query to filter irrelevant rows early. This can make a big difference for a small selection from a big table. Here, we have to move that predicate to the JOIN clause, lest we exclude accounts with no activity. (LEFT JOIN and RIGHT JOIN can both be used, mirroring each other.) See: "Postgres Left Join with where condition" and "Explain JOIN vs. LEFT JOIN and WHERE condition performance suggestion in more detail".

Make that filter "sargable", so it can use an index on (start_date) (unlike your original formulation). Again, big impact for a small selection from a big table.

Use the same expressions for your aggregate FILTER clauses. Lesser effect, but take it.

Unlike other aggregate functions, count() returns 0 (not NULL) for "no rows", so we don't have to do anything extra.

Assuming referential integrity (enforced with a FK constraint), the join to table account is just expensive noise. Drop it.

CURRENT_DATE is not wrong. But since your expressions yield timestamp anyway, it's a bit more efficient to use LOCALTIMESTAMP to begin with.

Compare with your original to see that this is quite a bit faster.

And I assume you are aware that this query introduces a dependency on the TimeZone setting of the executing session. The current date depends on where in the world we ask. See: "Ignoring time zones altogether in Rails and PostgreSQL".

If you are not bound to this particular output format, a pivoted form is simpler, now that we filter rows early:

    SELECT con.account_id AS accountid
         , date_trunc('month', act.start_date) AS mon
         , count(DISTINCT aco.activity_id) AS dist_count
    FROM   activity act
    JOIN   activity_contact aco ON aco.activity_id = act.id
                               AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                               AND act.start_date <  date_trunc('month', LOCALTIMESTAMP)
    RIGHT  JOIN contact con ON con.id = aco.contact_id
    GROUP  BY 1, 2
    ORDER  BY 1, 2 DESC;

Again, we can include accounts without activity. But months without activity do not show up …
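To see the two main effects in isolation (duplicate folding with count(DISTINCT ...), and a range predicate in the JOIN clause versus in WHERE), here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the tiny tables only mimic the shape described in the question, the account numbers 100 and 200 and all dates are made up, SQLite stands in for PostgreSQL, and fixed date literals replace the date_trunc() expressions. It uses LEFT JOIN instead of RIGHT JOIN, which is the mirrored form.

```python
import sqlite3

# Hypothetical miniature of the question's data: 3 activities, two contacts on
# account 100 that share activity 1, and account 200 with no activity at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (id INTEGER PRIMARY KEY, account_id INTEGER);
    CREATE TABLE activity (id INTEGER PRIMARY KEY, start_date TEXT);
    CREATE TABLE activity_contact (activity_id INTEGER, contact_id INTEGER);

    INSERT INTO contact VALUES (1, 100), (2, 100), (3, 200);
    INSERT INTO activity VALUES (1, '2024-05-10'), (2, '2024-05-12'), (3, '2024-05-20');
    INSERT INTO activity_contact VALUES (1, 1), (2, 1), (1, 2), (3, 2);
""")

# Plain count() counts join rows, so the shared activity 1 is counted twice.
# count(DISTINCT ...) folds the duplicates; both return 0 for account 200.
naive_vs_distinct = conn.execute("""
    SELECT con.account_id
         , count(aco.activity_id)          AS naive_count
         , count(DISTINCT aco.activity_id) AS dist_count
    FROM   contact con
    LEFT   JOIN activity_contact aco ON aco.contact_id = con.id
    GROUP  BY con.account_id
    ORDER  BY con.account_id
""").fetchall()

# Range predicate in the JOIN clause: accounts without matching activity
# survive the outer join and get a count of 0.
# (ISO-8601 date strings compare correctly as text in SQLite.)
filtered_in_join = conn.execute("""
    SELECT con.account_id, count(DISTINCT act.id)
    FROM   contact con
    LEFT   JOIN activity_contact aco ON aco.contact_id = con.id
    LEFT   JOIN activity act ON act.id = aco.activity_id
                            AND act.start_date >= '2024-05-15'
                            AND act.start_date <  '2024-06-01'
    GROUP  BY con.account_id
    ORDER  BY con.account_id
""").fetchall()

# Same predicate in WHERE: rows with NULL start_date are discarded after the
# join, so the account without activity disappears from the result.
filtered_in_where = conn.execute("""
    SELECT con.account_id, count(DISTINCT act.id)
    FROM   contact con
    LEFT   JOIN activity_contact aco ON aco.contact_id = con.id
    LEFT   JOIN activity act ON act.id = aco.activity_id
    WHERE  act.start_date >= '2024-05-15'
    AND    act.start_date <  '2024-06-01'
    GROUP  BY con.account_id
    ORDER  BY con.account_id
""").fetchall()

print(naive_vs_distinct)  # [(100, 4, 3), (200, 0, 0)]
print(filtered_in_join)   # [(100, 1), (200, 0)]
print(filtered_in_where)  # [(100, 1)]
```

The first result shows the inflated count from the question (4 join rows for 3 activities) next to the folded count. The last two differ only in where the date predicate sits, which is why the answer keeps it in the JOIN clause.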

Answer