So for the following schema:
CREATE TABLE activity ( id integer NOT NULL, start_date date NOT NULL ); CREATE TABLE account ( id integer NOT NULL, name varchar NOT NULL ); CREATE TABLE contact ( id integer NOT NULL, account_id integer NOT NULL, name varchar NOT NULL ); CREATE TABLE activity_contact ( id integer NOT NULL, contact_id integer NOT NULL, activity_id integer NOT NULL ); insert into activity(id, start_date) values (1, '2021-11-03'), (2, '2021-10-03'), (3, '2021-11-02'); insert into account(id, name) values (1, 'Test Account'); insert into contact(id, account_id, name) values (1, 1, 'John'), (2, 1, 'Kevin'); insert into activity_contact(id, contact_id, activity_id) values (1, 1, 1), (2, 2, 1), (3, 2, 2), (4, 1, 3);
You can see that there are 3 activities and each contact has two. What i am searching for is the number of activities per account in the previous two months. So I have the following query
SELECT contact.account_id AS accountid, count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '1 mon'::interval)) AS last_month, count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '2 mons'::interval)) AS prev_month FROM activity JOIN activity_contact ON activity_contact.activity_id = activity.id JOIN contact ON contact.id = activity_contact.contact_id JOIN account ON contact.account_id = account.id GROUP BY contact.account_id;
This returns:
accountid last_month prev_month 1 3 1
However this is incorrect. There are only 3 activities, its just that each contact sees activity 1. so it is counting that activity twice. Is there a way for me to only count each activity id one time so there is no duplication?
Advertisement
Answer
count(DISTINCT activity_id)
to fold duplicates in the count, like Edouard suggested.
But there is more:
SELECT con.account_id AS accountid , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '1 mon') AND act.start_date < date_trunc('month', LOCALTIMESTAMP)) AS last_month , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon') AND act.start_date < date_trunc('month', LOCALTIMESTAMP - interval '1 mon')) AS prev_month FROM activity act JOIN activity_contact aco ON aco.activity_id = act.id AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon') AND act.start_date < date_trunc('month', LOCALTIMESTAMP) RIGHT JOIN contact con ON con.id = aco.contact_id -- JOIN account acc ON con.account_id = acc.id -- noise GROUP BY 1;
db<>fiddle here
Most importantly, add an outer
WHERE
clause to the query to filter irrelevant rows early. This can make a big difference for a small selection from a big table.
We have to move that predicate to theJOIN
clause, lest we’d exclude accounts with no activity. (LEFT JOIN
andRIGHT JOIN
can both be used, mirroring each other.) See:Make that filter “sargable”, so it can use an index on
(start_date)
(unlike your original formulation). Again, big impact for a small selection from a big table.Use the same expressions for your aggregate filter clauses. Lesser effect, but take it.
Unlike other aggregate functions,count()
returns0
(notNULL
) for “no rows”, so we don’t have to do anything extra.Assuming referential integrity (enforced with a FK constraint), the join to table
account
is just expensive noise. Drop it.CURRENT_DATE
is not wrong. But since your expressions yieldtimestamp
anyway, it’s bit more efficient to useLOCALTIMESTAMP
to begin with.
Compare with your original to see that this is quite a bit faster.
And I assume you are aware that this query introduces a dependency on the TimeZone
setting of the executing session. The current date depends on where in the world we ask. See:
If you are not bound to this particular output format, a pivoted form is simpler, now that we filter rows early:
SELECT con.account_id AS accountid , date_trunc('month', act.start_date) AS mon , count(DISTINCT aco.activity_id) AS dist_count FROM activity act JOIN activity_contact aco ON aco.activity_id = act.id AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon') AND act.start_date < date_trunc('month', LOCALTIMESTAMP) RIGHT JOIN contact con ON con.id = aco.contact_id GROUP BY 1, 2 ORDER BY 1, 2 DESC;
Again, we can include accounts without activity. But months without activity do not show up …