So for the following schema:
CREATE TABLE activity (
id integer NOT NULL,
start_date date NOT NULL
);
CREATE TABLE account (
id integer NOT NULL,
name varchar NOT NULL
);
CREATE TABLE contact (
id integer NOT NULL,
account_id integer NOT NULL,
name varchar NOT NULL
);
CREATE TABLE activity_contact (
id integer NOT NULL,
contact_id integer NOT NULL,
activity_id integer NOT NULL
);
insert into activity(id, start_date)
values
(1, '2021-11-03'),
(2, '2021-10-03'),
(3, '2021-11-02');
insert into account(id, name)
values
(1, 'Test Account');
insert into contact(id, account_id, name)
values
(1, 1, 'John'),
(2, 1, 'Kevin');
insert into activity_contact(id, contact_id, activity_id)
values
(1, 1, 1),
(2, 2, 1),
(3, 2, 2),
(4, 1, 3);
You can see that there are 3 activities and each contact has two. What i am searching for is the number of activities per account in the previous two months. So I have the following query
SELECT contact.account_id AS accountid,
count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '1 mon'::interval)) AS last_month,
count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '2 mons'::interval)) AS prev_month
FROM activity
JOIN activity_contact ON activity_contact.activity_id = activity.id
JOIN contact ON contact.id = activity_contact.contact_id
JOIN account ON contact.account_id = account.id
GROUP BY contact.account_id;
This returns:
accountid last_month prev_month
1 3 1
However this is incorrect. There are only 3 activities, its just that each contact sees activity 1. so it is counting that activity twice. Is there a way for me to only count each activity id one time so there is no duplication?
Advertisement
Answer
count(DISTINCT activity_id) to fold duplicates in the count, like Edouard suggested.
But there is more:
SELECT con.account_id AS accountid
, count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '1 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP)) AS last_month
, count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP - interval '1 mon')) AS prev_month
FROM activity act
JOIN activity_contact aco ON aco.activity_id = act.id
AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP)
RIGHT JOIN contact con ON con.id = aco.contact_id
-- JOIN account acc ON con.account_id = acc.id -- noise
GROUP BY 1;
db<>fiddle here
Most importantly, add an outer
WHEREclause to the query to filter irrelevant rows early. This can make a big difference for a small selection from a big table.
We have to move that predicate to theJOINclause, lest we’d exclude accounts with no activity. (LEFT JOINandRIGHT JOINcan both be used, mirroring each other.) See:Make that filter “sargable”, so it can use an index on
(start_date)(unlike your original formulation). Again, big impact for a small selection from a big table.Use the same expressions for your aggregate filter clauses. Lesser effect, but take it.
Unlike other aggregate functions,count()returns0(notNULL) for “no rows”, so we don’t have to do anything extra.Assuming referential integrity (enforced with a FK constraint), the join to table
accountis just expensive noise. Drop it.CURRENT_DATEis not wrong. But since your expressions yieldtimestampanyway, it’s bit more efficient to useLOCALTIMESTAMPto begin with.
Compare with your original to see that this is quite a bit faster.
And I assume you are aware that this query introduces a dependency on the TimeZone setting of the executing session. The current date depends on where in the world we ask. See:
If you are not bound to this particular output format, a pivoted form is simpler, now that we filter rows early:
SELECT con.account_id AS accountid
, date_trunc('month', act.start_date) AS mon
, count(DISTINCT aco.activity_id) AS dist_count
FROM activity act
JOIN activity_contact aco ON aco.activity_id = act.id
AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP)
RIGHT JOIN contact con ON con.id = aco.contact_id
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
Again, we can include accounts without activity. But months without activity do not show up …