Skip to content
Advertisement

Ensuring no dupe ids in query return

So for the following schema:

CREATE TABLE activity (
    id integer NOT NULL,
    start_date date NOT NULL
);

CREATE TABLE account (
    id integer NOT NULL,
    name varchar NOT NULL
);

CREATE TABLE contact (
    id integer NOT NULL,
    account_id integer NOT NULL,
    name varchar NOT NULL
);

CREATE TABLE activity_contact (
    id integer NOT NULL,
    contact_id integer NOT NULL,
    activity_id integer NOT NULL
);

insert into activity(id, start_date)
values
(1, '2021-11-03'),
(2, '2021-10-03'),
(3, '2021-11-02');

insert into account(id, name)
values
(1, 'Test Account');

insert into contact(id, account_id, name)
values
(1, 1, 'John'),
(2, 1, 'Kevin');

insert into activity_contact(id, contact_id, activity_id)
values
(1, 1, 1),
(2, 2, 1),
(3, 2, 2),
(4, 1, 3);

You can see that there are 3 activities and each contact has two. What i am searching for is the number of activities per account in the previous two months. So I have the following query

SELECT contact.account_id AS accountid,
    count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '1 mon'::interval)) AS last_month,
    count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '2 mons'::interval)) AS prev_month
FROM activity
JOIN activity_contact ON activity_contact.activity_id = activity.id
JOIN contact ON contact.id = activity_contact.contact_id
JOIN account ON contact.account_id = account.id
GROUP BY contact.account_id;

This returns:

accountid   last_month  prev_month
    1           3           1

However this is incorrect. There are only 3 activities, its just that each contact sees activity 1. so it is counting that activity twice. Is there a way for me to only count each activity id one time so there is no duplication?

Advertisement

Answer

count(DISTINCT activity_id) to fold duplicates in the count, like Edouard suggested.
But there is more:

SELECT con.account_id AS accountid
     , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '1 mon')
                                               AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP)) AS last_month
     , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                                               AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP - interval '1 mon')) AS prev_month
FROM   activity         act
JOIN   activity_contact aco ON aco.activity_id = act.id
                           AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                           AND act.start_date <  date_trunc('month', LOCALTIMESTAMP)
RIGHT  JOIN contact     con ON con.id = aco.contact_id
-- JOIN   account       acc ON con.account_id = acc.id  -- noise
GROUP  BY 1;

db<>fiddle here

  • Most importantly, add an outer WHERE clause to the query to filter irrelevant rows early. This can make a big difference for a small selection from a big table.
    We have to move that predicate to the JOIN clause, lest we’d exclude accounts with no activity. (LEFT JOIN and RIGHT JOIN can both be used, mirroring each other.) See:

  • Make that filter “sargable”, so it can use an index on (start_date) (unlike your original formulation). Again, big impact for a small selection from a big table.

  • Use the same expressions for your aggregate filter clauses. Lesser effect, but take it.
    Unlike other aggregate functions, count() returns 0 (not NULL) for “no rows”, so we don’t have to do anything extra.

  • Assuming referential integrity (enforced with a FK constraint), the join to table account is just expensive noise. Drop it. CURRENT_DATE is not wrong. But since your expressions yield timestamp anyway, it’s bit more efficient to use LOCALTIMESTAMP to begin with.

Compare with your original to see that this is quite a bit faster.

And I assume you are aware that this query introduces a dependency on the TimeZone setting of the executing session. The current date depends on where in the world we ask. See:


If you are not bound to this particular output format, a pivoted form is simpler, now that we filter rows early:

SELECT con.account_id AS accountid
     , date_trunc('month', act.start_date) AS mon
     , count(DISTINCT aco.activity_id) AS dist_count
FROM   activity         act
JOIN   activity_contact aco ON aco.activity_id = act.id
                           AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                           AND act.start_date <  date_trunc('month', LOCALTIMESTAMP)
RIGHT  JOIN  contact    con ON con.id = aco.contact_id
GROUP  BY 1, 2
ORDER  BY 1, 2 DESC;

Again, we can include accounts without activity. But months without activity do not show up …

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement