Skip to content
Advertisement

Enforce 2 unique constraints on upserts in Postgres

I have a table where I save contacts data

                                                Table "public.person"
            Column             |           Type           | Collation | Nullable |              Default               
-------------------------------+--------------------------+-----------+----------+------------------------------------
 id                            | integer                  |           | not null | nextval('person_id_seq'::regclass)
 full_name                     | character varying        |           |          | 
 role                          | character varying        |           |          | 
 first_name                    | character varying        |           |          | 
 last_name                     | character varying        |           |          | 
 linkedin_slug                 | character varying        |           |          | 
 email                         | character varying        |           |          | 
 domain                        | character varying        |           |          | 
 created_at                    | timestamp with time zone |           |          | now()
 updated_at                    | timestamp with time zone |           |          | now()
Indexes:
    "pk_person" PRIMARY KEY, btree (id)
    "ix_person_domain" btree (domain)
    "ix_person_email" btree (email)
    "ix_person_updated_at" btree (updated_at)
    "uq_person_full_name_domain" UNIQUE CONSTRAINT, btree (full_name, domain)

I add data to this table from several sources. Some sources have Linkedin profiles data about people, other sources have email data. Sometimes the full names are not equal, even if they refer to the same person.

And I want to do upserts to not have duplicated data. For now I’m using the constraint on full_name, domain. I know it’s an oversimplification as there may be 2 different people with the same full name in the same company, but that’s not a problem at this moment.

The problem comes when a person has different full names in the different data sources I use, but the same Linkedin profile, so I know it’s the same person.

Or when they’re associated to 2 domains from the same company.

In those cases, I end up with duplicated rows for some people. For example:

full_name domain linkedin_slug
Raffi SARKISSIAN getlago.com sarkissianraffi
Raffi Sarkissian getlago.com sarkissianraffi

That one is a trivial one that could be solved making the constraint on lower(full_name), domain, but there are cases where the last name is not the same (people have more than 1 last name in many countries and they may not use them all sometimes).

Another example

full_name domain linkedin_slug
Amir Manji tenjin.com amirmanji
Amir Manji tenjin.io amirmanji

Ideally I’d like to be able to enforce more than 1 constraint at the same time in Postgres, but I’ve seen it’s not easy or out-of-the-box. I don’t/can’t create a unique constraint on (full_name, domain, linkedin_slug). And the solution from the accepted answer is not so good for my use case because I have way more cols than in that example and I’d have to write a different upsert function for each data source (not all of them have the same attributes)

What I’m thinking is making a script to deduplicate the info ‘manually’ after inserting new data, but I’m not sure if there are better ways to address this.

How would you go about it?

Advertisement

Answer

Update: I ended up doing it by first doing upserts enforcing full_name, domain is unique, then deduplicating on linkedin_slug running a script that basically groups by linkedin_slug and gets whatever values are not null:

    SELECT
        max(id) id
        , max(full_name) full_name
        , max("role") "role"
        , max(first_name) first_name
        , max(last_name) last_name
        , linkedin_slug
        , max(linkedin_id) linkedin_id
        , max(email) email
        , max("domain") "domain"
        , max(yc_bio) yc_bio
        , min(created_at) created_at
        , now() updated_at
        , max(extrapolated_email_confidence) extrapolated_email_confidence
        , max(email_status) email_status
        , max(email_searched_on_apollo::text)::bool email_searched_on_apollo
    FROM person
    GROUP BY linkedin_slug
    HAVING count(*) > 1

And then updating the original table with data from this subquery.

The full gist is here

9 People found this is helpful
Advertisement