I have a table where I save contacts data
```
                                     Table "public.person"
    Column     |           Type           | Collation | Nullable |              Default
---------------+--------------------------+-----------+----------+------------------------------------
 id            | integer                  |           | not null | nextval('person_id_seq'::regclass)
 full_name     | character varying        |           |          |
 role          | character varying        |           |          |
 first_name    | character varying        |           |          |
 last_name     | character varying        |           |          |
 linkedin_slug | character varying        |           |          |
 email         | character varying        |           |          |
 domain        | character varying        |           |          |
 created_at    | timestamp with time zone |           |          | now()
 updated_at    | timestamp with time zone |           |          | now()
Indexes:
    "pk_person" PRIMARY KEY, btree (id)
    "ix_person_domain" btree (domain)
    "ix_person_email" btree (email)
    "ix_person_updated_at" btree (updated_at)
    "uq_person_full_name_domain" UNIQUE CONSTRAINT, btree (full_name, domain)
```
I add data to this table from several sources. Some sources have LinkedIn profile data about people, others have email data. Sometimes the full names are not equal, even when they refer to the same person.
And I want to do upserts so I don't end up with duplicated data. For now I'm using the unique constraint on (full_name, domain). I know it's an oversimplification, as there may be two different people with the same full name in the same company, but that's not a problem at this moment.
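For reference, an upsert against that constraint presumably looks something like this (a sketch only; the incoming values are hypothetical, and `COALESCE` keeps existing data when a source doesn't provide a field):

```sql
-- Hypothetical incoming row from one data source.
INSERT INTO person (full_name, domain, linkedin_slug, email)
VALUES ('Jane Doe', 'example.com', 'janedoe', 'jane@example.com')
ON CONFLICT (full_name, domain)
DO UPDATE SET
    -- prefer the new value, but keep the old one if the source lacks it
    linkedin_slug = COALESCE(EXCLUDED.linkedin_slug, person.linkedin_slug),
    email         = COALESCE(EXCLUDED.email, person.email),
    updated_at    = now();
```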
The problem comes when a person has different full names in the different data sources I use, but the same LinkedIn profile, so I know it's the same person. Or when they're associated with two domains from the same company.
In those cases, I end up with duplicated rows for some people. For example:
| full_name        | domain      | linkedin_slug   |
|------------------|-------------|-----------------|
| Raffi SARKISSIAN | getlago.com | sarkissianraffi |
| Raffi Sarkissian | getlago.com | sarkissianraffi |
That one is a trivial case that could be solved by making the constraint on (lower(full_name), domain), but there are cases where even the last name differs: in many countries people have more than one last name, and they may not always use all of them.
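The case-insensitive variant mentioned above would be a unique index on an expression rather than a plain constraint; a sketch (the index name is made up):

```sql
-- Hypothetical replacement for uq_person_full_name_domain.
-- Note: ON CONFLICT would then have to name the expression too,
-- i.e. ON CONFLICT (lower(full_name), domain).
CREATE UNIQUE INDEX uq_person_lower_name_domain
    ON person (lower(full_name), domain);
```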
Another example:
| full_name  | domain     | linkedin_slug |
|------------|------------|---------------|
| Amir Manji | tenjin.com | amirmanji     |
| Amir Manji | tenjin.io  | amirmanji     |
Ideally I'd like to be able to enforce more than one constraint at the same time in Postgres, but I've seen that's neither easy nor available out of the box. I don't/can't create a unique constraint on (full_name, domain, linkedin_slug). And the solution from the accepted answer is not a good fit for my use case, because I have far more columns than in that example and I'd have to write a different upsert function for each data source (not all of them have the same attributes).
What I'm thinking of is writing a script to deduplicate the data 'manually' after inserting new rows, but I'm not sure if there are better ways to address this.
How would you go about it?
Answer
Update: I ended up doing it by first doing upserts that enforce (full_name, domain) as unique, then deduplicating on linkedin_slug by running a script that basically groups by linkedin_slug and keeps whatever values are not null:
```sql
SELECT max(id) id
     , max(full_name) full_name
     , max("role") "role"
     , max(first_name) first_name
     , max(last_name) last_name
     , linkedin_slug
     , max(linkedin_id) linkedin_id
     , max(email) email
     , max("domain") "domain"
     , max(yc_bio) yc_bio
     , min(created_at) created_at
     , now() updated_at
     , max(extrapolated_email_confidence) extrapolated_email_confidence
     , max(email_status) email_status
     , max(email_searched_on_apollo::text)::bool email_searched_on_apollo
FROM person
GROUP BY linkedin_slug
HAVING count(*) > 1
```
And then updating the original table with data from this subquery.
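A rough sketch of what that final step could look like, keeping the `max(id)` row of each group as the survivor to match the query above (the column list here is abbreviated to the columns in the schema shown earlier):

```sql
-- Fold the merged values back into the surviving row of each group.
WITH dedup AS (
    SELECT max(id) id
         , max(full_name) full_name
         , max(email) email
         , max("domain") "domain"
         , linkedin_slug
    FROM person
    GROUP BY linkedin_slug
    HAVING count(*) > 1
)
UPDATE person p
SET full_name  = d.full_name
  , email      = d.email
  , "domain"   = d."domain"
  , updated_at = now()
FROM dedup d
WHERE p.id = d.id;

-- Then drop the non-surviving rows in each duplicate group.
DELETE FROM person p
USING (
    SELECT max(id) keep_id, linkedin_slug
    FROM person
    GROUP BY linkedin_slug
    HAVING count(*) > 1
) d
WHERE p.linkedin_slug = d.linkedin_slug
  AND p.id <> d.keep_id;
```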
The full gist is here