Skip to content
Advertisement

How to leave only distinct rows in the table. Postgresql

I need to delete all duplicate in 40kk of rows. I’ve got table:

CREATE TABLE IF NOT EXISTS players (
PRIMARY KEY (account_id, match_id, id),
id bigserial,
account_id bigint,
match_id bigint,
win boolean,
... );

I’ve tried this query but after 1h of executing I’ve gave up on waiting.

DELETE FROM players WHERE players.id NOT IN 
(SELECT id FROM (
    SELECT DISTINCT ON (match_id, account_id) * 
  FROM players) as X);

Is there any other solution to delete duplicates in more optimazed way?

UPD: I need to do it just once cause I didn’t handle the duplicates at the start

Advertisement

Answer

The database systems do not like delete and update processes. If you have permission to control transactions on this table and if you do not care which id will be deleted you can follow this approach.

1.Create a table with unique values.

2.Switch the table names.

create  table players_tmp as select match_id, account_id, win, id from (select match_id, account_id, win, id,
rank() OVER (PARTITION BY match_id, account_id ORDER BY id) as rn
from players) r where rn = 1;

alter table players rename to players_tmp2;

alter table players_tmp rename to players;

If you have a concern about which id will be deleted then you can edit window function.

Fiddle