
Removing duplicates in all fields but one from a table in Redshift

I have a massive table with over 1.3B records. This table is based on web event-level data. Its unique key is mongo_id. The table has many duplicate records, but for these duplicates the mongo_id still differs. In other words, there are duplicate records where everything except the mongo_id field is identical. I’d like to delete these records from the table.

If all the fields in the table were duplicated, I could have used something like this to delete the duplicates:

create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;

But since the records are duplicated in all fields except mongo_id, I do not know how to identify and delete the duplicates.

Any help in this matter would be greatly appreciated.


Answer

I would suggest recreating the table:

create table new_t as 
    select <all other columns>, mongo_id
    from (select t.*,
                 -- number the rows within each group of otherwise-identical
                 -- records, ordering deterministically by mongo_id
                 row_number() over (partition by <all other columns> order by mongo_id) as seqnum
          from t
         ) t
    where seqnum = 1;  -- keep one row per group; the explicit outer
                       -- column list drops the helper seqnum column
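
If the goal is to replace the original table wholesale, the same rename swap shown in the question can be applied to this result; a minimal sketch, assuming the original table is named t:

alter table t rename to t_old;
alter table new_t rename to t;
drop table t_old;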

Alternatively, if you must put the data back in place, you can truncate the existing table and then copy these results into it.
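
A minimal sketch of that approach, assuming new_t was built as above:

truncate table t;
insert into t select * from new_t;
drop table new_t;

Note that TRUNCATE commits the current transaction immediately in Redshift, so this swap is not atomic.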

Deleting lots of rows from a table can be much more expensive than running a query and saving the results to a new table; in Redshift, a DELETE only marks rows as deleted, and the space is not reclaimed until the table is vacuumed.
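
For comparison, a minimal sketch of the in-place delete, assuming the table is named t (keep_ids is a hypothetical staging table, and <all other columns> again stands for every column except mongo_id):

-- Keep the lowest mongo_id from each group of otherwise-identical rows
create temp table keep_ids as
    select min(mongo_id) as mongo_id
    from t
    group by <all other columns>;

delete from t
where mongo_id not in (select mongo_id from keep_ids);

-- Reclaim the space left behind by the deleted rows
vacuum t;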
