I have a massive table with over 1.3B records of web-event-level data. It has a unique key, mongo_id. The table has many duplicate records, but for these duplicates the mongo_id is still different. In other words, there are records that are duplicated in every field except mongo_id.
I’d like to delete these records from the table.
If all the fields in the table were duplicated, I could have used something like this to delete the duplicates:
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
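For reference, here is a minimal runnable demo of that distinct-* swap, using SQLite via Python (the table and column names here are made up for illustration, not taken from the real table):

```python
import sqlite3

# Demo of the "create distinct copy, then swap names" dedup technique.
# table_name/user_id/event are hypothetical stand-ins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table table_name (user_id int, event text);
insert into table_name values (1, 'click'), (1, 'click'), (2, 'view');

create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
""")

rows = conn.execute("select * from table_name order by user_id").fetchall()
print(rows)  # [(1, 'click'), (2, 'view')]
```

After the swap, table_name holds one copy of each fully duplicated row.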
But since my records are duplicated in all fields except mongo_id, I do not know how to identify and delete the duplicates.
Any help in this matter would be greatly appreciated.
Answer
I would suggest recreating the table:
create table new_t as 
    select t.*
    from (select <all other columns>,
                 row_number() over (partition by <all other columns> order by mongo_id) as seqnum
          from t
         ) t
    where seqnum = 1;
If you must put the data back in the original table, you can truncate the existing table and then copy these results into it.
Deleting a large number of rows from a table can be much more expensive than running a query and saving its results to a new table.
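Here is a small runnable sketch of the row_number() approach above, using SQLite via Python (SQLite 3.25+ supports window functions). The column names user_id and event are hypothetical stand-ins for "all other columns":

```python
import sqlite3

# Sample table where rows a1 and a2 are identical except for mongo_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table t (mongo_id text, user_id int, event text);
insert into t values
  ('a1', 1, 'click'),
  ('a2', 1, 'click'),
  ('b1', 2, 'view');
""")

# Keep one row per group of non-key columns, choosing the lowest mongo_id.
conn.executescript("""
create table new_t as
  select mongo_id, user_id, event
  from (select t.*,
               row_number() over (partition by user_id, event
                                  order by mongo_id) as seqnum
        from t) t
  where seqnum = 1;
""")

rows = conn.execute(
    "select mongo_id, user_id, event from new_t order by mongo_id"
).fetchall()
print(rows)  # [('a1', 1, 'click'), ('b1', 2, 'view')]
```

row_number() restarts at 1 within each partition, so filtering on seqnum = 1 keeps exactly one representative per duplicate group while preserving its mongo_id.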