I have a massive table with over 1.3B records of web-event-level data. It has a unique key, mongo_id. The table contains many duplicate records, but for these duplicates the mongo_id is still different. In other words, there are records where everything except the mongo_id field is duplicated.
I’d like to delete these records from the table.
If all the fields in the table were duplicated, I could have used something like this to delete the duplicates:
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
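For illustration, the table-swap approach above can be sketched against SQLite from Python (the table name and columns here are hypothetical, not from the real dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table where some rows are duplicated in every field
cur.execute("create table table_name (user_id int, event text)")
cur.executemany("insert into table_name values (?, ?)",
                [(1, "click"), (1, "click"), (2, "view")])

# Rebuild the table from the distinct rows, then swap the names
cur.execute("create table table_name_new as select distinct * from table_name")
cur.execute("alter table table_name rename to table_name_old")
cur.execute("alter table table_name_new rename to table_name")
cur.execute("drop table table_name_old")

count = cur.execute("select count(*) from table_name").fetchone()[0]
print(count)  # → 2
```

This only works when entire rows are identical, which is exactly why it does not apply here.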
But since the records are duplicated in every field except mongo_id, I do not know how to identify and delete the duplicates.
Any help in this matter would be greatly appreciated.
Answer
I would suggest recreating the table:
create table new_t as
    select t.*
    from (select <all other columns>,
                 row_number() over (partition by <all other columns>
                                    order by mongo_id) as seqnum
          from t
         ) t
    where seqnum = 1;
You can truncate the existing table and then copy these results into it, if you must put the data back in place.
Deleting lots of rows in a table can be much more expensive than using a query and saving the results.
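As a small sketch of the row_number() approach, again using SQLite from Python (SQLite 3.25+ supports window functions; the column names user_id and event stand in for "all other columns" and are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical event table: rows 'a' and 'b' duplicate each other
# in every field except mongo_id
cur.execute("create table t (mongo_id text, user_id int, event text)")
cur.executemany("insert into t values (?, ?, ?)",
                [("a", 1, "click"), ("b", 1, "click"), ("c", 2, "view")])

# Keep one row per group of otherwise-identical rows
# (the one with the lowest mongo_id, via the order by)
cur.execute("""
    create table new_t as
    select mongo_id, user_id, event
    from (select t.*,
                 row_number() over (partition by user_id, event
                                    order by mongo_id) as seqnum
          from t)
    where seqnum = 1
""")

rows = cur.execute("select mongo_id from new_t order by mongo_id").fetchall()
print(rows)  # → [('a',), ('c',)]
```

The outer select lists the real columns rather than t.* so that the helper seqnum column is not carried into the new table.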