How to remove duplicate entries from a large Oracle table (200M rows, 20 columns)?
The below query from 2014 is slow. It took 2 minutes to delete 1 duplicate entry for one specific combination of columns (i.e. where col1 = 1 and .. col20 = 'Z'
).
x
DELETE sch.table1
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM sch.table1
GROUP BY col1, col2, col3, col4,.. ., col20)
Any way to speed it up, e.g. with indexing?
Advertisement
Answer
Rather than using an anti-join (and finding the non-matching ROWID
and then deleting all the others), you can use the ROW_NUMBER
analytic function to directly find the ROWID
s to delete:
DELETE FROM sch.table1 t
WHERE EXISTS(
SELECT 1
FROM (
SELECT ROW_NUMBER() OVER (
PARTITION BY col1, col2, col3, col4, , col20
ORDER BY rowid
) AS rn
FROM sch.table1
) x
WHERE x.ROWID = t.ROWID
AND x.rn > 1
);
or:
DELETE FROM sch.table1 t
WHERE ROWID IN (
SELECT ROWID
FROM (
SELECT ROW_NUMBER() OVER (
PARTITION BY col1, col2, col3, col4, , col20
ORDER BY rowid
) AS rn
FROM sch.table1
)
WHERE rn > 1
);