I want to know how I can find duplicate data entries within one table in clickhouse.
I am actually investigating on a merge tree table and actually threw optimize statements at my table but that didn’t do the trick. The duplicate entries still persist.
Preferred would be to have a universal strategy without referencing individual column names.
I only want to see the duplicate entries, since I am working on very large tables.
Advertisement
Answer
The straight forward way would be to run this query.
x
SELECT
*,
count() AS cnt
FROM myDB.myTable
GROUP BY *
HAVING cnt > 1
ORDER BY date ASC
If that query gets to big you can run it in pieces.
SELECT
*,
count() AS cnt
FROM myDB.myTable
WHERE (date >= '2020-08-01') AND (date < '2020-09-01')
GROUP BY *
HAVING cnt > 1
ORDER BY date ASC