I want to know how I can find duplicate data entries within one table in clickhouse.
I am actually investigating on a merge tree table and actually threw optimize statements at my table but that didn’t do the trick. The duplicate entries still persist.
Preferred would be to have a universal strategy without referencing individual column names.
I only want to see the duplicate entries, since I am working on very large tables.
Advertisement
Answer
The straight forward way would be to run this query.
SELECT *, count() AS cnt FROM myDB.myTable GROUP BY * HAVING cnt > 1 ORDER BY date ASC
If that query gets to big you can run it in pieces.
SELECT *, count() AS cnt FROM myDB.myTable WHERE (date >= '2020-08-01') AND (date < '2020-09-01') GROUP BY * HAVING cnt > 1 ORDER BY date ASC