Skip to content
Advertisement

How can I find duplicate records in clickhouse [closed]

I want to know how I can find duplicate data entries within one table in clickhouse.

I am actually investigating on a merge tree table and actually threw optimize statements at my table but that didn’t do the trick. The duplicate entries still persist.

Preferred would be to have a universal strategy without referencing individual column names.

I only want to see the duplicate entries, since I am working on very large tables.

Advertisement

Answer

The straight forward way would be to run this query.

SELECT 
    *, 
    count() AS cnt
FROM myDB.myTable
GROUP BY *
HAVING cnt > 1
ORDER BY date ASC

If that query gets to big you can run it in pieces.

SELECT 
    *,
    count() AS cnt
FROM myDB.myTable
WHERE (date >= '2020-08-01') AND (date < '2020-09-01')
GROUP BY *
HAVING cnt > 1
ORDER BY date ASC
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement