How to find duplicate rows in Hive?

Question

I want to find duplicate rows in one of my Hive tables, and I was given two approaches. The first approach is to use the following two queries; the second query gives the count of distinct rows. With this approach, for one of my tables the total row count derived using the first query is 3500 and second qu…

Accepted Answer

Hive does not validate primary and foreign key constraints. Since these constraints are not validated, an upstream system needs to ensure data integrity before the data is loaded into Hive. That means that Hive allows duplicates in primary keys.

To solve your issue, you should do something like this:

    select [every column], count(*)
    from mytable
    group by [every column]
    having count(*) > 1;

This way you will get a list of the duplicated rows, each with the number of times it occurs.
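To see the group-by-every-column pattern in action without a Hive cluster, here is a minimal sketch that runs the same shape of query against an in-memory SQLite database from Python. SQLite is only a stand-in here, and the table layout (`id`, `name`, `city`) and the sample rows are made up for illustration. The `GROUP BY ... HAVING COUNT(*) > 1` query itself is standard SQL and should carry over to Hive once `[every column]` is replaced with the real column list.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate the query pattern.
# SQLite stands in for Hive here; it is not part of the original answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, name TEXT, city TEXT)")
rows = [
    (1, "alice", "paris"),
    (1, "alice", "paris"),  # exact duplicate of the row above
    (2, "bob", "rome"),
    (3, "carol", "oslo"),
    (3, "carol", "oslo"),   # duplicate
    (3, "carol", "oslo"),   # duplicate again
]
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)

# Group by every column: any group with more than one member is a
# duplicated row, and cnt says how many copies exist.
duplicates = conn.execute(
    """
    SELECT id, name, city, COUNT(*) AS cnt
    FROM mytable
    GROUP BY id, name, city
    HAVING COUNT(*) > 1
    ORDER BY id
    """
).fetchall()

# Total rows versus distinct rows: the difference is the number of
# surplus copies, but it does not say which rows are duplicated.
total_rows = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT id, name, city FROM mytable) AS t"
).fetchone()[0]

print(duplicates)
print(total_rows, distinct_rows)
```

Comparing the total and distinct counts only tells you whether duplicates exist and how many surplus rows there are. The grouped query also shows which rows are repeated and how often, which is usually what you need before cleaning the table.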

Answer