Skip to content
Advertisement

How to find duplicate rows in Hive?

I want to find duplicate rows from one of the Hive table for which I was given two approaches.

First approach is to use following two queries:

second query is as below which will give count of distinct rows

With this approach, for one of my table total row count derived using first query is 3500 and second query gives row count 2700. So it tells us that 3500 - 2700 = 800 rows are duplicate. But this query doesn’t tell which rows are duplicated.

My second approach to find duplicate is:

Above query should list of rows which are duplicated and how many times particular row is duplicated. but this query shows zero rows which means there are no duplicate rows in that table.

So I would like to know:

  1. If my first approach is correct – if yes then how do I find which rows are duplicated
  2. Why second approach is not providing list of rows which are duplicated?
  3. Is there any other way to find the duplicates?

Advertisement

Answer

Hive does not validate primary and foreign key constraints.

Since these constraints are not validated, an upstream system needs to ensure data integrity before it is loaded into Hive.

That means that Hive allows duplicates in Primary Keys.

To solve your issue, you should do something like this:

This way you will get list of duplicated rows.

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement