Identify duplicate records in a DataFrame

I have a DataFrame, shown below, that holds the full name of a person:

-------------------
| f_name | l_name |
-------------------
| abc    | xyz    |
| xyz    | abc    |
| pqr    | lmn    |
-------------------

Here the second row is essentially the same as the first row.

Consider the case where an entry arrives in the data with the last name mistakenly placed under the first name (f_name) and the first name placed under the last name (l_name).

How can I identify and drop/resolve such duplicate/erroneous records in a Spark DataFrame?

Desired Result:

-------------------
| f_name | l_name |
-------------------
| abc    | xyz    |
| pqr    | lmn    |
-------------------

The solution could use a UDF, SQL, or both. Thanks!


Answer

Use the dropDuplicates function available on Dataset with a proper key, here a sorted array of the two name columns so that the order of first and last name no longer matters:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("abc", "xyz"),
  ("xyz", "abc"),
  ("pqr", "lmn")
).toDF("f_name", "l_name")

// Build an order-insensitive key by sorting the two name columns,
// then keep only the first row seen for each key
df.withColumn("key", array_sort(array('f_name, 'l_name)))
  .dropDuplicates("key")

+------+------+----------+
|f_name|l_name|       key|
+------+------+----------+
|   pqr|   lmn|[lmn, pqr]|
|   abc|   xyz|[abc, xyz]|
+------+------+----------+
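
The helper key column can then be removed to match the desired result by chaining .drop("key"). Since the question also mentions SQL, below is a rough Spark SQL sketch of the same idea; it assumes the DataFrame has been registered as a temporary view named people (a name chosen here for illustration) and uses least/greatest to build the order-insensitive key:

df.createOrReplaceTempView("people")

val deduped = spark.sql("""
  SELECT f_name, l_name
  FROM (
    SELECT *,
           row_number() OVER (
             PARTITION BY least(f_name, l_name), greatest(f_name, l_name)
             ORDER BY f_name
           ) AS rn
    FROM people
  ) t
  WHERE rn = 1
""")

Note that with row_number an arbitrary but deterministic row is kept per pair (here the one with the smaller f_name); adjust the ORDER BY if you need a different tie-breaking rule.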