I have a dataframe as below which identifies full name of any person:
------------------- | f_name | l_name | ------------------- | abc | xyz | | xyz | abc | | pqr | lmn | -------------------
Here the second row is basically same as first row.
Consider the case where an entry has come in the data where by mistake last name is put under first name(f_name) and first name is put under last name(l_name)
How can I identify and drop/resolve such duplicate/erroneous records in spark dataframe?
Desired Result:
------------------- | f_name | l_name | ------------------- | abc | xyz | | pqr | lmn | -------------------
Solution could be with udf
or SQL
or both. Thnx!
Advertisement
Answer
Use dropDuplicates
function available for Dataset
with the proper key:
val df = Seq( ("abc", "xyz"), ("xyz", "abc"), ("pqr", "lmn") ).toDF("f_name", "l_name") df.withColumn("key", array_sort(array('f_name, 'l_name))).dropDuplicates("key") +------+------+----------+ |f_name|l_name| key| +------+------+----------+ | pqr| lmn|[lmn, pqr]| | abc| xyz|[abc, xyz]| +------+------+----------+