I have a Scala/Spark question. I'm using Spark 2.1.1, and I have a DataFrame that looks like this:
client | transaction | amount | machine |
---|---|---|---|
0000001 | transaction1 | -0.010000 | user000000001 |
0000002 | transaction2 | 0.010000 | user000000001 |
0000002 | transaction2 | 0.010000 | user000000002 |
0000002 | transaction2 | 0.010000 | user000000003 |
0000003 | transaction3 | -0.010000 | user000000004 |
0000003 | transaction3 | -0.010000 | user000000002 |
0000003 | transaction3 | -0.010000 | user000000003 |
0000003 | transaction3 | -0.010000 | user000000011 |
0000001 | transaction4 | 0.010000 | user000000011 |
I also have another DataFrame, a subset of the first:
client | transaction | amount | machine |
---|---|---|---|
0000002 | transaction2 | 0.010000 | user000000001 |
0000002 | transaction2 | 0.010000 | user000000002 |
0000002 | transaction2 | 0.010000 | user000000003 |
0000003 | transaction3 | -0.010000 | user000000004 |
0000003 | transaction3 | -0.010000 | user000000002 |
0000003 | transaction3 | -0.010000 | user000000003 |
0000003 | transaction3 | -0.010000 | user000000011 |
How can I filter the first DataFrame using the second one? I don’t know whether there is a subtract operation that can use two fields as the filter condition, or some way to perform a join/union with two different conditions. Why use two fields as the condition? If you analyze the table, you can see that transaction2 and transaction3 are repeated n times with different machine identifiers. I need to keep only the rows whose machine matches the machine of a non-repeated transaction. In other words, I need a table like this:
client | transaction | amount | machine |
---|---|---|---|
0000001 | transaction1 | -0.010000 | user000000001 |
0000002 | transaction2 | 0.010000 | user000000001 |
0000003 | transaction3 | -0.010000 | user000000011 |
0000001 | transaction4 | 0.010000 | user000000011 |
I would greatly appreciate your help and guidance with this!
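For reference, here is a minimal sketch that reproduces the sample data above. It assumes a `SparkSession` is already in scope as `spark`; the variable names `dataframe` and `subset` are just illustrative:

```scala
import spark.implicits._

// Sample data matching the first table: client, transaction, amount, machine
val dataframe = Seq(
  ("0000001", "transaction1", -0.01, "user000000001"),
  ("0000002", "transaction2",  0.01, "user000000001"),
  ("0000002", "transaction2",  0.01, "user000000002"),
  ("0000002", "transaction2",  0.01, "user000000003"),
  ("0000003", "transaction3", -0.01, "user000000004"),
  ("0000003", "transaction3", -0.01, "user000000002"),
  ("0000003", "transaction3", -0.01, "user000000003"),
  ("0000003", "transaction3", -0.01, "user000000011"),
  ("0000001", "transaction4",  0.01, "user000000011")
).toDF("client", "transaction", "amount", "machine")

// The subset shown above happens to contain exactly the repeated transactions
val subset = dataframe.filter($"transaction".isin("transaction2", "transaction3"))
```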
Answer
If you want to subtract the subset dataframe from the first dataframe, you can use a left anti join, as follows:
```scala
dataframe.join(subset, dataframe.columns, "left_anti")
```
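As a side note, since the join condition covers every column, `dataframe.except(subset)` should produce the same subtraction here; be aware, though, that `except` behaves like SQL's EXCEPT DISTINCT and also de-duplicates the surviving rows, so the left anti join is the safer general choice.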
Given your input dataframe and your subset, you will get:
```
+-------+------------+------+-------------+
|client |transaction |amount|machine      |
+-------+------------+------+-------------+
|0000001|transaction1|-0.01 |user000000001|
|0000001|transaction4|0.01  |user000000011|
+-------+------------+------+-------------+
```
Then you can take the machine column and use an inner join to filter your first dataframe down to the matching rows. The complete code would be as follows:
```scala
dataframe.join(subset, dataframe.columns, "left_anti")
  .select("machine")
  .join(dataframe, Seq("machine"))
```
And you will get your expected result:
```
+-------------+-------+------------+------+
|machine      |client |transaction |amount|
+-------------+-------+------------+------+
|user000000001|0000001|transaction1|-0.01 |
|user000000001|0000002|transaction2|0.01  |
|user000000011|0000003|transaction3|-0.01 |
|user000000011|0000001|transaction4|0.01  |
+-------------+-------+------------+------+
```
However, in your case, I don’t think you need to build the subset dataframe at all; you can get your result using only the first dataframe, as follows:
```scala
import org.apache.spark.sql.functions.{col, count, first}

dataframe.groupBy("transaction")
  .agg(count("transaction").as("total"), first("machine").as("machine"))
  .filter(col("total") === 1)
  .select("machine")
  .join(dataframe, Seq("machine"))
```
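A note on the use of `first("machine")`: because `filter(col("total") === 1)` keeps only the groups that contain a single row, `first` is deterministic here; for groups with several machines it could return any of them, but those groups are dropped anyway.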