I use this query in SQL to return how many user_ids have more than one email. How would I write this same query against a users DataFrame in Scala? Also, how would I be able to return the exact emails for each user_id?
```sql
SELECT DISTINCT user_id FROM Users GROUP BY user_id HAVING COUNT(DISTINCT email) > 1
```
Answer
Let's assume that you have a DataFrame of users. In Spark, you could create a sample of such a DataFrame like this:
```scala
import spark.implicits._

val df = Seq(
  ("me", "contact@me.com"),
  ("me", "me@company.com"),
  ("you", "you@company.com")
).toDF("user_id", "email")

df.show()
// +-------+---------------+
// |user_id|          email|
// +-------+---------------+
// |     me| contact@me.com|
// |     me| me@company.com|
// |    you|you@company.com|
// +-------+---------------+
```
Now, the logic would be very similar as the one you have in SQL:
```scala
import org.apache.spark.sql.functions.countDistinct

df.groupBy("user_id")
  .agg(countDistinct("email") as "count")
  .where('count > 1)
  .show()
// +-------+-----+
// |user_id|count|
// +-------+-----+
// |     me|    2|
// +-------+-----+
```
Then you can add a `.drop("count")` or a `.select("user_id")` to keep only the user ids.
Note that there is no `having` clause in Spark. Once you have called `agg` to aggregate your DataFrame by user, you have a regular DataFrame on which you can call any transformation function, such as a filter on the `count` column here.
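To answer the second part of the question (returning the exact emails for each user_id), one option is to collect the distinct emails into an array column with `collect_set` in the same aggregation. A sketch, reusing the sample `df` from above:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{collect_set, countDistinct}

// Aggregate once, keeping both the distinct count and the set of emails,
// then filter to users with more than one distinct email.
df.groupBy("user_id")
  .agg(
    collect_set("email") as "emails",     // array of distinct emails per user
    countDistinct("email") as "count"
  )
  .where('count > 1)
  .select("user_id", "emails")
  .show(false)
```

This would show `me` with both `contact@me.com` and `me@company.com` in the `emails` array; note that `collect_set` does not guarantee any particular ordering of the elements.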