How to get the COUNT of emails for each id in Scala

Question

I use this query in SQL to return how many user_ids have more than one email. How would I write this same query against a users DataFrame in Scala? Also, how would I be able to return to exact …

Accepted Answer

Let's assume that you have a DataFrame of users. In Spark, you could create a sample of such a DataFrame like this:

    import spark.implicits._

    val df = Seq(("me", "contact@me.com"),
                 ("me", "me@company.com"),
                 ("you", "you@company.com")).toDF("user_id", "email")
    df.show()

    +-------+---------------+
    |user_id|          email|
    +-------+---------------+
    |     me| contact@me.com|
    |     me| me@company.com|
    |    you|you@company.com|
    +-------+---------------+

Now, the logic is very similar to what you have in SQL:

    import org.apache.spark.sql.functions.countDistinct

    df.groupBy("user_id")
      .agg(countDistinct("email") as "count")
      .where($"count" > 1)
      .show()

    +-------+-----+
    |user_id|count|
    +-------+-----+
    |     me|    2|
    +-------+-----+

Then you can add a .drop("count") or a .select("user_id") to keep only the users.

Note that the DataFrame API has no equivalent of the SQL HAVING clause. Once you have called agg to aggregate your DataFrame by user, you have a regular DataFrame on which you can call any transformation, such as a filter on the count column here.
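For comparison, here is a minimal self-contained sketch of the same aggregation written both with the DataFrame API and as a Spark SQL query that uses HAVING. It assumes a local SparkSession created just for experimentation (in spark-shell, `spark` already exists). The object name `MultiEmailUsers`, its helper methods, and the temp view name `users` are hypothetical names chosen for this example, not something from the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct}

object MultiEmailUsers {

  // DataFrame API: aggregate per user, then filter on the aggregated column.
  def viaApi(df: DataFrame): DataFrame =
    df.groupBy("user_id")
      .agg(countDistinct("email").as("count"))
      .where(col("count") > 1)
      .select("user_id")

  // Spark SQL: the same logic expressed with GROUP BY ... HAVING.
  // Assumption: the DataFrame is registered under the view name "users".
  def viaSql(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("users")
    spark.sql(
      """SELECT user_id
        |FROM users
        |GROUP BY user_id
        |HAVING COUNT(DISTINCT email) > 1""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; adjust master/appName for a real cluster.
    val spark = SparkSession.builder()
      .appName("multi-email-users")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("me", "contact@me.com"),
      ("me", "me@company.com"),
      ("you", "you@company.com")
    ).toDF("user_id", "email")

    viaApi(df).show()
    viaSql(spark, df).show()

    // How many users have more than one distinct email (1 for this sample data).
    println(viaApi(df).count())

    spark.stop()
  }
}
```

Calling count() on the filtered result gives the number of users with more than one email, which is the figure the original SQL query was after, while show() lists the users themselves.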
