I have two datasets, as follows:
smoothieDs.show()
+-----------+------------+-----+
|smoothie_id|    smoothie|price|
+-----------+------------+-----+
|          1|    Tropical|   10|
|          2|Green veggie|   20|
+-----------+------------+-----+
and:
ingredientDs.show()
+------------+-------------+
|    smoothie|   ingredient|
+------------+-------------+
|    Tropical|        Mango|
|    Tropical|Passion fruit|
|Green veggie|     Cucumber|
|Green veggie|         Kiwi|
+------------+-------------+
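For reference, here is a minimal sketch that reproduces the two datasets (assuming a SparkSession named spark with its implicits in scope):
import spark.implicits._

val smoothieDs = Seq(
  (1, "Tropical", 10),
  (2, "Green veggie", 20)
).toDF("smoothie_id", "smoothie", "price")

val ingredientDs = Seq(
  ("Tropical", "Mango"),
  ("Tropical", "Passion fruit"),
  ("Green veggie", "Cucumber"),
  ("Green veggie", "Kiwi")
).toDF("smoothie", "ingredient")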
I want to join the two datasets so that I get the ingredient information for each smoothie whose price is lower than $15, but also keep the smoothies whose price is higher, filling in the string To be communicated for the ingredient field.
I tried:
smoothieDs.join(ingredientDs, "smoothie").filter(col("price").lt(15))
and it gives:
+-----------+-----+--------+-------------+
|smoothie_id|price|smoothie|   ingredient|
+-----------+-----+--------+-------------+
|          1|   10|Tropical|        Mango|
|          1|   10|Tropical|Passion fruit|
+-----------+-----+--------+-------------+
But my expected result should be:
+-----------+-----+------------+------------------+
|smoothie_id|price|    smoothie|        ingredient|
+-----------+-----+------------+------------------+
|          1|   10|    Tropical|             Mango|
|          1|   10|    Tropical|     Passion fruit|
|          2|   20|Green veggie|To be communicated|
+-----------+-----+------------+------------------+
Is it possible to achieve this using a join directly? If not, what is the best way to achieve it?
Answer
You can replace the ingredient based on the price after the join:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the 'price symbol syntax

smoothieDs.join(ingredientDs, "smoothie")
  // keep the real ingredient only for cheap smoothies, mask the rest
  .withColumn("ingredient", when('price.lt(15), 'ingredient).otherwise("To be communicated"))
  // all ingredient rows of an expensive smoothie collapse to the same value, so deduplicate
  .distinct()
  .show()
Output:
+------------+-----------+-----+------------------+
| smoothie|smoothie_id|price| ingredient|
+------------+-----------+-----+------------------+
|Green veggie| 2| 20|To be communicated|
| Tropical| 1| 10| Mango|
| Tropical| 1| 10| Passion fruit|
+------------+-----------+-----+------------------+
Edit: another option would be to filter the ingredient dataset first and then do the join. This avoids the distinct() but comes at the price of a second join. Depending on the data, this may or may not be faster.
smoothieDs.join(
    // keep only the ingredients whose smoothie costs less than 15
    ingredientDs.join(smoothieDs.filter('price.lt(15)), Seq("smoothie"), "left_semi"),
    Seq("smoothie"), "left_outer")
  // expensive smoothies now have a null ingredient; fill it in
  .na.fill("To be communicated", Seq("ingredient"))
  .show()
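As yet another option (an untested sketch on my part, not required for the above to work), the price check can be folded into the join condition itself, so a single left outer join suffices without distinct() or a second join:

smoothieDs.join(ingredientDs,
    smoothieDs("smoothie") === ingredientDs("smoothie") && smoothieDs("price").lt(15),
    "left_outer")
  // drop the duplicate join column coming from ingredientDs
  .drop(ingredientDs("smoothie"))
  .na.fill("To be communicated", Seq("ingredient"))
  .show()

Rows whose price is 15 or more simply find no match, so their ingredient comes back null and is filled in afterwards.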