Spark SQL: keep a non-key row after join

I have two datasets, as follows:

and:

I want to join the two datasets so that I get the ingredient information for each smoothie whose price is lower than $15, but also keep the smoothies whose price is higher, filling in the ingredient field with the string "To be communicated".

I tried smoothieDs.join(ingredientDs).filter(col("price").lt(15)), which gives:

But my expected result should be:

Is it possible to achieve this with a join directly? If not, what is the best way to achieve it?


Answer

You can replace the ingredient based on the price after the join:
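A minimal sketch of that idea, assuming both datasets share an "id" key and the columns are named "smoothie", "price" and "ingredient" (the sample data is not shown above, so adjust the names to your schema):

```scala
import org.apache.spark.sql.functions.{col, lit, when}

val result = smoothieDs
  .join(ingredientDs, Seq("id"))
  // keep the real ingredient only for cheap smoothies,
  // otherwise substitute the placeholder string
  .withColumn(
    "ingredient",
    when(col("price").lt(15), col("ingredient"))
      .otherwise(lit("To be communicated"))
  )
  // an expensive smoothie joined to several ingredients now yields
  // several identical placeholder rows, so collapse them
  .distinct()
```

The distinct() is what keeps an expensive smoothie from appearing once per original ingredient after all its ingredient values have been replaced by the same placeholder.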

Output:

Edit: another option would be to filter the ingredient dataset first and then do the join. This avoids the distinct but comes at the price of a second join. Depending on the data, this may or may not be faster.
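Under the same assumed schema, that variant might look like this: a semi-join restricts the ingredients to those belonging to cheap smoothies, and a left outer join plus coalesce fills in the placeholder for the rest.

```scala
import org.apache.spark.sql.functions.{coalesce, col, lit}

// keep only the ingredients of smoothies cheaper than $15
// (this needs a first join to see the price)
val cheapIngredients = ingredientDs
  .join(smoothieDs.filter(col("price").lt(15)), Seq("id"), "left_semi")

// left-join them back; expensive smoothies get a null ingredient,
// which coalesce replaces with the placeholder string
val result = smoothieDs
  .join(cheapIngredients, Seq("id"), "left_outer")
  .withColumn("ingredient", coalesce(col("ingredient"), lit("To be communicated")))
```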

User contributions licensed under: CC BY-SA