
Including null values in an Apache Spark Join

I would like to include null values in an Apache Spark join. By default Spark drops rows whose join keys are null, since SQL equality never matches NULL against NULL.

Here is the default Spark behavior.
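
The original snippet was lost in this copy, so here is a minimal Scala sketch of the setup, assuming two sample DataFrames, numbersDf and lettersDf, joined on a shared numbers column (names and data are illustrative):

```scala
import spark.implicits._  // assumes an active SparkSession named `spark`

val numbersDf = Seq("123", "456", null, "").toDF("numbers")

val lettersDf = Seq(
  ("123", "abc"),
  ("456", "def"),
  (null, "zzz"),
  ("", "hhh")
).toDF("numbers", "letters")

// Plain equi-join on the shared column name; null keys never match
val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
```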

Here is the output of joinedDf.show():
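
With the sample data sketched above, the row with the null key is silently dropped:

```
+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|       |    hhh|
+-------+-------+
```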

This is the output I would like:
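
That is, the same result but with the null keys matched to each other:

```
+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|       |    hhh|
|   null|    zzz|
+-------+-------+
```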


Answer

Spark provides a special NULL-safe equality operator, <=>:
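
A sketch against the sample DataFrames from the question; because <=> takes an explicit join expression rather than a list of column names, the duplicate join column is dropped afterwards:

```scala
val joinedDf = numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .drop(lettersDf("numbers"))
```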

Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111: Fast null-safe join).

In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
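
A minimal PySpark sketch, assuming equivalents of the same sample DataFrames:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

numbers_df = spark.createDataFrame(
    [("123",), ("456",), (None,), ("",)], ["numbers"]
)
letters_df = spark.createDataFrame(
    [("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")],
    ["numbers", "letters"],
)

# eqNullSafe treats two NULL keys as equal, so the (None, "zzz") row survives
joined_df = numbers_df.join(
    letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers)
)
```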

and %<=>% in SparkR:
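
And a SparkR sketch with the same illustrative data (NA in the R data frame becomes NULL in Spark):

```r
library(SparkR)

numbers_df <- createDataFrame(data.frame(
  numbers = c("123", "456", NA, ""),
  stringsAsFactors = FALSE
))
letters_df <- createDataFrame(data.frame(
  numbers = c("123", "456", NA, ""),
  letters = c("abc", "def", "zzz", "hhh"),
  stringsAsFactors = FALSE
))

# %<=>% is SparkR's null-safe equality operator
joined_df <- join(numbers_df, letters_df,
                  numbers_df$numbers %<=>% letters_df$numbers)
```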

With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
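
For example, assuming the two DataFrames have been registered as temporary views named numbers and letters (view names are illustrative):

```sql
SELECT *
FROM numbers
JOIN letters
  ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
```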

This can be used with the DataFrame API as well:
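
One way, sketched here, is to alias the sample DataFrames and pass the same predicate as a SQL expression string:

```scala
import org.apache.spark.sql.functions.expr

val joinedDf = numbersDf.alias("numbers")
  .join(
    lettersDf.alias("letters"),
    expr("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
  )
```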
