How to join two Hive tables with an embedded array of structs and arrays in PySpark

I am trying to join two Hive tables on Databricks.

tab1:

The schema of “some_questions”

“some_questions” example:

tab2:

I need to join tab1 and tab2 on “question_id” so that I get a new table

I am trying to join them with PySpark, but I am not sure how to decompose the array with its embedded struct/array.

Thanks.

Answer

In Spark SQL, you can use either exists:

or array_contains:

with PySpark syntax:

User contributions licensed under: CC BY-SA