
How to join two Hive tables with an embedded array-of-struct column in PySpark

I am trying to join two Hive tables on Databricks.

tab1:

consumer_id (string)   question_id (string)             some_questions
"reghvsdvwe"           "rvsvbetvs-dvewdqwavd-363tr13r"  (contents shown below)

The schema of “some_questions”:

 array<struct<question_id:string, answers:array<struct<answer_id:string, date_made:timestamp, updated:timestamp>>>>

“some_questions” example:

 0:
   question_id: "rvsvbetvs-dvewdqwavd-363tr13r"
   answers:
     0: {"answer_id": "4363r23-46745y3-2er296", "date_made": "2006-11-02T00:00:00.000+0000", "updated": "2006-12-01T00:00:00.000+0000"}

 1:
   question_id: "rthdcva45-3t342r34y-vdvsdvds"
   answers:
     0: {"answer_id": "eewgrg-2353t3-thetber", "date_made": "2006-05-12T00:00:00.000+0000", "updated": "2006-05-12T00:00:00.000+0000"}

tab2:

   question_id (string)             answer_id (string)        question_contents                answer_contents
   "rvsvbetvs-dvewdqwavd-363tr13r"  "4363r23-46745y3-2er296"  "what do you like the food?"     "smell is good"
   "rthdcva45-3t342r34y-vdvsdvds"   "eewgrg-2353t3-thetber"   "how do you enjoy the travel ?"  "too much traffic in rush hour"

I need to join tab1 and tab2 on “question_id” so that I get a new table:

 consumer_id   question_id                      question_content                 answer_content
 "reghvsdvwe"  "rvsvbetvs-dvewdqwavd-363tr13r"  "what do you like the food?"     "smell is good"
 "reghvsdvwe"  "rthdcva45-3t342r34y-vdvsdvds"   "how do you enjoy the travel ?"  "too much traffic in rush hour"

I tried to join them with PySpark, but I am not sure how to decompose the array with the embedded struct/array.

thanks


Answer

For Spark SQL, you can use either exists:

spark.sql("""
  SELECT t1.consumer_id, t2.answer_id, t2.question_contents, t2.answer_contents 
  FROM tab1 as t1
  JOIN tab2 as t2 ON exists(t1.some_questions, x -> x.question_id=t2.question_id)
""").show()
+-----------+--------------------+--------------------+--------------------+
|consumer_id|           answer_id|   question_contents|     answer_contents|
+-----------+--------------------+--------------------+--------------------+
| reghvsdvwe|4363r23-46745y3-2...|what do you like ...|       smell is good|
| reghvsdvwe|eewgrg-2353t3-the...|how do you enjoy ...|too much traffic ...|
+-----------+--------------------+--------------------+--------------------+
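For reference, exists(array, func) is a Spark higher-order function that returns true when the lambda holds for at least one element of the array, which is why each tab2 row matches the tab1 row whose nested array contains its question_id. A plain-Python sketch of that semantics (the dict rows are made up to mirror the schema above; dicts stand in for structs):

```python
# Plain-Python sketch of Spark's exists(array, lambda) join predicate,
# using made-up rows shaped like tab1.some_questions.
some_questions = [
    {"question_id": "rvsvbetvs-dvewdqwavd-363tr13r",
     "answers": [{"answer_id": "4363r23-46745y3-2er296"}]},
    {"question_id": "rthdcva45-3t342r34y-vdvsdvds",
     "answers": [{"answer_id": "eewgrg-2353t3-thetber"}]},
]

def exists(arr, predicate):
    # Spark's exists(expr, func) is true if func holds for any element.
    return any(predicate(x) for x in arr)

# Equivalent of the join condition
# exists(t1.some_questions, x -> x.question_id = t2.question_id):
print(exists(some_questions,
             lambda x: x["question_id"] == "rvsvbetvs-dvewdqwavd-363tr13r"))  # True
print(exists(some_questions,
             lambda x: x["question_id"] == "no-such-id"))  # False
```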

or array_contains:

spark.sql(""" 
  SELECT t1.consumer_id, t2.answer_id, t2.question_contents, t2.answer_contents 
  FROM tab1 as t1 
  JOIN tab2 as t2 ON array_contains(t1.some_questions.question_id, t2.question_id)
""").show()

with PySpark syntax:

from pyspark.sql.functions import expr
df_new = tab1.alias('t1').join(
  tab2.alias('t2'), 
  expr("array_contains(t1.some_questions.question_id, t2.question_id)")
).select('t1.consumer_id', 't2.question_id', 't2.question_contents', 't2.answer_contents')