Skip to content
Advertisement

Order of the tables in a JOIN

In spark-sql I have a query that uses several tables (both large & small) in Joins. My question is – does the order of these tables matter with respect to query performance ? For e.g.

select larger.col1, smaller.col2
from larger
Left Join smaller
on larger.key = smaller.key
Left Join larger2
on larger2.key2 = smaller.key2

I have searched online but didn’t get a conclusive answer. So, will the performance vary if I change the order of the left vs right tables ?

Advertisement

Answer

The join order seems to be changed for optimization by Spark.

There could be :

  • Reorder JOIN optimizer
  • Reorder JOIN optimizer – star schema
  • Reorder JOIN optimizer – cost based optimization

The following appears to shed some light on this topic:

https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer-star-schema/read https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer/read https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer-cost-based-optimization/read

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement