In spark-sql I have a query that uses several tables (both large & small) in Joins. My question is – does the order of these tables matter with respect to query performance ?
For e.g.
select larger.col1, smaller.col2 from larger Left Join smaller on larger.key = smaller.key Left Join larger2 on larger2.key2 = smaller.key2
I have searched online but didn’t get a conclusive answer. So, will the performance vary if I change the order of the left vs right tables ?
Advertisement
Answer
The join order seems to be changed for optimization by Spark.
There could be :
- Reorder JOIN optimizer
- Reorder JOIN optimizer – star schema
- Reorder JOIN optimizer – cost based optimization
The following appears to shed some light on this topic:
https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer-star-schema/read https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer/read https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer-cost-based-optimization/read