For example,
-- use WHERE SELECT a.id FROM table_a as a, table_b as b WHERE a.id = b.id; -- use JOIN SELECT t1.id FROM ( SELECT id FROM table_a as a ) t1 JOIN ( SELECT id FROM table_b as b ) t2 ON t1.id = t2.id
What is difference between where
and join
in Hive SQL when joining two tables?
Advertisement
Answer
Join like this
FROM table_a as a, table_b as b WHERE a.id = b.id;
is a bad practice because in general, WHERE is being applied after join and transforming it to JOIN and pushing predicates is upon optimizer, to convert it to proper join and avoid CROSS join (join without ON condition).
Always use explicit JOIN with ON
condition when possible, in such way the optimizer will know for sure it is a JOIN condition, also it is ANSI syntax and it is easier to understand.
For not-equi join conditions like a.date between b.start and b.end
it is not possible to use in ON condition, in this case they can be moved to the WHERE. In such case if you do not have other conditions in ON condition, cross join will be used, and after that WHERE filter applied, such join can extremely duplicate data before WHERE filter and cause performance degradation. So, always use explicit ANSI JOIN with ON conditions when possible, always use all equality conditions in the ON and only non-equi conditions in the WHERE if not possible to use them in the ON. Keep join conditions in the ON
and only filters in the WHERE
. Optimizer will push filters to the JOIN or before join when possible but better do not rely on optimizer only, write good ANSI sql which is easy to understand and port to another database if needed.
The difference in plan you can check using EXPLAIN command.