Skip to content
Advertisement

What is difference between where and join in Hive SQL when joining two tables?

For example,

-- use WHERE
SELECT
    a.id
FROM
    table_a as a,
    table_b as b
WHERE
    a.id = b.id;

-- use JOIN
SELECT
    t1.id
FROM
    (
        SELECT
            id
        FROM
            table_a as a
    ) t1
    JOIN (
        SELECT
            id
        FROM
            table_b as b
    ) t2 ON t1.id = t2.id

What is difference between where and join in Hive SQL when joining two tables?

Advertisement

Answer

Join like this

FROM
    table_a as a,
    table_b as b
WHERE
    a.id = b.id;

is a bad practice because in general, WHERE is being applied after join and transforming it to JOIN and pushing predicates is upon optimizer, to convert it to proper join and avoid CROSS join (join without ON condition).

Always use explicit JOIN with ON condition when possible, in such way the optimizer will know for sure it is a JOIN condition, also it is ANSI syntax and it is easier to understand.

For not-equi join conditions like a.date between b.start and b.end it is not possible to use in ON condition, in this case they can be moved to the WHERE. In such case if you do not have other conditions in ON condition, cross join will be used, and after that WHERE filter applied, such join can extremely duplicate data before WHERE filter and cause performance degradation. So, always use explicit ANSI JOIN with ON conditions when possible, always use all equality conditions in the ON and only non-equi conditions in the WHERE if not possible to use them in the ON. Keep join conditions in the ON and only filters in the WHERE. Optimizer will push filters to the JOIN or before join when possible but better do not rely on optimizer only, write good ANSI sql which is easy to understand and port to another database if needed.

The difference in plan you can check using EXPLAIN command.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement