Is there a performance difference for applying the where condition to a subquery data source compared to applying it at the joined statement? Is there a difference between these in performance?
Let’s say I have two hive tables A and B which are both partitioned on the field date.
WITH table_A AS (
SELECT date, some_info FROM some_table WHERE date = '2020-12-11'
),
table_B AS (
SELECT date, some_other_info FROM some_other_table WHERE date = '2020-12-11'
)
SELECT table_A.*, table_B.some_other_info FROM table_A INNER JOIN table_B
Is that query’s performance the same as the following?
WITH table_A AS (
SELECT table_A, some_info FROM some_table
),
table_B AS (
SELECT date, some_other_info FROM some_other_table
)
SELECT table_A.*, table_B.some_other_info FROM table_A INNER JOIN table_B WHERE table_A.date = '2020-12-11'
Advertisement
Answer
The answer is: it depends. That said, I’m a fan of putting the filtering as early as possible in the processing. As a general rule, it can’t hurt.
What does it depend on? Well is the CTE materialized? That is, is it saved to an intermediate “table”? This, alas, is controlled by a setting hive.optimize.cte.materialize.threshold
. If the CTE is materialized, then you definitely want it filtered in the CTE>
On the other hand, materialization might lose other beneficial information about the original data — such as partitioning schemes. So, once again, it depends.
I do think that a CTE referenced only once is not materialized with the default settings. So, in that context, it doesn’t make a difference.