Is there a performance difference for applying the where condition to a subquery data source compared to applying it at the joined statement? Is there a difference between these in performance?
Let’s say I have two hive tables A and B which are both partitioned on the field date.
WITH table_A AS ( SELECT date, some_info FROM some_table WHERE date = '2020-12-11' ), table_B AS ( SELECT date, some_other_info FROM some_other_table WHERE date = '2020-12-11' ) SELECT table_A.*, table_B.some_other_info FROM table_A INNER JOIN table_B
Is that query’s performance the same as the following?
WITH table_A AS ( SELECT table_A, some_info FROM some_table ), table_B AS ( SELECT date, some_other_info FROM some_other_table ) SELECT table_A.*, table_B.some_other_info FROM table_A INNER JOIN table_B WHERE table_A.date = '2020-12-11'
Advertisement
Answer
The answer is: it depends. That said, I’m a fan of putting the filtering as early as possible in the processing. As a general rule, it can’t hurt.
What does it depend on? Well is the CTE materialized? That is, is it saved to an intermediate “table”? This, alas, is controlled by a setting hive.optimize.cte.materialize.threshold
. If the CTE is materialized, then you definitely want it filtered in the CTE>
On the other hand, materialization might lose other beneficial information about the original data — such as partitioning schemes. So, once again, it depends.
I do think that a CTE referenced only once is not materialized with the default settings. So, in that context, it doesn’t make a difference.