Skip to content
Advertisement

Performance difference with Where condition in subquery/cte

Is there a performance difference for applying the where condition to a subquery data source compared to applying it at the joined statement? Is there a difference between these in performance?

Let’s say I have two hive tables A and B which are both partitioned on the field date.

WITH table_A AS (
  SELECT date, some_info FROM some_table WHERE date = '2020-12-11'
),
table_B AS (
  SELECT date, some_other_info FROM some_other_table WHERE date = '2020-12-11'
)

SELECT table_A.*, table_B.some_other_info FROM table_A INNER JOIN table_B

Is that query’s performance the same as the following?

WITH table_A AS (
  SELECT table_A, some_info FROM some_table
),
table_B AS (
  SELECT date, some_other_info FROM some_other_table 
)

SELECT table_A.*, table_B.some_other_info FROM table_A INNER JOIN table_B WHERE table_A.date = '2020-12-11'

Advertisement

Answer

The answer is: it depends. That said, I’m a fan of putting the filtering as early as possible in the processing. As a general rule, it can’t hurt.

What does it depend on? Well is the CTE materialized? That is, is it saved to an intermediate “table”? This, alas, is controlled by a setting hive.optimize.cte.materialize.threshold. If the CTE is materialized, then you definitely want it filtered in the CTE>

On the other hand, materialization might lose other beneficial information about the original data — such as partitioning schemes. So, once again, it depends.

I do think that a CTE referenced only once is not materialized with the default settings. So, in that context, it doesn’t make a difference.

User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement