Skip to content
Advertisement

Performance difference with Where condition in subquery/cte

Is there a performance difference for applying the where condition to a subquery data source compared to applying it at the joined statement? Is there a difference between these in performance?

Let’s say I have two hive tables A and B which are both partitioned on the field date.

Is that query’s performance the same as the following?

Advertisement

Answer

The answer is: it depends. That said, I’m a fan of putting the filtering as early as possible in the processing. As a general rule, it can’t hurt.

What does it depend on? Well is the CTE materialized? That is, is it saved to an intermediate “table”? This, alas, is controlled by a setting hive.optimize.cte.materialize.threshold. If the CTE is materialized, then you definitely want it filtered in the CTE>

On the other hand, materialization might lose other beneficial information about the original data — such as partitioning schemes. So, once again, it depends.

I do think that a CTE referenced only once is not materialized with the default settings. So, in that context, it doesn’t make a difference.

User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement