Performance difference with Where condition in subquery/cte

Question

Is there a performance difference between applying the WHERE condition inside a subquery (or CTE) data source and applying it on the joined statement? Let's say I have two Hive tables, A and B, which are both partitioned on the field date. Is that query's performance the same as the following?

Accepted Answer

The answer is: it depends. That said, I'm a fan of putting the filtering as early as possible in the processing. As a general rule, it can't hurt.

What does it depend on? Well, is the CTE materialized? That is, is it saved to an intermediate "table"? This, alas, is controlled by the setting hive.optimize.cte.materialize.threshold. If the CTE is materialized, then you definitely want it filtered in the CTE.

On the other hand, materialization might lose other beneficial information about the original data, such as partitioning schemes. So, once again, it depends.

I do think that a CTE referenced only once is not materialized with the default settings. So, in that context, it doesn't make a difference.
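To make the comparison concrete, here is a rough HiveQL sketch of the two shapes being discussed. The original queries are not shown, so everything here is an assumption for illustration: the column names `id` and `val`, the CTE names, and the date literal are all hypothetical. Only the tables A and B and the partition field date come from the question.

```sql
-- Hypothetical schema: tables a and b each have columns id and val and are
-- partitioned on `date` (backticked because DATE is a reserved word in Hive).

-- Show the current materialization threshold for this session.
SET hive.optimize.cte.materialize.threshold;

-- Variant 1: filter inside the CTEs, before the join.
WITH a_filtered AS (
  SELECT id, val FROM a WHERE `date` = '2020-01-01'
),
b_filtered AS (
  SELECT id, val FROM b WHERE `date` = '2020-01-01'
)
SELECT a_filtered.id, a_filtered.val AS a_val, b_filtered.val AS b_val
FROM a_filtered
JOIN b_filtered ON a_filtered.id = b_filtered.id;

-- Variant 2: unfiltered CTEs, filter applied to the joined result.
WITH a_all AS (
  SELECT id, val, `date` FROM a
),
b_all AS (
  SELECT id, val, `date` FROM b
)
SELECT a_all.id, a_all.val AS a_val, b_all.val AS b_val
FROM a_all
JOIN b_all ON a_all.id = b_all.id
WHERE a_all.`date` = '2020-01-01'
  AND b_all.`date` = '2020-01-01';
```

Prefixing each statement with EXPLAIN and comparing the plans is one way to see whether the filter ends up on the table scans in both variants on your Hive version and settings.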
