Maximizing query performance by joining with XML

While working on query performance optimisation, I noticed that the pattern below outperforms by a wide margin other, more obvious, ways of writing the same query. After looking at the execution plans, it appears this is due to parallelism.

The table MyTable, has a clustered primary key on (Identifier, MyId, date). The @xml variable usually contains tens of entries and data returned is a few hundred thousand rows.

Is there a way to achieve parallelism without using the XML or is this a standard pattern/trick?

SET QUOTED_IDENTIFIER ON;
DECLARE @xml xml;
SET @xml = '<recipe MyId="3654969" Identifier="foo1" StartDate="12-Dec-2017 00:00:00" EndDate="09-Jan-2018 23:59:59"/>
<recipe MyId="3670306" Identifier="foo2" StartDate="10-Jan-2018 00:00:00" EndDate="07-Feb-2018 23:59:59"/>
';

exec sp_executesql N'
SELECT date, val
FROM MyTable tbl
inner join (    
    SELECT t.data.value(''@MyId'', ''int'') AS xmlMyId,
           t.data.value(''@StartDate'', ''datetime'') AS xmlStartDate,
           t.data.value(''@EndDate'', ''datetime'') AS xmlEndDate,
           t.data.value(''@Identifier'', ''varchar(32)'') as xmlIdentifier
    FROM @queryXML.nodes(''/recipe'') t(data) ) cont
ON tbl.MyId = cont.xmlMyId
AND tbl.date >= cont.xmlStartDate
AND tbl.date <= cont.xmlEndDate
WHERE Identifier = cont.xmlIdentifier
ORDER BY date', N'@queryXML xml',@xml;

​x
 
SET QUOTED_IDENTIFIER ON;DECLARE @xml xml;SET @xml = '<recipe MyId="3654969" Identifier="foo1" StartDate="12-Dec-2017 00:00:00" EndDate="09-Jan-2018 23:59:59"/><recipe MyId="3670306" Identifier="foo2" StartDate="10-Jan-2018 00:00:00" EndDate="07-Feb-2018 23:59:59"/>';​exec sp_executesql N'SELECT date, valFROM MyTable tblinner join (        SELECT t.data.value(''@MyId'', ''int'') AS xmlMyId,           t.data.value(''@StartDate'', ''datetime'') AS xmlStartDate,           t.data.value(''@EndDate'', ''datetime'') AS xmlEndDate,           t.data.value(''@Identifier'', ''varchar(32)'') as xmlIdentifier    FROM @queryXML.nodes(''/recipe'') t(data) ) contON tbl.MyId = cont.xmlMyIdAND tbl.date >= cont.xmlStartDateAND tbl.date <= cont.xmlEndDateWHERE Identifier = cont.xmlIdentifierORDER BY date', N'@queryXML xml',@xml;​

For example, the stored procedure below which returns the same data severely underperforms the query above (parameters for stored proc are passed in and the whole thing is executed using sp_executesql).

SELECT tbl.date, val
FROM marketdb.dbo.MyTable tbl
INNER JOIN @MyIds ids ON tbl.MyId = ids.MyId
    AND (ids.StartDate IS NULL or (ids.StartDate IS NOT NULL AND ids.StartDate <= tbl.date))
    AND (ids.EndDate IS NULL or (ids.EndDate IS NOT NULL AND tbl.date <= ids.EndDate))
WHERE tbl.Identifier in (SELECT Identifier FROM @identifier_list)  AND date >= @start_date AND date <= @end_date

 
SELECT tbl.date, valFROM marketdb.dbo.MyTable tblINNER JOIN @MyIds ids ON tbl.MyId = ids.MyId    AND (ids.StartDate IS NULL or (ids.StartDate IS NOT NULL AND ids.StartDate <= tbl.date))    AND (ids.EndDate IS NULL or (ids.EndDate IS NOT NULL AND tbl.date <= ids.EndDate))WHERE tbl.Identifier in (SELECT Identifier FROM @identifier_list)  AND date >= @start_date AND date <= @end_date​

The actual execution plan of the XML query is shown below.

Answer

As Jeroen Mostert said, table variables do not have statistics and the actual execution plan is not optimal. In my case, the xml version of the query was parallelised whereas the stored proc was not (this is what I mean by the execution plan not being optimal).

A way to help the optimiser is to add an appropriate primary key or an index on the table variables. One can also create statistics on the table columns in question but in the SQL server that I am using, table variables do not support statistics.

Having added an index on all columns in the table variables, the optimiser started parallelising the query and the execution speed was greatly improved.

Advertisement

Answer