While working on query performance optimisation, I noticed that the pattern below outperforms by a wide margin other, more obvious, ways of writing the same query. After looking at the execution plans, it appears this is due to parallelism.
The table MyTable
, has a clustered primary key on (Identifier, MyId, date)
. The @xml
variable usually contains tens of entries and data returned is a few hundred thousand rows.
Is there a way to achieve parallelism without using the XML or is this a standard pattern/trick?
SET QUOTED_IDENTIFIER ON; DECLARE @xml xml; SET @xml = '<recipe MyId="3654969" Identifier="foo1" StartDate="12-Dec-2017 00:00:00" EndDate="09-Jan-2018 23:59:59"/> <recipe MyId="3670306" Identifier="foo2" StartDate="10-Jan-2018 00:00:00" EndDate="07-Feb-2018 23:59:59"/> '; exec sp_executesql N' SELECT date, val FROM MyTable tbl inner join ( SELECT t.data.value(''@MyId'', ''int'') AS xmlMyId, t.data.value(''@StartDate'', ''datetime'') AS xmlStartDate, t.data.value(''@EndDate'', ''datetime'') AS xmlEndDate, t.data.value(''@Identifier'', ''varchar(32)'') as xmlIdentifier FROM @queryXML.nodes(''/recipe'') t(data) ) cont ON tbl.MyId = cont.xmlMyId AND tbl.date >= cont.xmlStartDate AND tbl.date <= cont.xmlEndDate WHERE Identifier = cont.xmlIdentifier ORDER BY date', N'@queryXML xml',@xml;
For example, the stored procedure below which returns the same data severely underperforms the query above (parameters for stored proc are passed in and the whole thing is executed using sp_executesql).
SELECT tbl.date, val FROM marketdb.dbo.MyTable tbl INNER JOIN @MyIds ids ON tbl.MyId = ids.MyId AND (ids.StartDate IS NULL or (ids.StartDate IS NOT NULL AND ids.StartDate <= tbl.date)) AND (ids.EndDate IS NULL or (ids.EndDate IS NOT NULL AND tbl.date <= ids.EndDate)) WHERE tbl.Identifier in (SELECT Identifier FROM @identifier_list) AND date >= @start_date AND date <= @end_date
The actual execution plan of the XML query is shown below.
See also:
sp_executesql is slow with parameters
SQL Server doesn’t have the statistics for the table variable?
Advertisement
Answer
As Jeroen Mostert said, table variables do not have statistics and the actual execution plan is not optimal. In my case, the xml version of the query was parallelised whereas the stored proc was not (this is what I mean by the execution plan not being optimal).
A way to help the optimiser is to add an appropriate primary key or an index on the table variables. One can also create statistics on the table columns in question but in the SQL server that I am using, table variables do not support statistics.
Having added an index on all columns in the table variables, the optimiser started parallelising the query and the execution speed was greatly improved.