I have a situation where the estimated number of rows in the execution plan is way off
My columns in the join are varchar(50)
. I have tried different indexes but it does not reduce this problem. I have even tried with an index on the temp table. What else can I do?
PS this is the first place where the estimated number starts to drift… Also the tables are not big (48000 rows).
The code is:
SELECT DISTINCT householdnumber, householdid, primaryCustomerID INTO #Households FROM TableA SELECT A.*, MIN(B.[ProfileCreatedDate]) PROFILECREATEDDATE INTO #Profile from #Households AS a LEFT JOIN TableA AS B ON A.[HouseholdNumber]= B.[HouseholdNumber] and A.[HouseholdId]=B.[HouseholdId] GROUP BY a.householdnumber, a.householdid, a.primaryCustomerID;
I know it seems that this can be rewritten as:
SELECT householdnumber, householdid, primaryCustomerID, MIN([ProfileCreatedDate]) AS PROFILECREATEDDATE INTO #Profile2 from TableA GROUP BY householdnumber, householdid, primaryCustomerID;
But the results are not identical and I don’t want to change the results since I am not sure if the creator of this code knew what they were doing.
Some statistics on the columns:
householdnumber
is always equal to householdid
. householdid
is nvarchar(50)
but householdnumber
is varchar(40)
. The table has 48877 rows. Distinct combination of householdnumber, householdid, primaryCustomerID
has 48029 rows. And distinct number of primaryCustomerID
is 47152.
Advertisement
Answer
Regarding the code – it appears that the difference between the larger (original) version and your simpler GROUP BY version is that the original finds the minimum profilecreateddate
for anyone in that household, whereas your simpler version finds the profilecreateddate
for the specific primarycustomerid
.
For example (using simpler data)
CREATE TABLE #TableA (householdnumber int, householdid int, primaryCustomerID int, ProfileCreatedDate datetime); INSERT INTO #TableA (householdnumber, householdid, primaryCustomerID, ProfileCreatedDate) VALUES (1, 1, 1, '20201001'), (1, 1, 1, '20201002'), (1, 1, 2, '20201003'); SELECT DISTINCT householdnumber, householdid, primaryCustomerID INTO #Households FROM #TableA; SELECT A.*, MIN(B.[ProfileCreatedDate]) PROFILECREATEDDATE INTO #Profile from #Households AS a LEFT JOIN #TableA AS B ON A.[HouseholdNumber]= B.[HouseholdNumber] and A.[HouseholdId]=B.[HouseholdId] GROUP BY a.householdnumber, a.householdid, a.primaryCustomerID; SELECT * FROM #Profile; /* -- Results householdnumber householdid primaryCustomerID PROFILECREATEDDATE 1 1 1 2020-10-01 00:00:00.000 1 1 2 2020-10-01 00:00:00.000 */ SELECT householdnumber, householdid, primaryCustomerID, MIN([ProfileCreatedDate]) AS PROFILECREATEDDATE INTO #Profile2 from #TableA GROUP BY householdnumber, householdid, primaryCustomerID; SELECT * FROM #Profile2; /* -- Results householdnumber householdid primaryCustomerID PROFILECREATEDDATE 1 1 1 2020-10-01 00:00:00.000 1 1 2 2020-10-03 00:00:00.000 */
If you notice in the above, the PROFILECREATEDATE for row 2 is different.
You could therefore try the following code that should give the same results as the original set – see how that goes for time (and confirm it matches the original results).
SELECT DISTINCT t1.householdnumber, t1.householdid, primaryCustomerID, MIN([ProfileCreatedDate]) OVER (PARTITION BY t1.householdnumber, t1.householdid) AS PROFILECREATEDDATE INTO #Profile3 FROM #TableA t1; SELECT * FROM #Profile3; /* -- Results householdnumber householdid primaryCustomerID PROFILECREATEDDATE 1 1 1 2020-10-01 00:00:00.000 1 1 2 2020-10-01 00:00:00.000 */