I have two tables:
CREATE TABLE Employee ( Site ???? ????, WorkTypeId char(2) NOT NULL, Emp_NO int NOT NULL, "Date" ???? NOT NULL ); CREATE TABLE PTO ( Site ???? ????, WorkTypeId char(2) NULL, Emp_NO int NOT NULL, "Date" ???? NOT NULL );
I would like to update values in PTO
‘s WorkTypeId
column:
EMP NO
inEmployee
(the lookup table) andPTO
should match.- A single
WorkTypeId
value should be picked from only the first occurrence of the month.
For example, given this sample input data:
TABLE Employee
:
Site | WorkTypeId | Emp_NO | Date |
---|---|---|---|
5015 | MB | 1005 | 2022-02-01 |
5015 | MI | 1005 | 2022-02-04 |
5015 | PO | 1005 | 2022-02-04 |
5015 | ME | 2003 | 2022-01-01 |
5015 | TT | 2003 | 2022-01-10 |
TABLE PTO
:
Site | WorkTypeId | Emp_NO | Date |
---|---|---|---|
5015 | 1005 | 2022-02-03 | |
5015 | 1005 | 2022-02-14 | |
5014 | 2003 | 2022-01-09 |
For example:
- Given
Employee
withEmp_NO = 1005
…- …there are 3 rows for that
Emp_NO
in theEmployee
table, with 3 distinctWorkTypeId
values, but differingDate
values. - So pick the
WorkTypeId
value for the earliestDate
(2022-02-01), which is'MB'
- So
Emp_NO
getsWorkTypeId = 'MB'
. - And use that single value to fill
1005
‘sWorkTypeId
cells in thePTO
table. - But also match by month.
- …there are 3 rows for that
So the expected output in the PTO
table is
Site | WorkTypeId | Emp_NO | Date |
---|---|---|---|
5015 | MB | 1005 | 2022-02-03 |
5015 | MB | 1005 | 2022-02-14 |
5014 | ME | 2003 | 2022-01-09 |
Advertisement
Answer
Getting a value from a column different to the column used in a MIN
/MAX
expression in a GROUP BY
query still remains a surprisingly difficult thing to do in SQL, and while modern versions of the SQL language (and SQL Server) make it easier, they’re completely non-obvious and counter-intuitive to most people as it necessarily involves more advanced topics like CTEs, derived-tables (aka inner-queries), self-joins and windowing-functions despite the conceptually simple nature of the query.
Anyway, as-ever in modern SQL, there’s usually 3 or 4 different ways to accomplish the same task, with a few gotchas.
Preface:
As
Site
,Date
,Year
, andMonth
are all keywords in T-SQL, I’ve escaped them with double-quotes, which is the ISO/ANSI SQL Standards compliant way to escape reserved words.- SQL Server supports this by default. If (for some ungodly reason) you have
SET QUOTED IDENTIFIER OFF
then change the double-quotes to square-brackets:[]
- SQL Server supports this by default. If (for some ungodly reason) you have
I assume that the
Site
column in both tables is just a plain’ ol’ data column, as such:- It is not a
PRIMARY KEY
member column. - It should not be used as a
GROUP BY
. - It should not be used in a
JOIN
predicate.
- It is not a
All of the approaches below assume this database state:
CREATE TABLE "Employee" ( "Site" int NOT NULL, WorkTypeId char(2) NOT NULL, Emp_NO int NOT NULL, "Date" date NOT NULL ); CREATE TABLE "PTO" ( "Site" int NOT NULL, WorkTypeId char(2) NULL, Emp_NO int NOT NULL, "Date" date NOT NULL ); GO INSERT INTO "Employee" ( "Site", WorkTypeId, Emp_NO, "Date" ) VALUES ( 5015, 'MB', 1005, '2022-02-01' ), ( 5015, 'MI', 1005, '2022-02-04' ), ( 5015, 'PO', 1005, '2022-02-04' ), ( 5015, 'ME', 2003, '2022-01-01' ), ( 5015, 'TT', 2003, '2022-01-10' ); INSERT INTO "PTO" ( "Site", WorkTypeId, Emp_NO, "Date" ) VALUES ( 5015, NULL, 1005, '2022-02-03' ), ( 5015, NULL, 1005, '2022-02-14' ), ( 5014, NULL, 2003, '2022-01-09' );
- Both approaches define CTEs
e
andp
that extendEmployee
andPTO
respectively to add computed"Year"
and"Month"
columns, which avoids having to repeatedly useYEAR( "Date" ) AS "Year"
inGROUP BY
andJOIN
expressions.- I suggest you add those as computed-columns in your base tables, if you’re able, as they’ll be useful generally anyway. Don’t forget to index them appropriately too.
Approach 1: Composed CTEs with elementary aggregates, then UPDATE
:
WITH -- Step 1: Extend both the `Employee` and `PTO` tables with YEAR and MONTH columns (this simplifies things later on): e AS ( SELECT Emp_No, "Site", WorkTypeId, "Date", YEAR( "Date" ) AS "Year", MONTH( "Date" ) AS "Month" FROM Employee ), p AS ( SELECT Emp_No, "Site", WorkTypeId, "Date", YEAR( "Date" ) AS "Year", MONTH( "Date" ) AS "Month" FROM PTO ), -- Step 2: Get the MIN( "Date" ) value for each group: minDatesForEachEmployeeMonthYearGroup AS ( SELECT e.Emp_No, e."Year", e."Month", MIN( "Date" ) AS "FirstDate" FROM e GROUP BY e.Emp_No, e."Year", e."Month" ), -- Step 3: INNER JOIN back on `e` to get the first WorkTypeId in each group: firstWorkTypeIdForEachEmployeeMonthYearGroup AS ( /* WARNING: This query will fail if multiple rows (for the same Emp_NO, Year and Month) have the same "Date" value. This can be papered-over with GROUP BY and MIN, but I don't think that's a good idea at all). */ SELECT e.Emp_No, e."Year", e."Month", e.WorkTypeId AS FirstWorkTypeId FROM e INNER JOIN minDatesForEachEmployeeMonthYearGroup AS q ON e.Emp_NO = q.Emp_NO AND e."Date" = q.FirstDate ) -- Step 4: Do the UPDATE. -- *Yes*, you can UPDATE a CTE (provided the CTE is "simple" and has a 1:1 mapping back to source rows on-disk). UPDATE p SET p.WorkTypeId = f.FirstWorkTypeId FROM p INNER JOIN firstWorkTypeIdForEachEmployeeMonthYearGroup AS f ON p.Emp_No = f.Emp_No AND p."Year" = f."Year" AND p."Month" = f."Month" WHERE p.WorkTypeId IS NULL;
Here’s a screenshot of SSMS showing the contents of the PTO
table from before, and after, the above query runs:
Approach 2: Skip the self-JOIN
with FIRST_VALUE
:
This approach gives a shorter, slightly simpler query, but requires SQL Server 2012 or later (and that your database is running in compatibility-level 110 or higher).
Surprisingly, you cannot use FIRST_VALUE
in a GROUP BY
query, despite its obvious similarities with MIN
, but an equivalent query can be built with SELECT DISTINCT
:
WITH -- Step 1: Extend the `Employee` table with YEAR and MONTH columns: e AS ( SELECT Emp_No, "Site", WorkTypeId, "Date", YEAR( "Date" ) AS "Year", MONTH( "Date" ) AS "Month" FROM Employee ), firstWorkTypeIdForEachEmployeeMonthYearGroup AS ( SELECT DISTINCT e.Emp_No, e."Year", e."Month", FIRST_VALUE( WorkTypeId ) OVER ( PARTITION BY Emp_No, e."Year", e."Month" ORDER BY "Date" ASC ) AS FirstWorkTypeId FROM e ) -- Step 3: UPDATE PTO: UPDATE p SET p.WorkTypeId = f.FirstWorkTypeId FROM PTO AS p INNER JOIN firstWorkTypeIdForEachEmployeeMonthYearGroup AS f ON p.Emp_No = f.Emp_No AND YEAR( p."Date" ) = f."Year" AND MONTH( p."Date" ) = f."Month" WHERE p.WorkTypeId IS NULL;
Doing a SELECT * FROM PTO
after this runs gives me the exact same output as Approach 2.
Approach 2b, but made shorter:
Just so @SOS doesn’t feel too smug about their SQL being considerably more shorter than mine ð, the Approach 2 SQL above can be compacted down to this:
WITH empYrMoGroups AS ( SELECT DISTINCT e.Emp_No, YEAR( e."Date" ) AS "Year", MONTH( e."Date" ) AS "Month", FIRST_VALUE( e.WorkTypeId ) OVER ( PARTITION BY e.Emp_No, YEAR( e."Date" ), MONTH( e."Date" ) ORDER BY e."Date" ASC ) AS FirstWorkTypeId FROM Employee AS e ) UPDATE p SET p.WorkTypeId = f.FirstWorkTypeId FROM PTO AS p INNER JOIN empYrMoGroups AS f ON p.Emp_No = f.Emp_No AND YEAR( p."Date" ) = f."Year" AND MONTH( p."Date" ) = f."Month" WHERE p.WorkTypeId IS NULL;
- The execution-plans for both Approach 2 and Approach 2b are almost identical, excepting that Approach 2b has an extra Computed Scalar step for some reason.
- The execution plans for Approach 1 and Approach 2 are very different, however, with Approach 1 having more branches than Approach 2 despite their similar semantics.
- But my execution-plans won’t match yours because it’s very context-dependent, especially w.r.t. what indexes and PKs you have, and if there’s any other columns involved, etc.
Approach 1‘s plan looks like this:
Approach 2b‘s plan looks like this:
@SOS’s plan, for comparison, is a lot simpler… and I honestly don’t know why, but it does show how good SQL Server’s query optimizer is thesedays: