I’ve got a unique problem. I’m querying a replicated database table cost_plan_breakdown, and the replication is known to have some duplicates due to issues with deleting records. I’m not the Admin so I’m trying to sidestep these duplicates as efficiently as possible. The table looks like this:
sys_id | sys_created_on | cost_plan | breakdown_start_date |
---|---|---|---|
axr123 | 2020-10-01 09:31:15 | Outlook KTLO – Lisa Lymon | 10-01-2020 |
pqo100 | 2020-12-23 05:50:20 | Outlook KTLO – Lisa Lymon | 10-01-2020 |
cji985 | 2020-10-01 09:31:15 | Outlook KTLO – Lisa Lymon | 11-01-2020 |
twg795 | 2020-10-05 13:23:08 | DataPyramid CTB – Dave Dods | 10-01-2020 |
jqr820 | 2020-09-28 16:11:54 | Revoluccion CTB – Marcus Vance | 11-01-2020 |
vjo150 | 2021-01-13 11:10:09 | Server KTLO – Tom Smith | 10-01-2020 |
Cost Plans typically have between 1 and 12 breakdowns during their lifespan, but there should only be one breakdown per cost plan per month. Notice that the Outlook Cost Plan has two breakdowns within the same month (October) with differing sys_id and sys_created_on.
So by using a smaller subquery in the where clause, I’m trying to determine the following:
“Group the rows with identical month and year of breakdown_start_date, and identical cost_plan. Of the remaining rows, select the one with the MAX sys_created_on. Take the sys_id of that row and feed it to the parent query to only include these rows.”
...rest of query above WHERE cpb.breakdown_type = 'requirement' AND cpb.sys_id IN (SELECT cpb2.sys_id FROM cost_plan_breakdown cpb2 GROUP BY cpb2.name, YEAR(cpb2.start_date_time), MONTH(cpb2.start_date_time) HAVING MAX(cpb2.sys_created_on))
At this point, I’m running into the error
cpb2.sys_id is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I’ve previously semi-solved this by putting the MAX sys_created_on in the SELECT statement, and matching off that, but I realized that could pull in unwanted dupe records just because they match the sys_created_on of another.
I feel like the solution may be staring me in the face, but I’m stuck. Appreciate your help!
Advertisement
Answer
Use row_number
to number the duplicate rows and then exclude them. Ordering the row number by sys_created_on desc
ensures you get the latest of each per month.
declare @Test table (sys_id varchar(6), sys_created_on datetime2(0), cost_plan varchar(32), breakdown_start_date date); insert into @Test (sys_id, sys_created_on, cost_plan, breakdown_start_date) values ('axr123', '2020-10-01 09:31:15', 'Outlook KTLO - Lisa Lymon', '10-01-2020'), ('pqo100', '2020-12-23 05:50:20', 'Outlook KTLO - Lisa Lymon', '10-01-2020'), ('cji985', '2020-10-01 09:31:15', 'Outlook KTLO - Lisa Lymon', '11-01-2020'), ('twg795', '2020-10-05 13:23:08', 'DataPyramid CTB - Dave Dods', '10-01-2020'), ('jqr820', '2020-09-28 16:11:54', 'Revoluccion CTB - Marcus Vance', '11-01-2020'), ('vjo150', '2021-01-13 11:10:09', 'Server KTLO - Tom Smith', '10-01-2020'); with cte as ( select * , row_number() over (partition by cost_plan, datepart(year,breakdown_start_date), datepart(month,breakdown_start_date) order by sys_created_on desc) rn from @Test ) select * from cte where rn = 1;
As per your comments this (the CTE) is just a neat way to write a sub-query/derived table and can still be written as follows:
select * from ( select * , row_number() over (partition by cost_plan, datepart(year,breakdown_start_date), datepart(month,breakdown_start_date) order by sys_created_on desc) rn from @Test ) cte where rn = 1;
Note: If you provide DDL+DML as shown above you make it much easier for people to assist.