Skip to content
Advertisement

How to select a foreign key after narrowing down via Group By and Having in a subquery

I’ve got a unique problem. I’m querying a replicated database table cost_plan_breakdown, and the replication is known to have some duplicates due to issues with deleting records. I’m not the Admin so I’m trying to sidestep these duplicates as efficiently as possible. The table looks like this:

sys_id sys_created_on cost_plan breakdown_start_date
axr123 2020-10-01 09:31:15 Outlook KTLO – Lisa Lymon 10-01-2020
pqo100 2020-12-23 05:50:20 Outlook KTLO – Lisa Lymon 10-01-2020
cji985 2020-10-01 09:31:15 Outlook KTLO – Lisa Lymon 11-01-2020
twg795 2020-10-05 13:23:08 DataPyramid CTB – Dave Dods 10-01-2020
jqr820 2020-09-28 16:11:54 Revoluccion CTB – Marcus Vance 11-01-2020
vjo150 2021-01-13 11:10:09 Server KTLO – Tom Smith 10-01-2020

Cost Plans typically have between 1 and 12 breakdowns during their lifespan, but there should only be one breakdown per cost plan per month. Notice that the Outlook Cost Plan has two breakdowns within the same month (October) with differing sys_id and sys_created_on.

So by using a smaller subquery in the where clause, I’m trying to determine the following:

“Group the rows with identical month and year of breakdown_start_date, and identical cost_plan. Of the remaining rows, select the one with the MAX sys_created_on. Take the sys_id of that row and feed it to the parent query to only include these rows.”

...rest of query above    
WHERE cpb.breakdown_type = 'requirement'    
AND cpb.sys_id IN
              (SELECT cpb2.sys_id
               FROM cost_plan_breakdown cpb2
               GROUP BY cpb2.name,
                        YEAR(cpb2.start_date_time),
                        MONTH(cpb2.start_date_time)
               HAVING MAX(cpb2.sys_created_on))

At this point, I’m running into the error

cpb2.sys_id is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

I’ve previously semi-solved this by putting the MAX sys_created_on in the SELECT statement, and matching off that, but I realized that could pull in unwanted dupe records just because they match the sys_created_on of another.

I feel like the solution may be staring me in the face, but I’m stuck. Appreciate your help!

Advertisement

Answer

Use row_number to number the duplicate rows and then exclude them. Ordering the row number by sys_created_on desc ensures you get the latest of each per month.

declare @Test table (sys_id varchar(6), sys_created_on datetime2(0), cost_plan varchar(32), breakdown_start_date date);

insert into @Test (sys_id, sys_created_on, cost_plan, breakdown_start_date)
values
('axr123', '2020-10-01 09:31:15', 'Outlook KTLO - Lisa Lymon', '10-01-2020'),
('pqo100', '2020-12-23 05:50:20', 'Outlook KTLO - Lisa Lymon', '10-01-2020'),
('cji985', '2020-10-01 09:31:15', 'Outlook KTLO - Lisa Lymon', '11-01-2020'),
('twg795', '2020-10-05 13:23:08', 'DataPyramid CTB - Dave Dods', '10-01-2020'),
('jqr820', '2020-09-28 16:11:54', 'Revoluccion CTB - Marcus Vance', '11-01-2020'),
('vjo150', '2021-01-13 11:10:09', 'Server KTLO - Tom Smith', '10-01-2020');

with cte as (
    select *
        , row_number() over (partition by cost_plan, datepart(year,breakdown_start_date), datepart(month,breakdown_start_date) order by sys_created_on desc) rn
    from @Test
)
select *
from cte
where rn = 1;

As per your comments this (the CTE) is just a neat way to write a sub-query/derived table and can still be written as follows:

select *
from (
    select *
        , row_number() over (partition by cost_plan, datepart(year,breakdown_start_date), datepart(month,breakdown_start_date) order by sys_created_on desc) rn
    from @Test
) cte
where rn = 1;

Note: If you provide DDL+DML as shown above you make it much easier for people to assist.

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement