How can I obtain the primary key for a row associated with the minimum value in a subset specified in a group by?

I have been working on some database queries and am running into some issues with trying to obtain a primary key for a row where an aggregated value was obtained. I am using Postgresql v10.

Take for example the following table:

instructor_schedule_id	instructor_name	course_name	start_date
1	bob	databases	2015-01-01 00:00:00.000000 +00:00
2	bob	databases	2018-01-01 00:00:00.000000 +00:00
3	bob	databases	2024-01-01 00:00:00.000000 +00:00
4	alice	databases	2021-01-01 00:00:00.000000 +00:00
5	alice	databases	2022-01-01 00:00:00.000000 +00:00

What I would like to obtain, is the “current” value for each instructor/course combination, where current is determined by finding the most recent record from the past.

My desired result set would look like this:

instructor_schedule_id	instructor_name	course_name	start_date
2	bob	databases	2018-01-01 00:00:00.000000 +00:00
5	alice	databases	2022-01-01 00:00:00.000000 +00:00

As you can see, we have the most recent assignment to each instructor for each of their courses (only one course in this example, but there are typically many). bobs 2024 assignment hasn’t happened yet, and bobs 2015 assignment has expired since we now have a 2018 assignment. Same for alice.

Looking at this, what I WANT to do is:

select * 
from instructor_schedule 
where instructor_schedule_id in (select instructor_schedule_id 
                                 from 
                                     (select
                                          min(start_date),
                                          course_name,
                                          instructor_name,
                                          instructor_schedule_id
                                      from
                                          instructor_schedule
                                      group by
                                          course_name, instructor_name)  inner);

​x
 
select * from instructor_schedule where instructor_schedule_id in (select instructor_schedule_id                                  from                                      (select                                          min(start_date),                                          course_name,                                          instructor_name,                                          instructor_schedule_id                                      from                                          instructor_schedule                                      group by                                          course_name, instructor_name)  inner);​

Which clearly isn’t going to work due to instructor_schedule_id being present, since we can’t select a column without including it in the group by or passing it to an aggregate function.

My question is, how can I get the primary key associated with the row where min(start_date) was obtained?

Here is some DDL for reproducing:

drop table if exists instructor_schedule;

create table instructor_schedule
(
    instructor_schedule_id bigserial primary key,
    instructor_name varchar(255),
    course_name varchar(255),
    start_date timestamp with time zone
);

alter table instructor_schedule add constraint no_dupes unique(instructor_name, course_name, start_date);

insert into instructor_schedule (instructor_name, course_name, start_date) 
values
('bob', 'databases', '2015-01-01 00:00:00.000000 +00:00'),
('bob', 'databases', '2018-01-01 00:00:00.000000 +00:00'),
('bob', 'databases', '2024-01-01 00:00:00.000000 +00:00'),
('alice', 'databases', '2021-01-01 00:00:00.000000 +00:00'),
('alice', 'databases', '2022-01-01 00:00:00.000000 +00:00');

 
drop table if exists instructor_schedule;​create table instructor_schedule(    instructor_schedule_id bigserial primary key,    instructor_name varchar(255),    course_name varchar(255),    start_date timestamp with time zone);​alter table instructor_schedule add constraint no_dupes unique(instructor_name, course_name, start_date);​insert into instructor_schedule (instructor_name, course_name, start_date) values('bob', 'databases', '2015-01-01 00:00:00.000000 +00:00'),('bob', 'databases', '2018-01-01 00:00:00.000000 +00:00'),('bob', 'databases', '2024-01-01 00:00:00.000000 +00:00'),('alice', 'databases', '2021-01-01 00:00:00.000000 +00:00'),('alice', 'databases', '2022-01-01 00:00:00.000000 +00:00');​

Answer

thanks for the precise scripts, this helps.

Here is one way to accomplish what you are looking for. first step is to filter out for all dates prior to now()(ie ignore future dates) ,we can then we can rank them on the basis of the latest entry available, lastly we filter out on the entries where rnk=1 which will pick up those entries that are the latest.

with data
  as (
     select *,row_number() over(partition by instructor_name,course_name order by start_date desc) as rnk
       from instructor_schedule
      where start_date<=now()  --restrict records to entries which have occured in the past
     )
select *
  from data x
 where x.rnk=1 --as we are ranking using the earliest start_date rnk=1, gets you the latest entry record.

 
with data  as (     select *,row_number() over(partition by instructor_name,course_name order by start_date desc) as rnk       from instructor_schedule      where start_date<=now()  --restrict records to entries which have occured in the past     )select *  from data x where x.rnk=1 --as we are ranking using the earliest start_date rnk=1, gets you the latest entry record.  ​

In your original query, you can accomplish the results you are looking for as below.

select orig.* 
  from instructor_schedule orig 
  join (select max(start_date) as max_start_date,
                   course_name,
                   instructor_name
              from instructor_schedule as is1
             where start_date<=now()             
          group by course_name, 
                   instructor_name
        ) inner1
    on orig.course_name=inner1.course_name
   and orig.instructor_name=inner1.instructor_name
   and orig.start_date=inner1.max_start_date

 
select orig.*   from instructor_schedule orig   join (select max(start_date) as max_start_date,                   course_name,                   instructor_name              from instructor_schedule as is1             where start_date<=now()                       group by course_name,                    instructor_name        ) inner1    on orig.course_name=inner1.course_name   and orig.instructor_name=inner1.instructor_name   and orig.start_date=inner1.max_start_date​

proof of example

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=09b536a0495ab8786cc2cd0dbf8077ed

Advertisement

Answer