I have been working on some database queries and am running into some issues with trying to obtain a primary key for a row where an aggregated value was obtained. I am using Postgresql v10.
Take for example the following table:
instructor_schedule_id | instructor_name | course_name | start_date |
---|---|---|---|
1 | bob | databases | 2015-01-01 00:00:00.000000 +00:00 |
2 | bob | databases | 2018-01-01 00:00:00.000000 +00:00 |
3 | bob | databases | 2024-01-01 00:00:00.000000 +00:00 |
4 | alice | databases | 2021-01-01 00:00:00.000000 +00:00 |
5 | alice | databases | 2022-01-01 00:00:00.000000 +00:00 |
What I would like to obtain, is the “current” value for each instructor/course combination, where current is determined by finding the most recent record from the past.
My desired result set would look like this:
instructor_schedule_id | instructor_name | course_name | start_date |
---|---|---|---|
2 | bob | databases | 2018-01-01 00:00:00.000000 +00:00 |
5 | alice | databases | 2022-01-01 00:00:00.000000 +00:00 |
As you can see, we have the most recent assignment to each instructor for each of their courses (only one course in this example, but there are typically many). bobs 2024 assignment hasn’t happened yet, and bobs 2015 assignment has expired since we now have a 2018 assignment. Same for alice.
Looking at this, what I WANT to do is:
select * from instructor_schedule where instructor_schedule_id in (select instructor_schedule_id from (select min(start_date), course_name, instructor_name, instructor_schedule_id from instructor_schedule group by course_name, instructor_name) inner);
Which clearly isn’t going to work due to instructor_schedule_id
being present, since we can’t select a column without including it in the group by or passing it to an aggregate function.
My question is, how can I get the primary key associated with the row where min(start_date)
was obtained?
Here is some DDL for reproducing:
drop table if exists instructor_schedule; create table instructor_schedule ( instructor_schedule_id bigserial primary key, instructor_name varchar(255), course_name varchar(255), start_date timestamp with time zone ); alter table instructor_schedule add constraint no_dupes unique(instructor_name, course_name, start_date); insert into instructor_schedule (instructor_name, course_name, start_date) values ('bob', 'databases', '2015-01-01 00:00:00.000000 +00:00'), ('bob', 'databases', '2018-01-01 00:00:00.000000 +00:00'), ('bob', 'databases', '2024-01-01 00:00:00.000000 +00:00'), ('alice', 'databases', '2021-01-01 00:00:00.000000 +00:00'), ('alice', 'databases', '2022-01-01 00:00:00.000000 +00:00');
Advertisement
Answer
thanks for the precise scripts, this helps.
Here is one way to accomplish what you are looking for. first step is to filter out for all dates prior to now()(ie ignore future dates) ,we can then we can rank them on the basis of the latest entry available, lastly we filter out on the entries where rnk=1 which will pick up those entries that are the latest.
with data as ( select *,row_number() over(partition by instructor_name,course_name order by start_date desc) as rnk from instructor_schedule where start_date<=now() --restrict records to entries which have occured in the past ) select * from data x where x.rnk=1 --as we are ranking using the earliest start_date rnk=1, gets you the latest entry record.
In your original query, you can accomplish the results you are looking for as below.
select orig.* from instructor_schedule orig join (select max(start_date) as max_start_date, course_name, instructor_name from instructor_schedule as is1 where start_date<=now() group by course_name, instructor_name ) inner1 on orig.course_name=inner1.course_name and orig.instructor_name=inner1.instructor_name and orig.start_date=inner1.max_start_date
proof of example
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=09b536a0495ab8786cc2cd0dbf8077ed