I have (for example’s sake) a table Users (user_id, status, timestamp, ...)
.
I also have another table SpecialUsers (user_id, ...)
.
I need to show each special user’s latest status.
The problem is that the Users
table is VERY, VERY LARGE (more than 50 Billion rows). Most of the solutions in for instance this question just hang or get “disk full” error.
SpecialUsers
table is much smaller – “only” 600K rows.
SELECT DISTINCT ON()
is not supported. Working on Amazon RedShift.
EDIT: per request to see the failed attempts – one of those resulting in the disk full error is like this:
with users_with_status (user_id, status, timestamp) as ( select su.user_id, u.instance_type, u.timestamp from specialusers su join users u on su.user_id = u.user_id ) select l.instance_id, l.instance_type from users_with_status l left outer join users_with_status r on l.user_id = r.user_id and l.timestamp < r.timestamp where r.timestamp is null;
I know that I’m joining a bug table with itself but was hoping that the first join with small table would reduce the number of processed rows.
Anyway, seems that window functions is the solution here.
Advertisement
Answer
Perhaps a join
with a window function will work:
select su.* from (select s.user_id, u.status, u.timestamp, max(u.timestamp) over (partition by s.user_id) as max_timestamp from specialusers s join users u on s.user_id = u.user_id ) su where timestamp = max_timestamp;
This specifically uses max()
instead of row_number()
on the speculation that it might use slightly fewer resources.