
Efficiently select latest row for each group in a very large table?

I have (for example’s sake) a table Users (user_id, status, timestamp, ...).
I also have another table SpecialUsers (user_id, ...).

I need to show each special user’s latest status.

The problem is that the Users table is VERY, VERY LARGE (more than 50 billion rows). Most of the solutions in, for instance, this question just hang or fail with a “disk full” error.

SpecialUsers table is much smaller – “only” 600K rows.

SELECT DISTINCT ON() is not supported. I am working on Amazon Redshift.

EDIT: per request to see the failed attempts – one of those resulting in the disk full error is like this:

I know that I’m joining a big table with itself, but I was hoping that the first join with the small table would reduce the number of processed rows.

Anyway, it seems that window functions are the solution here.


Answer

Perhaps a join with a window function will work:

This specifically uses max() instead of row_number() on the speculation that it might use slightly fewer resources.
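
The approach described above might look roughly like the following. This is only a sketch, not necessarily the exact query from the answer: the table and column names (Users, SpecialUsers, user_id, status, timestamp) are taken from the question, and the aliases u, su, t and max_timestamp are made up for illustration.

```sql
-- Sketch only: names come from the question; aliases are hypothetical.
-- "timestamp" is quoted because it is also a type name.
SELECT user_id, status, "timestamp"
FROM (
    SELECT u.user_id,
           u.status,
           u."timestamp",
           -- latest timestamp per user, computed in the same pass
           MAX(u."timestamp") OVER (PARTITION BY u.user_id) AS max_timestamp
    FROM Users u
    -- join to the small table first so that only special users' rows
    -- reach the window function
    JOIN SpecialUsers su
      ON su.user_id = u.user_id
) t
WHERE "timestamp" = max_timestamp;
```

Note that if a user has several rows sharing the same latest timestamp, this form returns all of them, whereas a row_number() variant would keep exactly one row per user.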

User contributions licensed under: CC BY-SA