In my data ‘table1’, there are multiple records for each app_id. I’m trying to get a latest app_id entry. Which one of the below queries would be quick or better to consider in terms of performance and memory usage? Currently, I work on AWS server and use PostgreSQL.
Query 1:
SELECT b.* FROM (SELECT app_id, max(datetime) as datetime from table1 group by 1) a INNER JOIN (SELECT * from table1) b ON a.app_id = b.app_id AND a.datetime = b.datetime
Query 2:
SELECT * FROM (SELECT *, row_number() over (partition by app_id order by datetime DESC NULLS LAST) as num FROM table1) sub WHERE sub.num=1
Advertisement
Answer
The right answer to “which is faster” is to try the queries on your data and your systems.
That said, there are some considerations in favor of row_number()
. In particular, window functions are not an “accidental” feature in databases. Adding a new string function is just a function and the function may or may not be optimized.
On the other hand, window functions required rewriting/redesigning some fundamental components of the database engine. In general, this was done with performance in mind. So, I usually find that window functions are faster than equivalent constructs.
The only exception that I regularly find (across databases) ironically applies in your case. And, it is not using the join
and group by
. Instead it is:
select t1.* from table1 t1 where t1.datetime = (select max(tt1.datetime) from table1 tt1 where tt1.app_id = t1.app_id );
Along with an index on table1(app_id, datetime)
.
The basic reason for the performance improvement is that this scans table1
once and does an index lookup at each row.
The join
/group by
is scanning the table multiple times and the aggregation is expensive. The row_number()
version scans the table (or index), calculates the value, and then brings the value back to every row — similar to two scans of the data.