Skip to content
Advertisement

Which one is quicker/optimized – Inner Join or Partition By – to obtain Aggregated data?

In my data ‘table1’, there are multiple records for each app_id. I’m trying to get a latest app_id entry. Which one of the below queries would be quick or better to consider in terms of performance and memory usage? Currently, I work on AWS server and use PostgreSQL.

Query 1:

SELECT b.* 
FROM 
    (SELECT app_id, max(datetime) as datetime
    from table1
    group by 1) a
INNER JOIN 
    (SELECT * 
    from table1) b
ON a.app_id = b.app_id AND a.datetime = b.datetime

Query 2:

SELECT * 
FROM 
    (SELECT *, row_number() over (partition by app_id order by datetime DESC 
    NULLS LAST) as num FROM table1) sub
WHERE sub.num=1

Advertisement

Answer

The right answer to “which is faster” is to try the queries on your data and your systems.

That said, there are some considerations in favor of row_number(). In particular, window functions are not an “accidental” feature in databases. Adding a new string function is just a function and the function may or may not be optimized.

On the other hand, window functions required rewriting/redesigning some fundamental components of the database engine. In general, this was done with performance in mind. So, I usually find that window functions are faster than equivalent constructs.

The only exception that I regularly find (across databases) ironically applies in your case. And, it is not using the join and group by. Instead it is:

select t1.*
from table1 t1
where t1.datetime = (select max(tt1.datetime)
                     from table1 tt1
                     where tt1.app_id = t1.app_id
                    );

Along with an index on table1(app_id, datetime).

The basic reason for the performance improvement is that this scans table1 once and does an index lookup at each row.

The join/group by is scanning the table multiple times and the aggregation is expensive. The row_number() version scans the table (or index), calculates the value, and then brings the value back to every row — similar to two scans of the data.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement