Should SQL ranking functionality be considered as “use with caution”

This question originates from a discussion on whether to use SQL ranking functionality or not in a particular case.

Any common RDBMS includes some ranking functionality, i.e. its query language has elements like TOP n ... ORDER BY key, ROW_NUMBER() OVER (ORDER BY key), or ORDER BY key LIMIT n (overview).

They do a great job in increasing performance if you want to present only a small chunk out of a huge number of records. But they also introduce a major pitfall: If key is not unique results are non-deterministic. Consider the following example:

users

user_id name
1       John
2       Paul
3       George
4       Ringo

logins

login_id user_id login_date
1        4       2009-08-17
2        1       2009-08-18
3        2       2009-08-19
4        3       2009-08-20

​x
 
users​user_id name1       John2       Paul3       George4       Ringo​logins​login_id user_id login_date1        4       2009-08-172        1       2009-08-183        2       2009-08-194        3       2009-08-20​

A query is supposed to return the person who logged in last:

SELECT TOP 1 users.*
FROM
  logins JOIN
  users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC

 
SELECT TOP 1 users.*FROM  logins JOIN  users ON logins.user_id = users.user_idORDER BY logins.login_date DESC​

Just as expected George is returned and everything looks fine. But then a new record is inserted into logins table:

1        4       2009-08-17
2        1       2009-08-18
3        2       2009-08-19
4        3       2009-08-20
5        4       2009-08-20

 
1        4       2009-08-172        1       2009-08-183        2       2009-08-194        3       2009-08-205        4       2009-08-20​

What does the query above return now? Ringo? George? You can’t tell. As far as I remember e.g. MySQL 4.1 returns the first record physically created that matches the criteria, i.e. the result would be George. But this may vary from version to version and from DBMS to DBMS. What should have been returned? One might say Ringo since he apparently logged in last but this is pure interpretation. In my opinion both should have been returned, because you can’t decide unambiguously from the data available.

So this query matches the requirements:

SELECT users.*
FROM
  logins JOIN
  users ON
    logins.user_id = users.user_id AND
    logins.login_date = (
      SELECT max(logins.login_date)
      FROM
        logins JOIN
        users ON logins.user_id = users.user_id)

 
SELECT users.*FROM  logins JOIN  users ON    logins.user_id = users.user_id AND    logins.login_date = (      SELECT max(logins.login_date)      FROM        logins JOIN        users ON logins.user_id = users.user_id)​

As an alternative some DBMSs provide special functions (e.g. Microsoft SQL Server 2005 introduces TOP n WITH TIES ... ORDER BY key (suggested by gbn), RANK, and DENSE_RANK for this very purpose).

If you search SO for e.g. ROW_NUMBER you’ll find numerous solutions which suggest using ranking functionality and miss to point out the possible problems.

Question: What advice should be given if a solution that includes ranking functionality is proposed?

Answer

This is the summary:

Use your head first. Should be obvious, but it is always a good point to start. Do you expect n rows exactly or do you expect a possibly varying number of rows that fulfill a constraint? Reconsider your design. If you’re expecting n rows exactly, your model might be designed poorly if it’s impossible to identify a row unambiguously. If you expect a possibly varying number of rows, you might need to adjust your UI in order to present your query results.
Add columns to key that make it unique (e.g. PK). You at least gain back control on the returned result. There is almost always a way to do this as Quassnoi pointed out.
Consider using possibly more suitable functions like RANK, DENSE_RANK and TOP n WITH TIES. They are available in Microsoft SQL Server by 2005 version and in PosgreSQL from 8.4 onwards. If these functions are not available, consider using nested queries with aggregation instead of ranking functions.

Advertisement

Answer