Is this a perfect SQL query to fetch random items

I have the query below to fetch the books of a random author, ordered by the books' last modified date. Because I am new to SQL, something tells me this query can be improved and simplified. I would appreciate the help of the SQL experts; if it is already as simple as it can get, then I will consider myself an SQL expert :).

SELECT b.id, bc.author
FROM book_authors bc
INNER JOIN books b ON bc.book = b.id
WHERE bc.author = (SELECT author FROM book_authors ORDER BY random() LIMIT 1)
  AND b.status = 'GOOD'
GROUP BY b.id, bc.author
ORDER BY MAX(b.modified_date) DESC
LIMIT 10 OFFSET 0 -- for pagination purposes

Table structures

book_authors
book    author
1       1
2       3
3       1

books
id    status    modified_date
1     GOOD      01-01-2010
2     GOOD      02-01-2010
3     GOOD      03-01-2010

authors
id
1
2
3


Looking for an output like:

authorId                  bookId
1 (a random author)       3
1 (same random author)    1


Answer

Your query is not taking a random sample of authors. It is taking a random sample from book_authors, which is different. In particular, authors with more books are more likely to be chosen. Consider:

author_id    book_id
    1           1
    1           2
    1           3
    2           4

A “random” sample from this table is going to choose 1 three times more often than 2. Such biased results are not how I interpret “fetch books of a random author”.
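A quick sketch makes the bias concrete. Here it is reproduced in Python with the in-memory SQLite driver (SQLite also supports `ORDER BY random()`, so it stands in for Postgres; the table and data mirror the four-row example above):

```python
import sqlite3

# Rebuild the four-row example table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book_authors (author_id INTEGER, book_id INTEGER)")
conn.executemany("INSERT INTO book_authors VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 4)])

# Under "ORDER BY random() LIMIT 1", each author's chance of being picked
# is proportional to their row count in book_authors:
weights = conn.execute(
    "SELECT author_id, COUNT(*) FROM book_authors"
    " GROUP BY author_id ORDER BY author_id"
).fetchall()
print(weights)  # [(1, 3), (2, 1)] -- author 1 is three times as likely

# Drawing many samples makes the skew visible:
draws = [conn.execute(
    "SELECT author_id FROM book_authors ORDER BY random() LIMIT 1"
).fetchone()[0] for _ in range(1000)]
print(draws.count(1), draws.count(2))  # roughly 750 vs 250
```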

So, you should be doing the sampling from the authors table. The structure of your query is reasonable (see below), although no aggregation is needed:

SELECT b.id, bc.author
FROM book_authors bc INNER JOIN
     books b
     ON bc.book = b.id 
WHERE bc.author = (SELECT a.id FROM authors a ORDER BY random() LIMIT 1) AND
      b.status = 'GOOD'
ORDER BY b.modified_date DESC;

Now, whether this actually does what you want is interesting. As written, Postgres has two ways of running this:

  • Running the subquery once for the query. That is getting one random author and using it throughout.
  • Running the subquery once per comparison in the WHERE.

The subquery would usually return different results each time it is run (technically, the subquery is “nondeterministic”). So, the second method could pick a different random author for each comparison, and those are not the results you intend. In practice, the Postgres optimizer (I think) ignores the indeterminate nature of the subquery and executes it only once.
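The difference between the two strategies can be sketched in plain Python, with lists standing in for the tables (the row data is the hypothetical four-row example from above):

```python
import random

# (author, book) rows, as in the example table above.
book_authors = [(1, 1), (1, 2), (1, 3), (2, 4)]
authors = [1, 2]

rng = random.Random(42)  # seeded so the sketch is repeatable

# Strategy 1: evaluate the subquery once, then filter.
# The result contains at most one distinct author -- the intended behavior.
picked = rng.choice(authors)
once = [row for row in book_authors if row[0] == picked]
print({a for a, b in once})

# Strategy 2: re-evaluate the "subquery" for every comparison.
# Different rows can match different random authors, mixing authors.
def per_row(seed):
    r = random.Random(seed)
    return [row for row in book_authors if row[0] == r.choice(authors)]

# Across enough seeds, some result sets contain more than one author.
mixed = any(len({a for a, b in per_row(s)}) > 1 for s in range(100))
print(mixed)
```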

To fix this potential problem, it is better to move the logic to the FROM clause; then it is evaluated only once:

SELECT b.id, bc.author
FROM (SELECT a.id
      FROM authors a
      ORDER BY random()
      LIMIT 1
     ) a INNER JOIN
     book_authors bc
     ON bc.author = a.id INNER JOIN
     books b
     ON bc.book = b.id 
WHERE b.status = 'GOOD'
ORDER BY b.modified_date DESC;
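As a sanity check, this FROM-clause version can be run against the question's sample data in SQLite (dates normalized to ISO text here so that text ordering matches date ordering); whichever author is drawn, every returned row carries that same author:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER);
    INSERT INTO authors VALUES (1), (2), (3);
    CREATE TABLE book_authors (book INTEGER, author INTEGER);
    INSERT INTO book_authors VALUES (1, 1), (2, 3), (3, 1);
    CREATE TABLE books (id INTEGER, status TEXT, modified_date TEXT);
    INSERT INTO books VALUES (1, 'GOOD', '2010-01-01'),
                             (2, 'GOOD', '2010-01-02'),
                             (3, 'GOOD', '2010-01-03');
""")

rows = conn.execute("""
    SELECT b.id, bc.author
    FROM (SELECT a.id FROM authors a ORDER BY random() LIMIT 1) a
         INNER JOIN book_authors bc ON bc.author = a.id
         INNER JOIN books b ON bc.book = b.id
    WHERE b.status = 'GOOD'
    ORDER BY b.modified_date DESC
""").fetchall()

# All rows belong to the single random author (possibly none, since
# author 2 has no books in the sample data).
print(rows)
```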

Note: The use of ORDER BY random() LIMIT 1 to fetch one random row works. However, for anything but small tables it is quite expensive performance-wise. I wouldn’t get in the habit of using that.

User contributions licensed under: CC BY-SA