I have the query below to fetch the books of a random author, ordered by the books' last modified date. Because I am new to SQL, something tells me this query can be improved and simplified. I would appreciate the help of the SQL experts; if this is already simplified, then I will consider myself an SQL expert :).
```sql
SELECT b.id, bc.author
FROM book_authors bc INNER JOIN
     books b
     ON bc.book = b.id
WHERE bc.author = (SELECT author FROM book_authors ORDER BY random() LIMIT 1) AND
      b.status = 'GOOD'
GROUP BY b.id, bc.author
ORDER BY MAX(b.modified_date) DESC
LIMIT 10 OFFSET 0  -- for pagination purposes
```
Table structures
book_authors

| book | author |
| --- | --- |
| 1 | 1 |
| 2 | 3 |
| 3 | 1 |

books

| id | status | modified_date |
| --- | --- | --- |
| 1 | GOOD | 01-01-2010 |
| 2 | GOOD | 02-01-2010 |
| 3 | GOOD | 03-01-2010 |

authors

| id |
| --- |
| 1 |
| 2 |
| 3 |
Looking for an output like:

| authorId | bookId |
| --- | --- |
| 1 (a random author) | 3 |
| 1 (same random author) | 1 |
Answer
Your query is not taking a random sample of authors. It is taking a random sample from `book_authors`, which is different. In particular, authors with more books are more likely to be chosen. Consider:

| author_id | book_id |
| --- | --- |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 4 |

A "random" sample from this table is going to choose author `1` three times more often than author `2`. Such biased results are not how I interpret "fetch books of a random author".
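If you want to see this bias concretely, you can compute each author's share of `book_authors` rows; because `ORDER BY random() LIMIT 1` picks each row with equal probability, that share is exactly the probability of that author being chosen. A quick sketch against the hypothetical four-row data above:

```sql
-- Each book_authors row is equally likely under ORDER BY random() LIMIT 1,
-- so an author's pick probability is their share of the rows.
-- For the four-row example above: author 1 -> 0.75, author 2 -> 0.25.
SELECT author,
       COUNT(*)::numeric / SUM(COUNT(*)) OVER () AS pick_probability
FROM book_authors
GROUP BY author;
```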
So, you should be doing the sampling from the `authors` table. The structure of your query is reasonable (see below), although no aggregation is needed:

```sql
SELECT b.id, bc.author
FROM book_authors bc INNER JOIN
     books b
     ON bc.book = b.id
WHERE bc.author = (SELECT a.id FROM authors a ORDER BY random() LIMIT 1) AND
      b.status = 'GOOD'
ORDER BY b.modified_date DESC;
```
Now, whether this actually does what you want is interesting. As written, Postgres has two ways of running this:

- Running the subquery once for the whole query, i.e. getting one random author and using it throughout.
- Running the subquery once per comparison in the `WHERE` clause.
The subquery would usually return different results each time it is run (technically, the subquery is "nondeterministic"). So, the second method would use a different random author for each comparison — and that is not the result you intend. In practice, the Postgres optimizer (I think) ignores the indeterminate nature of the subquery and executes it only once.
To fix this potential problem, it is better to move the logic to the `FROM` clause; then it is evaluated only once:
```sql
SELECT b.id, bc.author
FROM (SELECT a.id
      FROM authors a
      ORDER BY random()
      LIMIT 1
     ) a INNER JOIN
     book_authors bc
     ON bc.author = a.id INNER JOIN
     books b
     ON bc.book = b.id
WHERE b.status = 'GOOD'
ORDER BY b.modified_date DESC;
```
Note: The use of `ORDER BY random() LIMIT 1` to fetch one random row works. However, for anything but small tables it is quite expensive performance-wise, because it sorts the entire table just to keep one row. I wouldn't get in the habit of using it.
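If the `authors` table is large and performance matters, one cheaper pattern is `TABLESAMPLE` (available in Postgres 9.5+), which reads only a sample of the table's pages instead of sorting the whole table. This is a sketch, not a drop-in replacement: `SYSTEM` sampling is page-based, so it is only approximately uniform, and a small sample percentage can occasionally return zero rows.

```sql
-- Pick a roughly random author without sorting the whole table.
-- TABLESAMPLE SYSTEM (1) reads about 1% of the table's pages;
-- ORDER BY random() then only has to sort that small slice.
SELECT a.id
FROM authors a TABLESAMPLE SYSTEM (1)
ORDER BY random()
LIMIT 1;
```

If a truly uniform pick is required, `TABLESAMPLE BERNOULLI` samples individual rows rather than pages, at somewhat higher cost.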