I have the query below to fetch the books of a random author, ordered by each book's last modified date. Because I am new to SQL, something tells me this query can be improved and simplified. I would appreciate the help of the SQL experts; if this is already simplified then I will consider myself an SQL expert :).
SELECT b.id, bc.author FROM book_authors bc
INNER JOIN books b ON bc.book = b.id
WHERE bc.author = (SELECT author FROM book_authors ORDER BY random() limit 1) AND b.status = 'GOOD'
GROUP BY b.id, bc.author
ORDER BY MAX(b.modified_date) DESC
LIMIT 10 OFFSET 0 -- for pagination purposes
Table structures
book_authors
book author
1 1
2 3
3 1
books
id status modified_date
1 GOOD 01-01-2010
2 GOOD 02-01-2010
3 GOOD 03-01-2010
authors
id
1
2
3
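In case it helps, the tables could be created roughly like this (the column types and constraints are my guesses, they are not stated above):

```sql
-- Assumed DDL matching the sample data; types and constraints are guesses.
CREATE TABLE authors (
    id integer PRIMARY KEY
);

CREATE TABLE books (
    id            integer PRIMARY KEY,
    status        text NOT NULL,
    modified_date date NOT NULL
);

CREATE TABLE book_authors (
    book   integer REFERENCES books (id),
    author integer REFERENCES authors (id),
    PRIMARY KEY (book, author)
);
```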
Looking for an output like:
authorId bookId
1 (a random author) 3
1 (same random author) 1
Answer
Your query is not taking a random sample of authors. It is taking a random sample from book_authors, which is different. In particular, authors with more books are more likely to be chosen. Consider:
author_id book_id
1 1
1 2
1 3
2 4
A “random” sample from this table is going to choose author 1 three times more often than author 2. Such biased results are not how I interpret “fetch books of a random author”.
So, you should be doing the sampling from the authors table. The structure of your query is reasonable (see below), although no aggregation is needed:
SELECT b.id, bc.author
FROM book_authors bc INNER JOIN
books b
ON bc.book = b.id
WHERE bc.author = (SELECT a.id FROM authors a ORDER BY random() LIMIT 1) AND
b.status = 'GOOD'
ORDER BY b.modified_date DESC;
Now, whether this actually does what you want is interesting. As written, Postgres has two ways of running this:
- Running the subquery once for the whole query; that is, getting one random author and using it throughout.
- Running the subquery once per comparison in the WHERE.
The subquery would usually return different results each time it is run (technically, the subquery is “nondeterministic”). So, the second method would use a different random author for each comparison, and that is not the result you intend. In practice, the Postgres optimizer (I think) ignores the indeterminate nature of the subquery and executes it only once.
To fix this potential problem, it is better to move the logic to the FROM clause; then it is evaluated only once:
SELECT b.id, bc.author
FROM (SELECT a.id
      FROM authors a
ORDER BY random()
LIMIT 1
) a INNER JOIN
book_authors bc
ON bc.author = a.id INNER JOIN
books b
ON bc.book = b.id
WHERE b.status = 'GOOD'
ORDER BY b.modified_date DESC;
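In Postgres 12 and later (where plain CTEs may be inlined by the optimizer), you can make the single evaluation explicit with a MATERIALIZED CTE. This is a sketch of the same query in that form; if you still want the pagination from your original query, append the LIMIT/OFFSET as shown:

```sql
WITH random_author AS MATERIALIZED (  -- MATERIALIZED forces one evaluation
    SELECT a.id
    FROM authors a
    ORDER BY random()
    LIMIT 1
)
SELECT b.id, bc.author
FROM random_author ra INNER JOIN
     book_authors bc
     ON bc.author = ra.id INNER JOIN
     books b
     ON bc.book = b.id
WHERE b.status = 'GOOD'
ORDER BY b.modified_date DESC
LIMIT 10 OFFSET 0;  -- pagination, as in the original query
```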
Note: The use of ORDER BY random() LIMIT 1 to fetch one random row works. However, for anything but small tables it is quite expensive performance-wise. I wouldn’t get in the habit of using that.
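One common cheaper alternative is to skip to a random offset instead of sorting the whole table. This is only a sketch: it still has to count the rows and scan past the skipped ones, but it avoids the full sort:

```sql
-- Pick a random row by offset instead of ORDER BY random().
-- Still scans up to the chosen offset, but performs no sort.
SELECT a.id
FROM authors a
OFFSET (floor(random() * (SELECT count(*) FROM authors)))::int
LIMIT 1;
```

For very large tables, TABLESAMPLE is another option worth reading about, though it samples pages rather than individual rows, so it needs extra care to get exactly one uniformly random row.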