I have a message table like this in MySQL.
+--------------------+--------------+------+-----+---------------------+----------------+ | Field | Type | Null | Key | Default | Extra | +--------------------+--------------+------+-----+---------------------+----------------+ | id | int(11) | NO | PRI | NULL | auto_increment | | subject | varchar(120) | NO | | NULL | | | body | longtext | NO | | NULL | | | sent_at | datetime | YES | | NULL | | | recipient_read | tinyint(1) | NO | | 0 | | | recipient_id | int(11) | NO | MUL | 0 | | | sender_id | int(11) | NO | MUL | 0 | | | thread_id | int(11) | NO | MUL | 0 | | +--------------------+--------------+------+-----+---------------------+----------------+
Messages in a recepient’s inbox are to be grouped by thread_id
like this:
SELECT * FROM message WHERE recipient_id=42 GROUP BY thread_id ORDER BY sent_at DESC
My problem is how to take recipient_read
into account so that each row in the result also show what is the recipient_read
value of the last message in the thread?
Advertisement
Answer
In the original query, the ORDER BY
is only satisfied after the GROUP BY
operation. The ORDER BY
affects the order of the returned rows. It does not influence which rows are returned.
With the non-aggregate expression in the SELECT list, it is indeterminate which values will be returned; the value of each column will be from some row in the collapsed group. But it’s not guaranteed to be the first row, or the latest row, or any other specific row. The behavior of MySQL (allowing the query to run without throwing an error) is enabled by a MySQL extension.
Other relational databases would throw a “non-aggregate in SELECT list not in GROUP BY” type error with the query. MySQL exhibits a similar (standard) behavior when ONLY_FULL_GROUP_BY
is included in sql_mode
system variable. MySQL allows the original query to run (and return unexpected results) because of a non-standard, MySQL-specific extension.
The pattern of the original query is essentially broken.
To get a resultset that satisfies the specification, we can write a query to get the latest (maximum) sent_at
datetime for each thread_id
, for a given set of recipient_id
(in the example query, the set is a single recipient_id
.)
SELECT lm.recipient_id , lm.thread_id , MAX(lm.sent_at) AS latest_sent_at FROM message lm WHERE lm.recipient_id = 42 GROUP BY lm.recipient_id , lm.thread_id
We can use the result from that query in another query, by making in an inline view (wrap it in parens, and reference it in the FROM clause like table, assign an alias).
We can join that resultset to the original table to retrieve all of the columns from the rows that match.
Something like this:
SELECT m.id , m.subject , m.body , m.sent_at , m.recipient_read , m.recipient_id , m.sender_id , m.thread_id FROM ( SELECT lm.recipient_id , lm.thread_id , MAX(lm.sent_at) AS latest_sent_at FROM message lm WHERE lm.recipient_id = 42 GROUP BY lm.recipient_id , lm.thread_id ) l JOIN message m ON m.recipient_id = l.recipient_id AND m.thread_id = l.thread_id AND m.sent_at = l.latest_sent_at ORDER BY ...
Note that if (recipient_id,thread_id,sent_at)
is not guaranteed to be unique, there is a potential that there will be multiple rows with the same “maximum” sent_at
; that is, we could get more than one row back for a given maximum sent_at
.
We can order that result however we want, with whatever expressions. That will affect only the order that the rows are returned in, not which rows are returned.