Skip to content
Advertisement

How to use GROUP BY which takes into account two columns?

I have a message table like this in MySQL.

   +--------------------+--------------+------+-----+---------------------+----------------+
    | Field              | Type         | Null | Key | Default             | Extra          |
    +--------------------+--------------+------+-----+---------------------+----------------+
    | id                 | int(11)      | NO   | PRI | NULL                | auto_increment |
    | subject            | varchar(120) | NO   |     | NULL                |                |
    | body               | longtext     | NO   |     | NULL                |                |
    | sent_at            | datetime     | YES  |     | NULL                |                |
    | recipient_read     | tinyint(1)   | NO   |     | 0                   |                |
    | recipient_id       | int(11)      | NO   | MUL | 0                   |                |
    | sender_id          | int(11)      | NO   | MUL | 0                   |                |
    | thread_id          | int(11)      | NO   | MUL | 0                   |                |
    +--------------------+--------------+------+-----+---------------------+----------------+

Messages in a recepient’s inbox are to be grouped by thread_id like this:

SELECT * FROM message WHERE recipient_id=42  GROUP BY thread_id ORDER BY sent_at DESC

My problem is how to take recipient_read into account so that each row in the result also show what is the recipient_read value of the last message in the thread?

Advertisement

Answer

In the original query, the ORDER BY is only satisfied after the GROUP BY operation. The ORDER BY affects the order of the returned rows. It does not influence which rows are returned.

With the non-aggregate expression in the SELECT list, it is indeterminate which values will be returned; the value of each column will be from some row in the collapsed group. But it’s not guaranteed to be the first row, or the latest row, or any other specific row. The behavior of MySQL (allowing the query to run without throwing an error) is enabled by a MySQL extension.

Other relational databases would throw a “non-aggregate in SELECT list not in GROUP BY” type error with the query. MySQL exhibits a similar (standard) behavior when ONLY_FULL_GROUP_BY is included in sql_mode system variable. MySQL allows the original query to run (and return unexpected results) because of a non-standard, MySQL-specific extension.

The pattern of the original query is essentially broken.


To get a resultset that satisfies the specification, we can write a query to get the latest (maximum) sent_at datetime for each thread_id, for a given set of recipient_id (in the example query, the set is a single recipient_id.)

SELECT lm.recipient_id
     , lm.thread_id
     , MAX(lm.sent_at) AS latest_sent_at
  FROM message lm
 WHERE lm.recipient_id = 42
 GROUP
    BY lm.recipient_id
     , lm.thread_id

We can use the result from that query in another query, by making in an inline view (wrap it in parens, and reference it in the FROM clause like table, assign an alias).

We can join that resultset to the original table to retrieve all of the columns from the rows that match.

Something like this:

SELECT m.id
     , m.subject
     , m.body
     , m.sent_at
     , m.recipient_read
     , m.recipient_id
     , m.sender_id
     , m.thread_id
  FROM (
         SELECT lm.recipient_id
              , lm.thread_id
              , MAX(lm.sent_at) AS latest_sent_at
           FROM message lm
          WHERE lm.recipient_id = 42
          GROUP
             BY lm.recipient_id
              , lm.thread_id
       ) l
  JOIN message m
    ON m.recipient_id = l.recipient_id
   AND m.thread_id    = l.thread_id
   AND m.sent_at      = l.latest_sent_at
 ORDER
    BY ...

Note that if (recipient_id,thread_id,sent_at) is not guaranteed to be unique, there is a potential that there will be multiple rows with the same “maximum” sent_at; that is, we could get more than one row back for a given maximum sent_at.

We can order that result however we want, with whatever expressions. That will affect only the order that the rows are returned in, not which rows are returned.

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement