Skip to content
Advertisement

With MySQL, how to calculate a MEDIAN with Left Join records?

Given the following two tables, I’d like to know how to calculate the MEDIAN rating for reviews on a weekly basis.

reviews

(id, user_id, completed_at)

reviews.completed_at -- lets us know the user submitted the review, it's not a draft.

reviews_areas

reviews_areas (created_at, review_id, rating)

reviews_areas.rating = INT between 0…10)

Example Data:

reviews:

+----+---------+---------------------+
| id | user_id |    completed_at     |
+----+---------+---------------------+
|  1 |     100 | 2019-07-20 11:34:40 |
|  2 |     100 | 2019-07-22 11:34:40 |
|  3 |     500 | 2019-07-30 16:34:40 |
+----+---------+---------------------+

reviews_areas:

+------------+-----------+--------+
| created_at | review_id | rating |
+------------+-----------+--------+
| 1:34:40    |         1 |      0 |
| 12:34:40   |         1 |      5 |
| 11:34:40   |         1 |     10 |
| 5:34:40    |         1 |      9 |
| 6:34:40    |         2 |      1 |
| 1:34:40    |         2 |      5 |
| 2:32:40    |         3 |      5 |
+------------+-----------+--------+

Advertisement

Answer

The problem is not clear. I’ll use reviews.completed_at for the date because reviews_areas.created_at contains just a time.

We need to join reviews for the date with reviews_areas for the rating.

To avoid the same week on different years from overlapping, we use yearweek to turn dates into the year + week.

To get the median we need to find the middle row (or rows if there’s an even number) for each week. There’s many ways to do this. I’ll crib from Justin Grant’s clever answer. We count the row_number() ordered ascending and descending. Where they overlap +/- 1 are the median rows. Then we average them.

1 2 3 4 5 6
6 5 4 3 2 1
    ^^^
    median rows

First, we get the row numbers by week.

select
    yearweek(completed_at) as week,
    rating,
    row_number() over(
        partition by yearweek(completed_at)
        order by rating asc, id asc
    ) as row_asc,
    row_number() over(
        partition by yearweek(completed_at)
        order by rating desc, id desc
    ) as row_desc
from reviews_areas ra
join reviews r on r.id = ra.review_id

The row numbers are ordered by order by rating asc, id asc. The id is a secondary sort to disambiguate rows with the same rating.

+--------+--------+---------+----------+
| week   | rating | row_asc | row_desc |
+--------+--------+---------+----------+
| 201928 |     10 |       4 |        1 |
| 201928 |      9 |       3 |        2 |
| 201928 |      5 |       2 |        3 |
| 201928 |      0 |       1 |        4 |
| 201929 |      5 |       2 |        1 |
| 201929 |      1 |       1 |        2 |
| 201930 |      5 |       1 |        1 |
+--------+--------+---------+----------+

Then we use that as a common table expression to average the middle rows of each week. A subquery works just as well.

with rating_weeks as (
    select
        yearweek(completed_at) as week,
        rating,
        row_number() over(
            partition by yearweek(completed_at)
            order by rating asc, id asc
        ) as row_asc,
        row_number() over(
            partition by yearweek(completed_at)
            order by rating desc, id desc
        ) as row_desc
    from reviews_areas ra
    join reviews r on r.id = ra.review_id
)
select
    week,
    -- Take the average of the possibly 2 median rows
    avg(rating)
from rating_weeks
where
    -- Find the rows which overlap +/- 1. These are the median rows.
    row_asc in (row_desc, row_desc - 1, row_desc + 1)
group by week
order by week
+--------+-------------+
| week   | avg(rating) |
+--------+-------------+
| 201928 |      7.0000 |
| 201929 |      3.0000 |
| 201930 |      5.0000 |
+--------+-------------+
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement