Given the following two tables, I’d like to know how to calculate the MEDIAN rating for reviews on a weekly basis.
reviews
(id, user_id, completed_at)
reviews.completed_at -- lets us know the user submitted the review, it's not a draft.
reviews_areas
reviews_areas (created_at, review_id, rating)
reviews_areas.rating = INT between 0…10)
Example Data:
reviews:
+----+---------+---------------------+
| id | user_id | completed_at |
+----+---------+---------------------+
| 1 | 100 | 2019-07-20 11:34:40 |
| 2 | 100 | 2019-07-22 11:34:40 |
| 3 | 500 | 2019-07-30 16:34:40 |
+----+---------+---------------------+
reviews_areas:
+------------+-----------+--------+
| created_at | review_id | rating |
+------------+-----------+--------+
| 1:34:40 | 1 | 0 |
| 12:34:40 | 1 | 5 |
| 11:34:40 | 1 | 10 |
| 5:34:40 | 1 | 9 |
| 6:34:40 | 2 | 1 |
| 1:34:40 | 2 | 5 |
| 2:32:40 | 3 | 5 |
+------------+-----------+--------+
Advertisement
Answer
The problem is not clear. I’ll use reviews.completed_at
for the date because reviews_areas.created_at
contains just a time.
We need to join reviews
for the date with reviews_areas
for the rating.
To avoid the same week on different years from overlapping, we use yearweek
to turn dates into the year + week.
To get the median we need to find the middle row (or rows if there’s an even number) for each week. There’s many ways to do this. I’ll crib from Justin Grant’s clever answer. We count the row_number()
ordered ascending and descending. Where they overlap +/- 1 are the median rows. Then we average them.
1 2 3 4 5 6
6 5 4 3 2 1
^^^
median rows
First, we get the row numbers by week.
select
yearweek(completed_at) as week,
rating,
row_number() over(
partition by yearweek(completed_at)
order by rating asc, id asc
) as row_asc,
row_number() over(
partition by yearweek(completed_at)
order by rating desc, id desc
) as row_desc
from reviews_areas ra
join reviews r on r.id = ra.review_id
The row numbers are ordered by order by rating asc, id asc
. The id
is a secondary sort to disambiguate rows with the same rating.
+--------+--------+---------+----------+
| week | rating | row_asc | row_desc |
+--------+--------+---------+----------+
| 201928 | 10 | 4 | 1 |
| 201928 | 9 | 3 | 2 |
| 201928 | 5 | 2 | 3 |
| 201928 | 0 | 1 | 4 |
| 201929 | 5 | 2 | 1 |
| 201929 | 1 | 1 | 2 |
| 201930 | 5 | 1 | 1 |
+--------+--------+---------+----------+
Then we use that as a common table expression to average the middle rows of each week. A subquery works just as well.
with rating_weeks as (
select
yearweek(completed_at) as week,
rating,
row_number() over(
partition by yearweek(completed_at)
order by rating asc, id asc
) as row_asc,
row_number() over(
partition by yearweek(completed_at)
order by rating desc, id desc
) as row_desc
from reviews_areas ra
join reviews r on r.id = ra.review_id
)
select
week,
-- Take the average of the possibly 2 median rows
avg(rating)
from rating_weeks
where
-- Find the rows which overlap +/- 1. These are the median rows.
row_asc in (row_desc, row_desc - 1, row_desc + 1)
group by week
order by week
+--------+-------------+
| week | avg(rating) |
+--------+-------------+
| 201928 | 7.0000 |
| 201929 | 3.0000 |
| 201930 | 5.0000 |
+--------+-------------+