Given the following two tables, I’d like to know how to calculate the MEDIAN rating for reviews on a weekly basis.
reviews
(id, user_id, completed_at) reviews.completed_at -- lets us know the user submitted the review, it's not a draft.
reviews_areas
reviews_areas (created_at, review_id, rating) reviews_areas.rating = INT between 0…10)
Example Data:
reviews:
+----+---------+---------------------+ | id | user_id | completed_at | +----+---------+---------------------+ | 1 | 100 | 2019-07-20 11:34:40 | | 2 | 100 | 2019-07-22 11:34:40 | | 3 | 500 | 2019-07-30 16:34:40 | +----+---------+---------------------+
reviews_areas:
+------------+-----------+--------+ | created_at | review_id | rating | +------------+-----------+--------+ | 1:34:40 | 1 | 0 | | 12:34:40 | 1 | 5 | | 11:34:40 | 1 | 10 | | 5:34:40 | 1 | 9 | | 6:34:40 | 2 | 1 | | 1:34:40 | 2 | 5 | | 2:32:40 | 3 | 5 | +------------+-----------+--------+
Advertisement
Answer
The problem is not clear. I’ll use reviews.completed_at
for the date because reviews_areas.created_at
contains just a time.
We need to join reviews
for the date with reviews_areas
for the rating.
To avoid the same week on different years from overlapping, we use yearweek
to turn dates into the year + week.
To get the median we need to find the middle row (or rows if there’s an even number) for each week. There’s many ways to do this. I’ll crib from Justin Grant’s clever answer. We count the row_number()
ordered ascending and descending. Where they overlap +/- 1 are the median rows. Then we average them.
1 2 3 4 5 6 6 5 4 3 2 1 ^^^ median rows
First, we get the row numbers by week.
select yearweek(completed_at) as week, rating, row_number() over( partition by yearweek(completed_at) order by rating asc, id asc ) as row_asc, row_number() over( partition by yearweek(completed_at) order by rating desc, id desc ) as row_desc from reviews_areas ra join reviews r on r.id = ra.review_id
The row numbers are ordered by order by rating asc, id asc
. The id
is a secondary sort to disambiguate rows with the same rating.
+--------+--------+---------+----------+ | week | rating | row_asc | row_desc | +--------+--------+---------+----------+ | 201928 | 10 | 4 | 1 | | 201928 | 9 | 3 | 2 | | 201928 | 5 | 2 | 3 | | 201928 | 0 | 1 | 4 | | 201929 | 5 | 2 | 1 | | 201929 | 1 | 1 | 2 | | 201930 | 5 | 1 | 1 | +--------+--------+---------+----------+
Then we use that as a common table expression to average the middle rows of each week. A subquery works just as well.
with rating_weeks as ( select yearweek(completed_at) as week, rating, row_number() over( partition by yearweek(completed_at) order by rating asc, id asc ) as row_asc, row_number() over( partition by yearweek(completed_at) order by rating desc, id desc ) as row_desc from reviews_areas ra join reviews r on r.id = ra.review_id ) select week, -- Take the average of the possibly 2 median rows avg(rating) from rating_weeks where -- Find the rows which overlap +/- 1. These are the median rows. row_asc in (row_desc, row_desc - 1, row_desc + 1) group by week order by week
+--------+-------------+ | week | avg(rating) | +--------+-------------+ | 201928 | 7.0000 | | 201929 | 3.0000 | | 201930 | 5.0000 | +--------+-------------+