I am running MySQL Community Server version 8.0.19.
I have been struggling with the following problem whilst working through publicly available COVID19 data. I am using a dataset that is both reliable and of good quality, however the data (total_confirmed) is reported using cumulative totals instead of daily counts of infections:
+----------------+---------------------+-----------------+
| country_region | date | total_confirmed |
+----------------+---------------------+-----------------+
| Afghanistan | 2020-04-05 00:00:00 | 349 |
| Afghanistan | 2020-04-06 00:00:00 | 367 |
| Afghanistan | 2020-04-07 00:00:00 | 423 |
| Albania | 2020-04-05 00:00:00 | 361 |
| Albania | 2020-04-06 00:00:00 | 377 |
| Albania | 2020-04-07 00:00:00 | 383 |
| Algeria | 2020-04-05 00:00:00 | 1320 |
| Algeria | 2020-04-06 00:00:00 | 1423 |
| Algeria | 2020-04-07 00:00:00 | 1468 |
+----------------+---------------------+-----------------+
My requirement is to have both the cumulative count and the daily new cases. There is an excellent solution for doing this here and it works like a charm on my dataset provided that I focus on one country only (I used a table populated with Afghanistan data only in this example):
SET @prev := NULL;
SELECT country_region
,`date` AS DateCreated
,total_confirmed - coalesce(@prev, total_confirmed) AS new_cases
,(@prev := total_confirmed) AS total_confirmed
FROM (
SELECT * FROM so_confirmed ORDER BY `date`
) t1
GROUP BY
country_region, total_confirmed, `date`
ORDER BY country_region, DateCreated;
Output:
+----------------+---------------------+-----------+-----------------+
| country_region | DateCreated | new_cases | total_confirmed |
+----------------+---------------------+-----------+-----------------+
| Afghanistan | 2020-04-05 00:00:00 | 0 | 349 |
| Afghanistan | 2020-04-06 00:00:00 | 18 | 367 |
| Afghanistan | 2020-04-07 00:00:00 | 56 | 423 |
+----------------+---------------------+-----------+-----------------+
However, the minute more than one country_region exists in the data, it completely fails and I do not know SQL quite well enough to figure out what I need to change.
+----------------+---------------------+-----------+-----------------+
| country_region | DateCreated | new_cases | total_confirmed |
+----------------+---------------------+-----------+-----------------+
| Afghanistan | 2020-04-05 00:00:00 | 0 | 349 |
| Afghanistan | 2020-04-06 00:00:00 | -953 | 367 |
| Afghanistan | 2020-04-07 00:00:00 | -1000 | 423 |
| Albania | 2020-04-05 00:00:00 | 12 | 361 |
| Albania | 2020-04-06 00:00:00 | 10 | 377 |
| Albania | 2020-04-07 00:00:00 | -40 | 383 |
| Algeria | 2020-04-05 00:00:00 | 959 | 1320 |
| Algeria | 2020-04-06 00:00:00 | 1046 | 1423 |
| Algeria | 2020-04-07 00:00:00 | 1085 | 1468 |
+----------------+---------------------+-----------+-----------------+
Desired Output:
+----------------+---------------------+-----------+-----------------+
| country_region | DateCreated | new_cases | total_confirmed |
+----------------+---------------------+-----------+-----------------+
| Afghanistan | 2020-04-05 00:00:00 | 0 | 349 |
| Afghanistan | 2020-04-06 00:00:00 | 18 | 367 |
| Afghanistan | 2020-04-07 00:00:00 | 56 | 423 |
| Albania | 2020-04-05 00:00:00 | 0 | 361 |
| Albania | 2020-04-06 00:00:00 | 16 | 377 |
| Albania | 2020-04-07 00:00:00 | 6 | 383 |
| Algeria | 2020-04-05 00:00:00 | 0 | 1320 |
| Algeria | 2020-04-06 00:00:00 | 103 | 1423 |
| Algeria | 2020-04-07 00:00:00 | 45 | 1468 |
+----------------+---------------------+-----------+-----------------+
Any assistance would be greatly appreciated. Obviously in a real-world dataset the new_cases values wouldn’t be 0 on 2020-04-05, but in this sample dataset that would be correct.
Advertisement
Answer
If you are running MySQL 8.0, you can do this with window function lag()
:
select
sc.*,
coalesce(
total_confirmed - lag(total_confirmed) over(partition by country_region order by datecreated),
0
) new_cases
from so_confirmed sc;