Skip to content
Advertisement

MySQL Get Change From Cumulative Results in Consecutive Rows by Identifier

I am running MySQL Community Server version 8.0.19.

I have been struggling with the following problem whilst working through publicly available COVID19 data. I am using a dataset that is both reliable and of good quality, however the data (total_confirmed) is reported using cumulative totals instead of daily counts of infections:

+----------------+---------------------+-----------------+
| country_region | date                | total_confirmed |
+----------------+---------------------+-----------------+
| Afghanistan    | 2020-04-05 00:00:00 |             349 |
| Afghanistan    | 2020-04-06 00:00:00 |             367 |
| Afghanistan    | 2020-04-07 00:00:00 |             423 |
| Albania        | 2020-04-05 00:00:00 |             361 |
| Albania        | 2020-04-06 00:00:00 |             377 |
| Albania        | 2020-04-07 00:00:00 |             383 |
| Algeria        | 2020-04-05 00:00:00 |            1320 |
| Algeria        | 2020-04-06 00:00:00 |            1423 |
| Algeria        | 2020-04-07 00:00:00 |            1468 |
+----------------+---------------------+-----------------+

My requirement is to have both the cumulative count and the daily new cases. There is an excellent solution for doing this here and it works like a charm on my dataset provided that I focus on one country only (I used a table populated with Afghanistan data only in this example):

SET @prev := NULL;

SELECT country_region
      ,`date` AS DateCreated
      ,total_confirmed - coalesce(@prev, total_confirmed) AS new_cases
      ,(@prev := total_confirmed) AS total_confirmed
FROM (
        SELECT * FROM so_confirmed ORDER BY `date`
     ) t1
GROUP BY
     country_region, total_confirmed, `date`
ORDER BY country_region, DateCreated;

Output:

+----------------+---------------------+-----------+-----------------+
| country_region | DateCreated         | new_cases | total_confirmed |
+----------------+---------------------+-----------+-----------------+
| Afghanistan    | 2020-04-05 00:00:00 |         0 |             349 |
| Afghanistan    | 2020-04-06 00:00:00 |        18 |             367 |
| Afghanistan    | 2020-04-07 00:00:00 |        56 |             423 |
+----------------+---------------------+-----------+-----------------+

However, the minute more than one country_region exists in the data, it completely fails and I do not know SQL quite well enough to figure out what I need to change.

+----------------+---------------------+-----------+-----------------+
| country_region | DateCreated         | new_cases | total_confirmed |
+----------------+---------------------+-----------+-----------------+
| Afghanistan    | 2020-04-05 00:00:00 |         0 |             349 |
| Afghanistan    | 2020-04-06 00:00:00 |      -953 |             367 |
| Afghanistan    | 2020-04-07 00:00:00 |     -1000 |             423 |
| Albania        | 2020-04-05 00:00:00 |        12 |             361 |
| Albania        | 2020-04-06 00:00:00 |        10 |             377 |
| Albania        | 2020-04-07 00:00:00 |       -40 |             383 |
| Algeria        | 2020-04-05 00:00:00 |       959 |            1320 |
| Algeria        | 2020-04-06 00:00:00 |      1046 |            1423 |
| Algeria        | 2020-04-07 00:00:00 |      1085 |            1468 |
+----------------+---------------------+-----------+-----------------+

Desired Output:

+----------------+---------------------+-----------+-----------------+
| country_region | DateCreated         | new_cases | total_confirmed |
+----------------+---------------------+-----------+-----------------+
| Afghanistan    | 2020-04-05 00:00:00 |         0 |             349 |
| Afghanistan    | 2020-04-06 00:00:00 |        18 |             367 |
| Afghanistan    | 2020-04-07 00:00:00 |        56 |             423 |
| Albania        | 2020-04-05 00:00:00 |         0 |             361 |
| Albania        | 2020-04-06 00:00:00 |        16 |             377 |
| Albania        | 2020-04-07 00:00:00 |         6 |             383 |
| Algeria        | 2020-04-05 00:00:00 |         0 |            1320 |
| Algeria        | 2020-04-06 00:00:00 |       103 |            1423 |
| Algeria        | 2020-04-07 00:00:00 |        45 |            1468 |
+----------------+---------------------+-----------+-----------------+

Any assistance would be greatly appreciated. Obviously in a real-world dataset the new_cases values wouldn’t be 0 on 2020-04-05, but in this sample dataset that would be correct.

Advertisement

Answer

If you are running MySQL 8.0, you can do this with window function lag():

select
    sc.*,
    coalesce(
        total_confirmed - lag(total_confirmed) over(partition by country_region order by datecreated),
        0
    ) new_cases
from so_confirmed sc;
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement