I am running MySQL Community Server version 8.0.19.
I have been struggling with the following problem whilst working through publicly available COVID19 data. I am using a dataset that is both reliable and of good quality, however the data (total_confirmed) is reported using cumulative totals instead of daily counts of infections:
+----------------+---------------------+-----------------+ | country_region | date | total_confirmed | +----------------+---------------------+-----------------+ | Afghanistan | 2020-04-05 00:00:00 | 349 | | Afghanistan | 2020-04-06 00:00:00 | 367 | | Afghanistan | 2020-04-07 00:00:00 | 423 | | Albania | 2020-04-05 00:00:00 | 361 | | Albania | 2020-04-06 00:00:00 | 377 | | Albania | 2020-04-07 00:00:00 | 383 | | Algeria | 2020-04-05 00:00:00 | 1320 | | Algeria | 2020-04-06 00:00:00 | 1423 | | Algeria | 2020-04-07 00:00:00 | 1468 | +----------------+---------------------+-----------------+
My requirement is to have both the cumulative count and the daily new cases. There is an excellent solution for doing this here and it works like a charm on my dataset provided that I focus on one country only (I used a table populated with Afghanistan data only in this example):
SET @prev := NULL; SELECT country_region ,`date` AS DateCreated ,total_confirmed - coalesce(@prev, total_confirmed) AS new_cases ,(@prev := total_confirmed) AS total_confirmed FROM ( SELECT * FROM so_confirmed ORDER BY `date` ) t1 GROUP BY country_region, total_confirmed, `date` ORDER BY country_region, DateCreated;
Output:
+----------------+---------------------+-----------+-----------------+ | country_region | DateCreated | new_cases | total_confirmed | +----------------+---------------------+-----------+-----------------+ | Afghanistan | 2020-04-05 00:00:00 | 0 | 349 | | Afghanistan | 2020-04-06 00:00:00 | 18 | 367 | | Afghanistan | 2020-04-07 00:00:00 | 56 | 423 | +----------------+---------------------+-----------+-----------------+
However, the minute more than one country_region exists in the data, it completely fails and I do not know SQL quite well enough to figure out what I need to change.
+----------------+---------------------+-----------+-----------------+ | country_region | DateCreated | new_cases | total_confirmed | +----------------+---------------------+-----------+-----------------+ | Afghanistan | 2020-04-05 00:00:00 | 0 | 349 | | Afghanistan | 2020-04-06 00:00:00 | -953 | 367 | | Afghanistan | 2020-04-07 00:00:00 | -1000 | 423 | | Albania | 2020-04-05 00:00:00 | 12 | 361 | | Albania | 2020-04-06 00:00:00 | 10 | 377 | | Albania | 2020-04-07 00:00:00 | -40 | 383 | | Algeria | 2020-04-05 00:00:00 | 959 | 1320 | | Algeria | 2020-04-06 00:00:00 | 1046 | 1423 | | Algeria | 2020-04-07 00:00:00 | 1085 | 1468 | +----------------+---------------------+-----------+-----------------+
Desired Output:
+----------------+---------------------+-----------+-----------------+ | country_region | DateCreated | new_cases | total_confirmed | +----------------+---------------------+-----------+-----------------+ | Afghanistan | 2020-04-05 00:00:00 | 0 | 349 | | Afghanistan | 2020-04-06 00:00:00 | 18 | 367 | | Afghanistan | 2020-04-07 00:00:00 | 56 | 423 | | Albania | 2020-04-05 00:00:00 | 0 | 361 | | Albania | 2020-04-06 00:00:00 | 16 | 377 | | Albania | 2020-04-07 00:00:00 | 6 | 383 | | Algeria | 2020-04-05 00:00:00 | 0 | 1320 | | Algeria | 2020-04-06 00:00:00 | 103 | 1423 | | Algeria | 2020-04-07 00:00:00 | 45 | 1468 | +----------------+---------------------+-----------+-----------------+
Any assistance would be greatly appreciated. Obviously in a real-world dataset the new_cases values wouldn’t be 0 on 2020-04-05, but in this sample dataset that would be correct.
Advertisement
Answer
If you are running MySQL 8.0, you can do this with window function lag()
:
select sc.*, coalesce( total_confirmed - lag(total_confirmed) over(partition by country_region order by datecreated), 0 ) new_cases from so_confirmed sc;