Creating a cumulative sum column with_order in R

I’m working alongside a SQL tutorial using queryparser and tidyquery in R. This has been going well until I was asked to do:

SELECT ... SUM(new_vaccinations) OVER (PARTITION BY location) as vaccinations_to_date

​x
 
SELECT ... SUM(new_vaccinations) OVER (PARTITION BY location) as vaccinations_to_date​

Tidyquery reported that it did not support OVER functions so I am trying to replicate the OVER (PARTITION BY...) function with dplyr.

This led me to with_order(order_by =... in dplyr. Now I’m struggling with getting the fun = to allow me to create a cumulative sum column.

library(tidyverse)
library(queryparser)
library(tidyquery)

mydf <- data.frame(date = as.Date(c("2021-06-01", '2021-06-02','2021-06-03','2021-06-04',
                                  '2021-6-01','2021-6-02','2021-6-03','2021-6-04',
                                  '2021-6-01','2021-6-02','2021-6-03','2021-6-04')),
                   location = c('United States','United States','United States','United States',
                                'Canada','Canada','Canada','Canada','Mexico','Mexico','Mexico','Mexico'),
                   new_vaccinations = c(100,98,32,50,99,34,97,53,35,101,97,56))

test <- mydf %>% 
  mutate (total_vax = with_order(order_by = location, fun = cumsum(new_vaccinations), x = desc(location)))

 
library(tidyverse)library(queryparser)library(tidyquery)​mydf <- data.frame(date = as.Date(c("2021-06-01", '2021-06-02','2021-06-03','2021-06-04',                                  '2021-6-01','2021-6-02','2021-6-03','2021-6-04',                                  '2021-6-01','2021-6-02','2021-6-03','2021-6-04')),                   location = c('United States','United States','United States','United States',                                'Canada','Canada','Canada','Canada','Mexico','Mexico','Mexico','Mexico'),                   new_vaccinations = c(100,98,32,50,99,34,97,53,35,101,97,56))​test <- mydf %>%   mutate (total_vax = with_order(order_by = location, fun = cumsum(new_vaccinations), x = desc(location)))​

This gives me the error

could not find function "fun"

 
could not find function "fun"​

Am I looking down the wrong rabbit hole when it comes to how to recreating OVER(PARTITION BY...)? If so what is a better option? Or am I missing how to properly use with_order(order_by =...)?

If it is not clear, my goal is to create a new column that keeps a running total of vaccinations for each separate location.

Answer

The PARTITION BY aspect of SQL can often be done in dplyr using group_by.

And the ORDER BY aspect of SQL can often be done in dplyr using arrange.

Consider this R code:

library(dplyr)
data(mtcars)

mtcars %>%
  select(mpg, cyl) %>%
  group_by(cyl) %>%
  arrange(mpg) %>%
  mutate(new = cumsum(mpg)) %>%
  arrange(cyl, mpg) %>%
  select(cyl, mpg, new)

 
library(dplyr)data(mtcars)​mtcars %>%  select(mpg, cyl) %>%  group_by(cyl) %>%  arrange(mpg) %>%  mutate(new = cumsum(mpg)) %>%  arrange(cyl, mpg) %>%  select(cyl, mpg, new)​

Is equivalent to this SQL:

SELECT cyl
    ,mpg
    ,SUM(mpg) OVER (PARTITON BY cyl ORDER BY mpg) AS new
FROM mtcars

 
SELECT cyl    ,mpg    ,SUM(mpg) OVER (PARTITON BY cyl ORDER BY mpg) AS newFROM mtcars​

Advertisement

Answer