I’m working alongside a SQL tutorial using queryparser and tidyquery in R. This has been going well until I was asked to do:
SELECT ... SUM(new_vaccinations) OVER (PARTITION BY location) as vaccinations_to_date
Tidyquery reported that it did not support OVER
functions so I am trying to replicate the OVER (PARTITION BY...)
function with dplyr.
This led me to with_order(order_by =...
in dplyr. Now I’m struggling with getting the fun =
to allow me to create a cumulative sum column.
library(tidyverse) library(queryparser) library(tidyquery) mydf <- data.frame(date = as.Date(c("2021-06-01", '2021-06-02','2021-06-03','2021-06-04', '2021-6-01','2021-6-02','2021-6-03','2021-6-04', '2021-6-01','2021-6-02','2021-6-03','2021-6-04')), location = c('United States','United States','United States','United States', 'Canada','Canada','Canada','Canada','Mexico','Mexico','Mexico','Mexico'), new_vaccinations = c(100,98,32,50,99,34,97,53,35,101,97,56)) test <- mydf %>% mutate (total_vax = with_order(order_by = location, fun = cumsum(new_vaccinations), x = desc(location)))
This gives me the error
could not find function "fun"
Am I looking down the wrong rabbit hole when it comes to how to recreating OVER(PARTITION BY...)
? If so what is a better option?
Or am I missing how to properly use with_order(order_by =...)
?
If it is not clear, my goal is to create a new column that keeps a running total of vaccinations for each separate location.
Advertisement
Answer
The PARTITION BY
aspect of SQL can often be done in dplyr using group_by
.
And the ORDER BY
aspect of SQL can often be done in dplyr using arrange
.
Consider this R code:
library(dplyr) data(mtcars) mtcars %>% select(mpg, cyl) %>% group_by(cyl) %>% arrange(mpg) %>% mutate(new = cumsum(mpg)) %>% arrange(cyl, mpg) %>% select(cyl, mpg, new)
Is equivalent to this SQL:
SELECT cyl ,mpg ,SUM(mpg) OVER (PARTITON BY cyl ORDER BY mpg) AS new FROM mtcars