Skip to content
Advertisement

Sort by one variable, group by another, and select first row in SQL Query in R

I need to apply a procedure in SQL that is easy for me since R, but has been really tortuous in SQL.

I need to sort the data from highest to lowest by two variables, group based on another variable, and select the first item in each group.

I leave the code that I am trying to pass from R to SQL. Unfortunately the dbplyr package throws me an error when trying to convert one language to another: Error: first() is only available in a windowed (mutate()) context

library(tidyverse)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

mtcars2 <- tbl(con, "mtcars")
mtcars2

mtcars2 %>% 
  arrange(-mpg,-disp) %>% 
  group_by(cyl) %>% 
  summarise(hp = first(hp)) %>% 
  show_query()

It seems to me that the DISTINCT ON function could help me.

Thanks for your help.

Advertisement

Answer

Maybe the following?

library(tidyverse)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

mtcars2 <- tbl(con, "mtcars")

mtcars2 %>%
  arrange(-mpg,-disp) %>%
  group_by(cyl) %>%
  mutate(hp = first(hp)) %>% 
  select(cyl, hp) %>% 
  distinct %>% 
  show_query

#> <SQL>
#> SELECT DISTINCT `cyl`, FIRST_VALUE(`hp`) OVER (PARTITION BY `cyl` ORDER BY -`mpg`, -`disp`) AS `hp`
#> FROM `mtcars`
#> ORDER BY -`mpg`, -`disp`

See: https://github.com/tidyverse/dbplyr/issues/129

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement