Skip to content
Advertisement

Can you name dbplyr’s simulated lazy tables?

dbplyr has some very useful-looking simulation functions so you can write queries while not connected to any real database, but I can’t seem to get actual table names into any of the queries I write that way. All their names are just `df`, and I can’t seem to modify them afterward. In fact, I don’t see `df` anywhere in the query object or its attributes (it doesn’t have any), so now I have no idea how dbplyr processes table names at all.

MWE:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(magrittr)
library(purrr, warn.conflicts = FALSE)

query <- tbl_lazy(df = mtcars)
query %$% names(ops)
#> [1] "x"    "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`

# The actual data frame is stored in the object under the name `x`, but
# renaming it has no effect, unsurprisingly (since it wasn't named `df`
# anyway)
query %<>% modify_at("ops", set_names, "mtcars", "vars")
query %$% names(ops)
#> [1] "mtcars" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`

My use case, by the way, is that I need to run SQL queries in another system with actual server access, so I’d like to have R scripts that produce SQL syntax that’s ready to run in that system, even though R can’t connect to it. Making an empty dummy database with the structure of the real thing (table & column names, column types, but no rows) is an option, but, obviously, it’d be simpler to just use these free-form simulations, iff the SQL can be generated ready to cut and paste. (lazy_frame() looked more appropriate for such non-existent tables, but, guess what, it’s really just a wrapper for tbl_lazy(tibble()), so, same exact `df` name problem.)

Created on 2019-12-12 by the reprex package (v0.3.0)

Advertisement

Answer

I am not aware of any way to rename the simulated tables. According to to documentation, the important point of the simulate_* functions is to test database translation without actually connecting to a database.

When connected to a remote table, dbplyr uses the database, schema, and table name defined using tbl(). It also fetches the column names. Because of this, I would recommend developing in an environment where R can connect to the database. Consider the following:

# simulated
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
df_sim %>% head(5) %>% show_query()

# output
<SQL>
SELECT TOP(5) *
FROM `df`

# actual
df = tbl('db_table_name', con = database_connection_object)
df %>% head(5) %>% show_query()

# output
<SQL>
SELECT TOP(5) col1, col2, col3
FROM "database"."db_table_name"

Not only does df get replaced by the table name, but the * in the simulated query is replaced by column names in the second query.

One option you might consider if it is important to generate SQL scripts via simulation is converting to text, replacing, and converting back. For example:

df_sim = tbl_lazy(mtcars, con = simulate_mssql())
query = df_sim %>% head(5) %>% as.character()
query = gsub("`df`", "[db].[schema].[table]", query)

# write query out to file
writeLines(query, "file.sql")
# OR create a remote connection
remote_table = tbl(db_connection, sql(query))

remote_table %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM [db].[schema].[table]
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement