
How do you filter a table in a database using a data frame, list, or vector in R?

I have a large set of IDs in a CSV file. How can I filter a database table using only that one-column table from the CSV file?

For example, in the ODBC database we have:

TABLE 1

+---------+------+
|   ID    | TYPE |
+---------+------+
| 43PRJIF | A    |
| 35IRPFJ | A    |
| 452JSU  | B    |
| 78JFIER | B    |
| 48IRUW  | C    |
| 89UEJDU | C    |
| 784NFJR | D    |
| 326NFR  | D    |
| 733ZREW | E    |
+---------+------+

And in the CSV file we have:

+---------+
|   ID    |
+---------+
| 89UEJDU |
| 784NFJR |
| 326NFR  |
| 733ZREW |
+---------+

Basically I would like to use something from the dbplyr package if possible, e.g. importing the CSV table into a data frame and then using dbplyr syntax like:

new_table <- TABLE1 %>%
  filter(id == "ROWS IN THE CSV")

To get an output like this:

+---------+------+
|   ID    | TYPE |
+---------+------+
| 89UEJDU | C    |
| 784NFJR | D    |
| 326NFR  | D    |
| 733ZREW | E    |
+---------+------+

Thank you for your help in advance!


Answer

In general, joining or merging tables requires them to share the same environment. Hence, there are three general options here:

  1. Load the remote table into R’s local workspace.
  2. Load the CSV table into the database and use a semi-join.
  3. ‘Smuggle’ the list of IDs from the CSV into the database.

Let’s consider each in turn:

Option 1

This is probably the simplest option, but it requires that the remote/ODBC table is small enough to fit in R’s working memory. If so, you can call local_table <- collect(remote_table) to load the database table into R and then filter it locally.
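For example, a minimal sketch, assuming db_connection is an existing DBI/odbc connection, the remote table is called TABLE1, and the CSV has a single column named ID (all of these names are placeholders):

library(dplyr)
library(dbplyr)

# Pull the whole remote table into R's memory
remote_table <- tbl(db_connection, "TABLE1")
local_table  <- collect(remote_table)

# Read the one-column CSV of IDs
csv_ids <- read.csv("ids.csv", stringsAsFactors = FALSE)

# Keep only the rows whose ID appears in the CSV
new_table <- local_table %>%
  semi_join(csv_ids, by = "ID")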

Option 2

dbplyr includes the command copy_to, which lets you copy local tables via ODBC to a database/remote connection. You will need permission to create tables in the remote environment.

This approach makes use of the DBI package. At the time of writing, v1.0.0 of DBI on CRAN has some limitations when writing to non-default schemas, so you may need to upgrade to the development version on GitHub.

Your code will look something like:

DBI::dbWriteTable(db_connection,
                  DBI::Id(schema = "schema", table = "name"),
                  r_table_name)
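Alternatively, dbplyr’s copy_to() can upload the IDs as a temporary table, after which a semi-join keeps the filtering on the database side. A sketch, assuming db_connection, the table name TABLE1, and the CSV path are placeholders:

library(dplyr)
library(dbplyr)

csv_ids <- read.csv("ids.csv", stringsAsFactors = FALSE)   # one column named ID

# Upload the IDs to the database as a temporary table
ids_remote <- copy_to(db_connection, csv_ids, "csv_ids",
                      temporary = TRUE, overwrite = TRUE)

table1 <- tbl(db_connection, "TABLE1")

# The semi-join is translated to SQL and runs in the database
new_table <- table1 %>%
  semi_join(ids_remote, by = "ID")

new_table %>% collect()   # bring the filtered rows back into R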

Option 3

Smuggle the list of IDs into the database via the table definition. This is the same idea as here, and works best if the list of IDs is short.

Remote tables are essentially defined by the code/query that fetches their results. Hence the list of IDs can appear in the code that defines your remote table. Consider the following example:

library(dplyr)
library(dbplyr)
data(mtcars)

list_of_ids <- c(1, 2, 3, 4)
df <- tbl_lazy(mtcars, con = simulate_mssql())
df %>% filter(ID %in% list_of_ids) %>% show_query()

show_query() renders the query that defines the current version of the remote table. In the example above it returns the following; note that the list of IDs now appears in the query.

<SQL>
SELECT *
FROM `df`
WHERE (`ID` IN (1.0, 2.0, 3.0, 4.0))

If the list of IDs is very long, the size of this query will become a problem. Hence there is a limit on the number of IDs you can filter on using this approach (I have not tested this approach to find the limit; I seldom use the IN clause for a list of more than 10).
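Applied to the question’s setup, a minimal sketch (again assuming db_connection and the table name TABLE1 are placeholders, and that the CSV has one column named ID):

library(dplyr)
library(dbplyr)

csv_ids   <- read.csv("ids.csv", stringsAsFactors = FALSE)
id_vector <- csv_ids$ID    # character vector of IDs from the CSV

table1 <- tbl(db_connection, "TABLE1")

# dbplyr inlines the vector into the generated SQL as WHERE ID IN (...)
new_table <- table1 %>%
  filter(ID %in% id_vector)

new_table %>% show_query()   # inspect the generated SQL
new_table %>% collect()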
