
How do you filter a table in a database using a data frame, list, or vector in R?

I have a large set of IDs in a CSV file. How can I filter a database table using only that one-column table from the CSV file?

For example, in the ODBC database we have:

TABLE 1

+---------+------+
|   ID    | TYPE |
+---------+------+
| 43PRJIF | A    |
| 35IRPFJ | A    |
| 452JSU  | B    |
| 78JFIER | B    |
| 48IRUW  | C    |
| 89UEJDU | C    |
| 784NFJR | D    |
| 326NFR  | D    |
| 733ZREW | E    |
+---------+------+

And in the CSV file we have:

+---------+
|   ID    |
+---------+
| 89UEJDU |
| 784NFJR |
| 326NFR  |
| 733ZREW |
+---------+

Basically I would like to use something from the dbplyr package if possible, e.g. importing the CSV table into a data frame and then using dbplyr syntax like:

new_table <- TABLE1 %>%
  filter(id == "ROWS IN THE CSV")

To get an output like this:

+---------+------+
|   ID    | TYPE |
+---------+------+
| 89UEJDU | C    |
| 784NFJR | D    |
| 326NFR  | D    |
| 733ZREW | E    |
+---------+------+

Thank you for your help in advance!


Answer

In general, joining or merging tables requires them to share the same environment. Hence, there are three general options here:

  1. Load the remote table into R’s local workspace.
  2. Load the CSV table into the database and use a semi-join.
  3. ‘Smuggle’ the list of IDs from the CSV into the database.

Let’s consider each in turn:

Option 1

This is probably the simplest option, but it requires that the remote/ODBC table is small enough to fit in R’s working memory. If so, you can call local_table <- collect(remote_table) to load the database table into R and then filter it locally.
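For example, a minimal sketch, assuming db_connection is an existing DBI/odbc connection, the remote table is called TABLE1, and the CSV has a single column named ID (all of these names are placeholders):

library(dplyr)
library(dbplyr)

# Pull the whole remote table into R's memory
remote_table <- tbl(db_connection, "TABLE1")
local_table  <- collect(remote_table)

# Read the one-column CSV of IDs
csv_ids <- read.csv("ids.csv", stringsAsFactors = FALSE)

# Keep only the rows whose ID appears in the CSV
new_table <- local_table %>%
  semi_join(csv_ids, by = "ID")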

Option 2

dbplyr includes the command copy_to, which lets you copy local tables via ODBC to a database/remote connection. You will need permission to create tables in the remote environment.

This approach makes use of the DBI package. At the time of writing, v1.0.0 of DBI on CRAN has some limitations when writing to non-default schemas, so you may need to upgrade to the development version on GitHub.

Your code will look something like:

DBI::dbWriteTable(db_connection,
                  DBI::Id(schema = "schema", table = "name"),
                  r_table_name)
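Alternatively, dbplyr’s copy_to() can upload the IDs as a temporary table, after which a semi-join keeps the filtering on the database side. A sketch, assuming db_connection, the table name TABLE1, and the CSV path are placeholders:

library(dplyr)
library(dbplyr)

csv_ids <- read.csv("ids.csv", stringsAsFactors = FALSE)   # one column named ID

# Upload the IDs to the database as a temporary table
ids_remote <- copy_to(db_connection, csv_ids, "csv_ids",
                      temporary = TRUE, overwrite = TRUE)

table1 <- tbl(db_connection, "TABLE1")

# The semi-join is translated to SQL and runs in the database
new_table <- table1 %>%
  semi_join(ids_remote, by = "ID")

new_table %>% collect()   # bring the filtered rows back into R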

Option 3

Smuggle the list of IDs into the database via the table definition. This is the same idea as here, and works best if the list of IDs is short.

Remote tables are essentially defined by the code/query that fetches their results. Hence the list of IDs can appear in the code that defines your remote table. Consider the following example:

library(dplyr)
library(dbplyr)
data(mtcars)

list_of_ids <- c(1, 2, 3, 4)
df <- tbl_lazy(mtcars, con = simulate_mssql())
df %>% filter(ID %in% list_of_ids) %>% show_query()

show_query() renders the query that defines the current version of the remote table. In the example above it returns the following; note that the list of IDs now appears in the query.

<SQL>
SELECT *
FROM `df`
WHERE (`ID` IN (1.0, 2.0, 3.0, 4.0))

If the list of IDs is very long, the size of this query will become a problem. Hence there is a limit on the number of IDs you can filter on using this approach (I have not tested this approach to find the limit; I seldom use the IN clause for a list of more than 10).
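Applied to the question’s setup, a minimal sketch (again assuming db_connection and the table name TABLE1 are placeholders, and that the CSV has one column named ID):

library(dplyr)
library(dbplyr)

csv_ids   <- read.csv("ids.csv", stringsAsFactors = FALSE)
id_vector <- csv_ids$ID    # character vector of IDs from the CSV

table1 <- tbl(db_connection, "TABLE1")

# dbplyr inlines the vector into the generated SQL as WHERE ID IN (...)
new_table <- table1 %>%
  filter(ID %in% id_vector)

new_table %>% show_query()   # inspect the generated SQL
new_table %>% collect()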
