I want to fetch data from MySQL with RMySQL in R, but when I run the queries it takes ages. Here is my code:
```r
old_value <- data.frame()
new_value <- data.frame()
counter <- 0
for (i in 1:length(short_list$id)) {
  mydb = OpenConn(dbname = '**', user = '**', password = '**', host = '**')
  query <- paste0("select * from table where id IN (", short_list$id[i],
                  ") and country IN ('", short_list$country[i],
                  "') and date >= '2019-04-31' and `date` <= '2020-09-1';")
  temp_old <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1)
  query <- paste0("select * from table2 where id IN (", short_list$id[i],
                  ") and country IN ('", short_list$country[i],
                  "') and date >= '2019-04-31' and `date` <= '2020-09-1';")
  temp_new <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1)
  RMySQL::dbDisconnect(mydb)
  new_value <- rbind(temp_new, new_value)
  old_value <- rbind(temp_old, old_value)
  counter <- counter + 1
  base::print(paste("completed for ", counter))
}
```
Is there any way I can write this more efficiently and make the queries run faster? I have around 5,000 rows that need to go through the loop. The query works as is, but it takes a long time.
I have tried the following, but it still gives me an error:
```r
# parallel computing
clust <- makeCluster(length(6))
clusterEvalQ(cl = clust, expr = lapply(c('data.table', "RMySQL", "dplyr", "plyr"),
                                       library, character.only = TRUE))
clusterExport(cl = clust, c('config', 'short_list'), envir = environment())
new_de <- parLapply(clust, short_list, function(id, country) {
  for (i in 1:length(short_list$id)) {
    mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
    query <- paste0("select * from table1 where id IN (", short_list$id[i],
                    ") and country IN ('", short_list$country[i],
                    "') and source_event_date >= date >= '2019-04-31' and `date` <= '2020-09-1';")
    temp_data <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1) %>%
      data.table::data.table()
    RMySQL::dbDisconnect(mydb)
    return(temp_data)
  }
})
stopCluster(clust)
gc(reset = T)
new_de <- data.table::rbindlist(new_de, use.names = TRUE)
```
I have also converted short_list to a list as follows:
```r
short_list <- as.list(short_list)
```
and inside short_list is:
```
id   country
2    US
3    UK
...  ...
```
However it gives me this error:
```
Error in checkForRemoteErrors(val) :
  one node produced an error: object 'i' not found
```
However, when I remove i from id[i] and country[i], it only gives me the result for the first row rather than the results for all ids and countries.
Answer
I think an alternative is to upload the ids you need into a temporary table, and query for everything at once.
```r
tmptable <- "mytemptable"
dbWriteTable(conn, tmptable, short_list, create = TRUE)
alldat <- dbGetQuery(conn, paste("
  select t1.*
  from ", tmptable, " tmp
    left join table1 t1 on tmp.id = t1.id and tmp.country = t1.country
  where t1.`date` >= '2019-04-31'
    and t1.`date` <= '2020-09-1'"))
dbExecute(conn, paste("drop table", tmptable))
```
(Many DBMSes use a leading `#` to indicate a temporary table that is only visible to the local user, is much less likely to clash in the schema namespace, and is automatically cleaned up when the connection is closed. I generally encourage the use of temp tables here; check your DB docs, schema, and/or DBA for more info.)
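Since this question is about MySQL specifically, here is a minimal sketch of that idea using MySQL's `CREATE TEMPORARY TABLE` (MySQL does not use the `#` prefix); the column types are assumptions about `short_list`, and the rest mirrors the query above:

```r
library(DBI)

# Sketch only: a MySQL TEMPORARY table is visible only to this connection and
# is dropped automatically when the connection closes.  The column types
# (int, varchar(8)) are assumptions about short_list's contents.
dbExecute(conn, "create temporary table mytemptable (id int, country varchar(8))")
dbExecute(conn, sqlAppendTable(conn, "mytemptable", short_list, row.names = FALSE))

alldat <- dbGetQuery(conn, "
  select t1.*
  from mytemptable tmp
    left join table1 t1 on tmp.id = t1.id and tmp.country = t1.country
  where t1.`date` >= '2019-04-31'
    and t1.`date` <= '2020-09-1'")
# no explicit drop is required; the temporary table disappears on dbDisconnect(conn)
```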
The order of tables is important: by pulling everything from `mytemptable` and then left joining `table1` onto it, we are effectively filtering out any data from `table1` that does not have a matching `id` and `country`.
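For comparison, a sketch of the same query written as an explicit inner join; because the `where` clause filters on columns of `table1`, unmatched rows are discarded either way, so the two forms return the same rows here. The table and column names come from the query above:

```r
# Sketch only: the equivalent inner-join form of the query above.  The where
# clause filters on t1.`date`, so temp-table rows with no match in table1 are
# dropped in both versions.
alldat <- dbGetQuery(conn, paste("
  select t1.*
  from ", tmptable, " tmp
    inner join table1 t1 on tmp.id = t1.id and tmp.country = t1.country
  where t1.`date` >= '2019-04-31'
    and t1.`date` <= '2020-09-1'"))
```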
This doesn’t solve the speed of data download, but some thoughts on that:
- Each time you iterate through the queries, you have not-insignificant overhead; if there’s a lot of data then this overhead should not be huge, but it’s still there. Using a single query will reduce this overhead significantly.
- Query time can also be affected by any index(ices) on the tables. Outside the scope of this discussion, but might be relevant if you have a large-ish table. If the table is not indexed efficiently (or the query is not structured well to use those indices), then each query will take a finite amount of time to “compile” and return data. Again, overhead that will be reduced with a single more-efficient query.
- Large queries might benefit from using the command-line tool `mysql`; it is about as fast as you're going to get, and might iron over any issues in `RMySQL` and/or `DBI`. (I'm not saying they are inefficient, but... it is unlikely that a free open-source driver will be faster than MySQL's own command-line utility.) See the sketch after this list.
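A minimal sketch of calling the `mysql` client from R and reading its tab-separated output back in; the host, user, database, and file names are placeholders, and this assumes the `mysql` binary is on your PATH:

```r
# Sketch only: shell out to the mysql command-line client (--batch produces
# tab-separated output) and read the result back into R.  Host, user, and
# database names below are placeholders.
sql <- "select * from table1 where `date` >= '2019-04-31' and `date` <= '2020-09-1'"
system2(
  "mysql",
  args = c("--host=myhost", "--user=myuser", "--password", "--batch",
           "-e", shQuote(sql), "mydatabase"),
  stdout = "table1_dump.tsv"
)
alldat <- read.delim("table1_dump.tsv", stringsAsFactors = FALSE)
```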
As for doing this in parallel …
You're using `parLapply` incorrectly. It accepts a single vector/list and iterates over each object in that list. You might use it to iterate over the indices of a frame, but you cannot use it to iterate over multiple columns within that frame. This is exactly like base R's `lapply`.
Let's show what is going on when you make that call. I'll replace it with `lapply` (because debugging in multiple processes is difficult).

```r
# parLapply(clust, mtcars, function(id, country) { ... })
lapply(mtcars, function(id, country) { browser(); 1; })
# Called from: FUN(X[[i]], ...)
# debug at #1: [1] 1
id
# [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2
# [24] 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
country
# Error: argument "country" is missing, with no default
```
Because the argument (`mtcars` here, `short_list` in yours) is a `data.frame`, and since it is a `list`-like object, `lapply` (and `parLapply`) operate on one column at a time. You were hoping that it would "unzip" the data, applying the first column's value to `id` and the second column's value to `country`. In fact, there is a function that does this: `Map` (and the parallel's `clusterMap`, as I suggested in my comment). More on that later.

The intent of parallelizing things is to not use the `for` loop inside the parallel function. If `short_list` has 10 rows, and if your use of `parLapply` were correct, then you would be querying all rows 10 times, making your problem significantly worse. In pseudo-code, you'd be doing:

```
parallelize for each row in short_list:
  # this portion is run simultaneously in 10 different processes/threads
  for each row in short_list:
    query for data related to this row
```
Two alternatives:
Provide a single argument to `parLapply` representing the rows of the frame.

```r
new_de <- parLapply(clust, seq_len(NROW(short_list)), function(rownum) {
  mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
  on.exit({ DBI::dbDisconnect(mydb) })
  tryCatch(
    DBI::dbGetQuery(mydb, "
      select * from table1
      where id = ? and country = ?
        and `date` >= '2019-04-31' and `date` <= '2020-09-1'",
      params = list(short_list$id[rownum], short_list$country[rownum])),
    error = function(e) e)
})
```
Use `clusterMap` for the same effect.

```r
new_de <- clusterMap(clust, function(id, country) {
  mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
  on.exit({ DBI::dbDisconnect(mydb) })
  tryCatch(
    DBI::dbGetQuery(mydb, "
      select * from table1
      where id = ? and country = ?
        and `date` >= '2019-04-31' and `date` <= '2020-09-1'",
      params = list(id, country)),
    error = function(e) e)
}, short_list$id, short_list$country)
```
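If the cluster scaffolding itself is unfamiliar, here is a minimal, self-contained sketch with a toy worker function standing in for the query function above; the worker count of 6 is arbitrary, and in the real case you would also export `OpenConn` (e.g. with `clusterExport`) and load the database packages on the workers (e.g. with `clusterEvalQ`), as in your original attempt:

```r
library(parallel)

# Sketch only: scaffolding around clusterMap, with a toy worker function in
# place of the real query function.  Six workers is an arbitrary choice, and
# toy_list stands in for your short_list.
clust <- makeCluster(6)

toy_list <- data.frame(id = c(2, 3), country = c("US", "UK"))
res <- clusterMap(clust, function(id, country) paste(id, country, sep = "-"),
                  toy_list$id, toy_list$country)
# res is a list with one element per row of toy_list: "2-US", "3-UK"

stopCluster(clust)
```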
If you are not familiar with `Map`, it is like "zipping" together multiple vectors/lists. For example:

```r
myfun1 <- function(i) paste(i, "alone")
lapply(1:3, myfun1)
### "unrolls" to look like
list(
  myfun1(1),
  myfun1(2),
  myfun1(3)
)

myfun3 <- function(i, j, k) paste(i, j, k, sep = '-')
Map(f = myfun3, 1:3, 11:13, 21:23)
### "unrolls" to look like
list(
  myfun3(1, 11, 21),
  myfun3(2, 12, 22),
  myfun3(3, 13, 23)
)
```
Some liberties I took in that adapted code:
- I shifted from the `dbSendQuery`/`dbFetch` double-tap to a single call to `dbGetQuery`.
- I'm using `DBI` functions, since `DBI` provides a superset of what each driver's package provides. (You're likely using some of it anyway, perhaps without realizing it.) You can switch back with no issue.
- I added `tryCatch`, since sometimes errors can be difficult to deal with in parallel processes. This means you'll need to check the return value from each of your processes to see whether `inherits(ret, "error")` (problem) or `is.data.frame(ret)` (normal) holds; see the sketch after this list.
- I used `on.exit` so that even if there's a problem, the connection closure should still occur.
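A minimal sketch of that check, assuming `new_de` is the list returned by either parallel call above:

```r
# Sketch only: separate per-row errors from successful query results, then
# stack the data.frames into one table, as the original loop did with rbind.
ok <- vapply(new_de, is.data.frame, logical(1))
if (any(!ok)) {
  warning(sum(!ok), " row(s) produced errors; inspect new_de[!ok]")
}
new_de <- data.table::rbindlist(new_de[ok], use.names = TRUE)
```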