So pretty much I am pulling a whole table from my database using:
ss_data <- dbConnect(MySQL(), user = , password = , host = , dbname = )
Combined_Query <- gsub("[rn]", " ", paste0("select * FROM ", table_name))
Q_temp <- dbSendQuery(ss_data, Combined_Query)
temp <- fetch(Q_temp, n = -1)
report_pull <- rbind(report_pull, temp)
dbDisconnect(ss_data)
So this pulls my data as I need. After that I perform some tests to check the size of each column, and I reassign field types based on those tests. Then I send the data back to the database using:
fieldtype <- setNames(newcoltypes, colnames(report_pull))
ss_data <- dbConnect(MySQL(), user = , password = , host = , dbname = )
dbWriteTable(ss_data, value = report_pull, name = "datatest1", overwrite = TRUE, row.names = FALSE, field.types = fieldtype)
dbDisconnect(ss_data)
Because I am not manipulating any of the data before sending it back, I was wondering if there is a way to send only the field.types information.
I am going to be doing this for ~500 GB worth of tables, so it would greatly reduce the time spent if I didn't have to push the row values back every time.
EDIT: After reading the comments, I started updating the column types in my loop.
for (x in 1:ncol(report_pull)) {
  ss_data <- dbConnect(MySQL(), user = , password = , host = , dbname = )
  a <- unique(report_pull[x])
  a[a == ""] <- NA
  a <- na.omit(a)
  if (length(a[,]) != 0) {
    if ((grepl("^[0-9.]+$", a[,])) == TRUE) {
      if ((grepl("^[0-9]+$", a[,])) == TRUE) {
        if (typeof(report_pull[,x]) == "double") {
          dbExecute(ss_data, paste0("ALTER TABLE ", table, " MODIFY COLUMN `", colnames(report_pull[x]), "` double;"))
          listofcol[x] <- "double"
        } else {
          # For int, we need to make it tinytext, int isnt working on our DB
          dbExecute(ss_data, paste0("ALTER TABLE ", table, " MODIFY COLUMN `", colnames(report_pull[x]), "` tinytext;"))
          listofcol[x] <- "int - tinytext"
        }
      } else {
        dbExecute(ss_data, paste0("ALTER TABLE ", table, " MODIFY COLUMN `", colnames(report_pull[x]), "` double;"))
        listofcol[x] <- "double"
      }
    } else {
      if (between(max(nchar(a[,])), 50, 255)) {
        dbExecute(ss_data, paste0("ALTER TABLE ", table, " MODIFY COLUMN `", colnames(report_pull[x]), "` varchar(255);"))
        listofcol[x] <- paste0("varchar(255)")
      } else if (max(nchar(a[,])) > 255) {
        dbExecute(ss_data, paste0("ALTER TABLE ", table, " MODIFY COLUMN `", colnames(report_pull[x]), "` tinytext;"))
        listofcol[x] <- paste0("tinytext")
      } else {
        dbExecute(ss_data, paste0("ALTER TABLE ", table, " MODIFY COLUMN `", colnames(report_pull[x]), "` varchar(", max(nchar(a[,])), ");"))
        listofcol[x] <- paste0("varchar(", max(nchar(a[,])), ")")
      }
    }
    if (grepl("url|URL", (colnames(report_pull[x]))) == TRUE) {
      dbExecute(ss_data, paste0("ALTER TABLE ", table, " MODIFY COLUMN `", colnames(report_pull[x]), "` tinytext;"))
      listofcol[x] <- "tinytext"
    }
  } else {
    print(paste0("Element ", x[1], " has 0 entries."))
    dbExecute(ss_data, paste0("ALTER TABLE ", table, " MODIFY COLUMN `", colnames(report_pull[x]), "` varchar(255);"))
    listofcol[x] <- "varchar(255)"
  }
  dbDisconnect(ss_data)
}
The issue I am facing is in this line of code:
dbExecute(ss_data,paste0("ALTER TABLE ",table," MODIFY COLUMN `", colnames(report_pull[x]) ,"` tinytext;"))
The error I get, both in the DB and in R, is: Incorrect integer value: '' for column 'id' at row 1
I think the issue is that when this table was made, all the field types were set as TEXT. I'm just having a hard time setting it to int.
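My guess is that the empty strings in the column are what MySQL rejects when converting to int, so presumably something like this would be needed before the MODIFY (just a sketch of what I have in mind; it assumes the id column is nullable and that blanks are the only non-integer values):

# Blank out empty strings first, since MySQL cannot coerce '' to an integer,
# then change the column type (sketch only, not tested)
dbExecute(ss_data, paste0("UPDATE ", table, " SET `id` = NULL WHERE `id` = '';"))
dbExecute(ss_data, paste0("ALTER TABLE ", table, " MODIFY COLUMN `id` int;"))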
Answer
I’m not certain that this will fix all of your woes, but it’s an in-database method (with some help from R) for redefining column types. I’m assuming the data is always uploaded as TEXT regardless of the actual data, so this process will attempt to fix that.
Docker setup
Not required for your setup, but useful if anybody else needs to test this.
$ docker pull mysql:8.0.26
$ docker run -p 3306:3306 --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:8.0.26
Confirm connection in R:
my <- DBI::dbConnect(odbc::odbc(), Driver = "MySQL ODBC 8.0 ANSI Driver", pwd = "my-secret-pw", uid = "root")
DBI::dbGetQuery(my, "SHOW VARIABLES LIKE 'version';")
#   Variable_name  Value
# 1       version 8.0.26
DBI::dbExecute(my, "create schema stackoverflow")
DBI::dbExecute(my, "use stackoverflow")
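If you don’t have an ODBC driver handy, the same connection should be possible with RMariaDB (a sketch; the host, port, and credentials below are assumptions matching the docker command above):

# Alternative connection via RMariaDB instead of odbc (sketch)
my <- DBI::dbConnect(RMariaDB::MariaDB(),
                     host = "127.0.0.1", port = 3306,
                     user = "root", password = "my-secret-pw")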
And when demos/tests are complete (after running the rest of this answer’s code),
DBI::dbExecute(my, "drop table sometable") DBI::dbExecute(my, "drop schema stackoverflow")
Fake data
DBI::dbWriteTable(my, "sometable", data.frame( longstring = c("a very long string here that is going to trigger perhaps a varchar(255)", "234", "3.1415"), somefloat = c("234", "3.1415", "2.718271828"), someint = c("234", "3", "2"), longerstring = c(strrep("A", 260), "A", "A") ))
Retrieve table stats
This is where most of the ‘pain’ will be felt: I have not tested this on gargantuan tables, so please provide some feedback. Since it uses a lot of REGEXP, I don’t assume that it will return in milliseconds, but perhaps it’s faster than downloading and loading into R.
tests <- do.call(rbind, lapply(DBI::dbListFields(my, "sometable"), function(nm) {
  DBI::dbGetQuery(my, sprintf("
    select %s as column_name,
           count(*) as n,
           sum(%s regexp '^[-+]?.*[^0-9.].*$') as nonnumber,
           sum(%s regexp '^[-+]?.*[^0-9].*$') as nonint,
           max(length(%s)) as maxlen
    from sometable
    where %s is not null and %s <> ''",
    sQuote(nm), nm, nm, nm, nm, nm))
}))
tests
#    column_name n nonnumber nonint maxlen
# 1   longstring 3         1      2     71
# 2    somefloat 3         0      2     11
# 3      someint 3         0      0      3
# 4 longerstring 3         3      3    260
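Since you will be repeating this over many tables, it may help to wrap the same query in a small helper. This is just a sketch: column_stats is a name I made up, and the table/column names are interpolated directly into the SQL, so only use it with trusted identifiers.

# Helper that computes the same per-column statistics for an arbitrary table (sketch)
column_stats <- function(con, table_name) {
  do.call(rbind, lapply(DBI::dbListFields(con, table_name), function(nm) {
    DBI::dbGetQuery(con, sprintf("
      select %s as column_name,
             count(*) as n,
             sum(%s regexp '^[-+]?.*[^0-9.].*$') as nonnumber,
             sum(%s regexp '^[-+]?.*[^0-9].*$') as nonint,
             max(length(%s)) as maxlen
      from %s
      where %s is not null and %s <> ''",
      sQuote(nm), nm, nm, nm, table_name, nm, nm))
  }))
}
# tests <- column_stats(my, "sometable")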
From here, we need to use these “rules” to determine what the new column type should be. I’m using dplyr::case_when because I find it much better than nested ifelse or the like. If you’re using dplyr in general, then this will fit right into your current dialect and toolset; if you’re using data.table, then it should be rather straightforward to translate this to data.table::fcase.
tests$new_type <- with(tests, dplyr::case_when(
  n == 0 ~ "varchar(255)",
  grepl("url", column_name, ignore.case = TRUE) ~ "tinytext",
  nonint == 0 ~ "tinytext", # For int, we need to make it tinytext, int isnt working on our DB
  nonnumber == 0 ~ "double",
  maxlen < 50 ~ "tinytext",
  between(maxlen, 50, 255) ~ "varchar(255)",
  TRUE ~ sprintf("varchar(%s)", maxlen)))
tests
#    column_name n nonnumber nonint maxlen     new_type
# 1   longstring 3         1      2     71 varchar(255)
# 2    somefloat 3         0      2     11       double
# 3      someint 3         0      0      3     tinytext
# 4 longerstring 3         3      3    260 varchar(260)
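For reference, here is a rough sketch of the data.table translation mentioned above (it assumes a data.table version recent enough to provide fcase, and that tests has been converted with setDT):

# data.table translation of the same rules (sketch)
library(data.table)
setDT(tests)
tests[, new_type := fcase(
  n == 0, "varchar(255)",
  grepl("url", column_name, ignore.case = TRUE), "tinytext",
  nonint == 0, "tinytext",
  nonnumber == 0, "double",
  maxlen < 50, "tinytext",
  maxlen >= 50 & maxlen <= 255, "varchar(255)",
  rep(TRUE, .N), sprintf("varchar(%s)", maxlen)   # catch-all instead of 'default='
)]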
You will need to review these rules carefully to make sure they capture all of your intent. However, I think the premise of simple rules like this (and the four “meta” columns I stored into tests) is a good start.
Update the columns
Copying the method from https://dba.stackexchange.com/a/198635/156305, I’ll loop over the rows, create a new column as the new type, copy/CAST from the old column, remove the old column, then rename the new to the original name.
I’ll show the column data types before and after the conversion.
DBI::dbGetQuery(my, "select column_name, data_type, character_maximum_length from information_schema.columns where table_name='sometable'") # COLUMN_NAME DATA_TYPE CHARACTER_MAXIMUM_LENGTH # 1 longerstring text 65535 # 2 longstring text 65535 # 3 somefloat text 65535 # 4 someint text 65535 for (rownum in seq_len(nrow(tests))) { nm <- tests$column_name[rownum] typ <- tests$new_type[rownum] typ2 <- if (grepl("varchar", typ)) { gsub("var", "", typ) } else if (typ == "tinytext") { "char(50)" } else typ message(sprintf("Converting column '%s' to '%s' ('%s')", nm, typ, typ2)) DBI::dbExecute(my, sprintf("ALTER TABLE sometable ADD COLUMN newcolumn %s", typ)) DBI::dbExecute(my, sprintf("UPDATE sometable SET newcolumn = cast(%s as %s)", nm, typ2)) DBI::dbExecute(my, sprintf("ALTER TABLE sometable DROP COLUMN %s", nm)) DBI::dbExecute(my, sprintf("ALTER TABLE sometable RENAME COLUMN newcolumn to %s", nm)) } # Converting column 'longstring' to 'varchar(255)' ('char(255)') # Converting column 'somefloat' to 'double' ('double') # Converting column 'someint' to 'tinytext' ('char(50)') # Converting column 'longerstring' to 'varchar(260)' ('char(260)') DBI::dbGetQuery(my, "select column_name, data_type, character_maximum_length from information_schema.columns where table_name='sometable'") # COLUMN_NAME DATA_TYPE CHARACTER_MAXIMUM_LENGTH # 1 longerstring varchar 260 # 2 longstring varchar 255 # 3 somefloat double <NA> # 4 someint tinytext 255
And to prove that the data is still present and correct,
str(DBI::dbGetQuery(my, "select * from sometable"))
# 'data.frame': 3 obs. of  4 variables:
#  $ longstring  : chr  "a very long string here that is going to trigger perhaps a varchar(255)" "234" "3.1415"
#  $ somefloat   : num  234 3.14 2.72
#  $ someint     : chr  "234" "3" "2"
#  $ longerstring: chr  "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"| __truncated__ "A" "A"
Caveats
- I do not have much experience with the performance of REGEXP; I might be abusing a “heavy tool” here.
- Copying data from one column to the other may not be cheap; I have not tested with “large-ish data”.
- I use char(.) and varchar(.); you may prefer to use nchar(.) and nvarchar(.), depending on your data.
- The reason I create typ2 is because MySQL does not accept cast(.. as varchar(n)) (https://stackoverflow.com/a/59179977/3358272); it also appears to not like cast(.. as tinytext). Ergo the temporary translation during casting.
- There may be much more MySQL-canonical ways to do many of these steps. I do not profess MySQL guru-status by any stretch.