How to convert large .csv file with “too many columns” into SQL database

I was given a large .csv file (around 6.5 Gb) with 25k rows and 20k columns. Let’s call the first column ID1; each additional column then holds a value for each of these ID1s under a different condition. Let’s call these conditions ID2s.

This is the first time I have worked with files this large. I wanted to process the .csv file in R and summarize the values: mean, standard deviation and coefficient of variation for each ID1.

My idea was to read the file directly (with data.table’s fread), convert it into “long” data (with tidyr/dplyr) so I have three columns: ID1, ID2 and value, then group by ID1 and summarize. However, I do not seem to have enough memory even to read the file (I assume R needs more memory than the file’s size on disk to hold it).
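
In code, what I had in mind is roughly this (data.csv is a placeholder file name and ID1 a placeholder column name); this is the step that runs out of memory on the full file:

library(data.table)
library(tidyr)
library(dplyr)

input <- fread("data.csv")                        # read the whole 6.5 Gb file into memory

input %>%
  pivot_longer(-ID1, names_to = "ID2", values_to = "value") %>%   # wide -> long
  group_by(ID1) %>%
  summarize(mean = mean(value),
            sd   = sd(value),
            cv   = sd(value) / mean(value))       # coefficient of variation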

I think it would be more efficient to first convert the file into an SQL database and then process it from there. I have tried to convert it using sqlite3, but it gives me an error message stating that the maximum number of columns is 4096.

I have no experience with SQL, so I was wondering what the best way of converting the .csv file into a database would be. I guess reading each column and storing it in a table, or something like that, would work.
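
What I imagine the database should look like (just a guess, table and column names made up) is a single long table with one row per ID1/ID2 pair, i.e. three columns instead of 20k:

library(DBI)
library(tibble)

conn <- dbConnect(RSQLite::SQLite(), "data.sqlite")

# one row per (ID1, ID2) pair: 25k x 20k rows, but only three columns
dbCreateTable(conn, "measurements",
              tibble(ID1 = character(), ID2 = character(), value = double()))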

I have searched for similar questions, but most of them just say that having so many columns is bad db design. I cannot regenerate the .csv file with a different structure.

Any suggestions for an efficient way of processing the .csv file?

Best,

Edit: I was able to read the initial file in R, but I still run into some problems:

1- I cannot write it into an SQLite db because of the “too many columns” limit.

2- I cannot pivot it inside R because I get: Error: cannot allocate vector of size 7.8 Gb, even though my memory limit is high enough. I have 8.5 Gb of free memory and:

> memory.limit()
[1] 16222

I have used @danlooo’s code, but the data is not in the format I would like it to be. Probably I was not clear enough in explaining its structure.

Here is an example of how I would like the data to look (ID1 = Sample, ID2 = name, value = value):

> test = input[1:5,1:5]
> 
> test
      Sample DRX007662 DRX007663 DRX007664 DRX014481
1: AT1G01010 12.141565 16.281420 14.482322  35.19884
2: AT1G01020 12.166693 18.054251 12.075236  37.14983
3: AT1G01030  9.396695  9.704697  8.211935   4.36051
4: AT1G01040 25.278412 24.429031 22.484845  17.51553
5: AT1G01050 64.082870 66.022141 62.268711  58.06854
> test2 = pivot_longer(test, -Sample)
> test2
# A tibble: 20 x 3
   Sample    name      value
   <chr>     <chr>     <dbl>
 1 AT1G01010 DRX007662 12.1 
 2 AT1G01010 DRX007663 16.3 
 3 AT1G01010 DRX007664 14.5 
 4 AT1G01010 DRX014481 35.2 
 5 AT1G01020 DRX007662 12.2 
 6 AT1G01020 DRX007663 18.1 
 7 AT1G01020 DRX007664 12.1 
 8 AT1G01020 DRX014481 37.1 
 9 AT1G01030 DRX007662  9.40
10 AT1G01030 DRX007663  9.70
11 AT1G01030 DRX007664  8.21
12 AT1G01030 DRX014481  4.36
13 AT1G01040 DRX007662 25.3 
14 AT1G01040 DRX007663 24.4 
15 AT1G01040 DRX007664 22.5 
16 AT1G01040 DRX014481 17.5 
17 AT1G01050 DRX007662 64.1 
18 AT1G01050 DRX007663 66.0 
19 AT1G01050 DRX007664 62.3 
20 AT1G01050 DRX014481 58.1 

> test3 = test2 %>% group_by(Sample) %>% summarize(mean(value))
> test3
# A tibble: 5 x 2
  Sample         `mean(value)`
  <chr>                  <dbl>
1 AT1G01010              19.5 
2 AT1G01020              19.9 
3 AT1G01030               7.92
4 AT1G01040              22.4 
5 AT1G01050              62.6 

How should I change the code to make it look that way?

Thanks a lot!

Answer

Pivoting in SQL is very tedious and often requires writing nested queries for each column. SQLite3 is indeed the way to go if the data cannot fit in RAM. The code below reads the text file in chunks, pivots the data into long format and appends it to the SQL database. You can then access the database with dplyr verbs for summarizing. It uses another example dataset, because I have no idea which column types ID1 and ID2 have. You might want to do pivot_longer(-ID1) to have two name columns (ID1 and name).

library(tidyverse)
library(DBI)
library(vroom)

conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "data", tibble(name = character(), value = character()))

file <- "https://github.com/r-lib/vroom/raw/main/inst/extdata/mtcars.csv"
chunk_size <- 10 # read this many lines of the text file at once
n_chunks <- 5

# start with offset 1 to ignore header
for(chunk_offset in seq(1, chunk_size * n_chunks, by = chunk_size)) {
  # everything must be character to allow pivoting numeric and text columns
  vroom(file, skip = chunk_offset, n_max = chunk_size,
    col_names = FALSE, col_types = cols(.default = col_character())
  ) %>%
    pivot_longer(everything()) %>%
    dbAppendTable(conn, "data", value = .)
}

data <- conn %>% tbl("data")
data
#> # Source:   table<data> [?? x 2]
#> # Database: sqlite 3.37.0 [my-db.sqlite]
#>    name  value    
#>    <chr> <chr>    
#>  1 X1    Mazda RX4
#>  2 X2    21       
#>  3 X3    6        
#>  4 X4    160      
#>  5 X5    110      
#>  6 X6    3.9      
#>  7 X7    2.62     
#>  8 X8    16.46    
#>  9 X9    0        
#> 10 X10   1        
#> # … with more rows

data %>%
  # summarise only the 3rd column
  filter(name == "X3") %>%
  group_by(value) %>%
  count() %>%
  arrange(-n) %>%
  collect()
#> # A tibble: 3 × 2
#>   value     n
#>   <chr> <int>
#> 1 8        14
#> 2 4        11
#> 3 6         7

Created on 2022-04-15 by the reprex package (v2.0.1)
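
To get the format shown in the edit, a possible adaptation of the same chunked approach (a sketch, not tested on the real file; input.csv is a placeholder path, the Sample column name and the 25,000 row count come from the question) is to read the header once, keep the Sample column when pivoting each chunk, and let the database do the per-Sample aggregation:

library(tidyverse)
library(DBI)
library(vroom)

file <- "input.csv"                               # placeholder path to the real file
header <- names(vroom(file, n_max = 1))           # read the header once to get column names

conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "data",
              tibble(Sample = character(), name = character(), value = double()))

n_rows <- 25000                                   # data rows in the file (from the question)
chunk_size <- 1000

for (offset in seq(1, n_rows, by = chunk_size)) {
  # offset 1 skips the header line; values can stay numeric here
  # because every pivoted column has the same type
  vroom(file, skip = offset, n_max = chunk_size, col_names = header,
        col_types = cols(Sample = col_character(), .default = col_double())) %>%
    pivot_longer(-Sample) %>%                     # keep Sample, gather the 20k ID2 columns
    dbAppendTable(conn, "data", value = .)
}

# SQLite has no built-in sd(), so aggregate counts and sums in the database
# and finish mean / sd / coefficient of variation in R
conn %>%
  tbl("data") %>%
  group_by(Sample) %>%
  summarize(n  = n(),
            s  = sum(value, na.rm = TRUE),
            ss = sum(value * value, na.rm = TRUE)) %>%
  collect() %>%
  mutate(mean = s / n,
         sd   = sqrt((ss - s^2 / n) / (n - 1)),
         cv   = sd / mean)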
