How to convert large .csv file with “too many columns” into SQL database

I was given a large .csv file (around 6.5 Gb) with 25k rows and 20k columns. Let’s call the first column ID1; each additional column then holds a value for each of these ID1s under a different condition. Let’s call these conditions ID2s.

This is the first time I have worked with files this large. I wanted to process the .csv file in R and summarize the values: mean, standard deviation and coefficient of variation for each ID1.

My idea was to read the file directly (with data.table’s fread), convert it into “long” data (with tidyr/dplyr) so I have three columns: ID1, ID2 and value, then group by ID1 and summarize. However, I do not seem to have enough memory even to read the file (I assume R needs more memory than the file’s size on disk to hold it).
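
In code, what I had in mind is roughly this (data.csv is a placeholder file name and ID1 a placeholder column name); this is the step that runs out of memory on the full file:

library(data.table)
library(tidyr)
library(dplyr)

input <- fread("data.csv")                        # read the whole 6.5 Gb file into memory

input %>%
  pivot_longer(-ID1, names_to = "ID2", values_to = "value") %>%   # wide -> long
  group_by(ID1) %>%
  summarize(mean = mean(value),
            sd   = sd(value),
            cv   = sd(value) / mean(value))       # coefficient of variation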

I think it would be more efficient to first convert the file into an SQL database and then process it from there. I have tried to convert it using sqlite3, but it gives me an error message stating that the maximum number of columns is 4096.

I have no experience with SQL, so I was wondering what the best way of converting the .csv file into a database would be. I guess reading each column and storing it in a table, or something like that, would work.
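
What I imagine the database should look like (just a guess, table and column names made up) is a single long table with one row per ID1/ID2 pair, i.e. three columns instead of 20k:

library(DBI)
library(tibble)

conn <- dbConnect(RSQLite::SQLite(), "data.sqlite")

# one row per (ID1, ID2) pair: 25k x 20k rows, but only three columns
dbCreateTable(conn, "measurements",
              tibble(ID1 = character(), ID2 = character(), value = double()))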

I have searched for similar questions, but most of them just say that having so many columns is bad db design. I cannot regenerate the .csv file with a different structure.

Any suggestions for an efficient way of processing the .csv file?

Best,

Edit: I was able to read the initial file in R, but I still run into some problems:

1- I cannot write it into an SQLite db because of the “too many columns” limit.

2- I cannot pivot it inside R because I get: Error: cannot allocate vector of size 7.8 Gb, even though my memory limit is high enough. I have 8.5 Gb of free memory and:

> memory.limit()
[1] 16222

I have used @danlooo’s code, but the data is not in the format I would like it to be. Probably I was not clear enough in explaining its structure.

Here is an example of how I would like the data to look (ID1 = Sample, ID2 = name, value = value):

> test = input[1:5,1:5]
> 
> test
      Sample DRX007662 DRX007663 DRX007664 DRX014481
1: AT1G01010 12.141565 16.281420 14.482322  35.19884
2: AT1G01020 12.166693 18.054251 12.075236  37.14983
3: AT1G01030  9.396695  9.704697  8.211935   4.36051
4: AT1G01040 25.278412 24.429031 22.484845  17.51553
5: AT1G01050 64.082870 66.022141 62.268711  58.06854
> test2 = pivot_longer(test, -Sample)
> test2
# A tibble: 20 x 3
   Sample    name      value
   <chr>     <chr>     <dbl>
 1 AT1G01010 DRX007662 12.1 
 2 AT1G01010 DRX007663 16.3 
 3 AT1G01010 DRX007664 14.5 
 4 AT1G01010 DRX014481 35.2 
 5 AT1G01020 DRX007662 12.2 
 6 AT1G01020 DRX007663 18.1 
 7 AT1G01020 DRX007664 12.1 
 8 AT1G01020 DRX014481 37.1 
 9 AT1G01030 DRX007662  9.40
10 AT1G01030 DRX007663  9.70
11 AT1G01030 DRX007664  8.21
12 AT1G01030 DRX014481  4.36
13 AT1G01040 DRX007662 25.3 
14 AT1G01040 DRX007663 24.4 
15 AT1G01040 DRX007664 22.5 
16 AT1G01040 DRX014481 17.5 
17 AT1G01050 DRX007662 64.1 
18 AT1G01050 DRX007663 66.0 
19 AT1G01050 DRX007664 62.3 
20 AT1G01050 DRX014481 58.1 

> test3 = test2 %>% group_by(Sample) %>% summarize(mean(value))
> test3
# A tibble: 5 x 2
  Sample         `mean(value)`
  <chr>                  <dbl>
1 AT1G01010              19.5 
2 AT1G01020              19.9 
3 AT1G01030               7.92
4 AT1G01040              22.4 
5 AT1G01050              62.6 

How should I change the code to make it look that way?

Thanks a lot!

Answer

Pivoting in SQL is very tedious and often requires writing nested queries for each column. SQLite3 is indeed the way to go if the data cannot fit in RAM. The code below reads the text file in chunks, pivots the data into long format and appends it to the SQL database. You can then access the database with dplyr verbs for summarizing. It uses another example dataset, because I have no idea which column types ID1 and ID2 have. You might want to do pivot_longer(-ID1) to have two name columns (ID1 and name).

library(tidyverse)
library(DBI)
library(vroom)

conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "data", tibble(name = character(), value = character()))

file <- "https://github.com/r-lib/vroom/raw/main/inst/extdata/mtcars.csv"
chunk_size <- 10 # read this many lines of the text file at once
n_chunks <- 5

# start with offset 1 to ignore header
for(chunk_offset in seq(1, chunk_size * n_chunks, by = chunk_size)) {
  # everything must be character to allow pivoting numeric and text columns
  vroom(file, skip = chunk_offset, n_max = chunk_size,
    col_names = FALSE, col_types = cols(.default = col_character())
  ) %>%
    pivot_longer(everything()) %>%
    dbAppendTable(conn, "data", value = .)
}

data <- conn %>% tbl("data")
data
#> # Source:   table<data> [?? x 2]
#> # Database: sqlite 3.37.0 [my-db.sqlite]
#>    name  value    
#>    <chr> <chr>    
#>  1 X1    Mazda RX4
#>  2 X2    21       
#>  3 X3    6        
#>  4 X4    160      
#>  5 X5    110      
#>  6 X6    3.9      
#>  7 X7    2.62     
#>  8 X8    16.46    
#>  9 X9    0        
#> 10 X10   1        
#> # … with more rows

data %>%
  # summarise only the 3rd column
  filter(name == "X3") %>%
  group_by(value) %>%
  count() %>%
  arrange(-n) %>%
  collect()
#> # A tibble: 3 × 2
#>   value     n
#>   <chr> <int>
#> 1 8        14
#> 2 4        11
#> 3 6         7

Created on 2022-04-15 by the reprex package (v2.0.1)
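
To get the format shown in the edit, a possible adaptation of the same chunked approach (a sketch, not tested on the real file; input.csv is a placeholder path, the Sample column name and the 25,000 row count come from the question) is to read the header once, keep the Sample column when pivoting each chunk, and let the database do the per-Sample aggregation:

library(tidyverse)
library(DBI)
library(vroom)

file <- "input.csv"                               # placeholder path to the real file
header <- names(vroom(file, n_max = 1))           # read the header once to get column names

conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "data",
              tibble(Sample = character(), name = character(), value = double()))

n_rows <- 25000                                   # data rows in the file (from the question)
chunk_size <- 1000

for (offset in seq(1, n_rows, by = chunk_size)) {
  # offset 1 skips the header line; values can stay numeric here
  # because every pivoted column has the same type
  vroom(file, skip = offset, n_max = chunk_size, col_names = header,
        col_types = cols(Sample = col_character(), .default = col_double())) %>%
    pivot_longer(-Sample) %>%                     # keep Sample, gather the 20k ID2 columns
    dbAppendTable(conn, "data", value = .)
}

# SQLite has no built-in sd(), so aggregate counts and sums in the database
# and finish mean / sd / coefficient of variation in R
conn %>%
  tbl("data") %>%
  group_by(Sample) %>%
  summarize(n  = n(),
            s  = sum(value, na.rm = TRUE),
            ss = sum(value * value, na.rm = TRUE)) %>%
  collect() %>%
  mutate(mean = s / n,
         sd   = sqrt((ss - s^2 / n) / (n - 1)),
         cv   = sd / mean)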
