Faster CSV Import with R

This article is inspired by a Twitter conversation with Jason Hathorn (@JJHeathview) and Tom (@UTVIlla), responding to an earlier post, Create an SQLite database with R.

One of the key advantages of converting large CSV files to an SQLite database is the speed with which subsequent data queries can be returned. Other benefits include the re-usability of the data structures and, for those who already know SQL, the re-use of existing knowledge.

After Jason noted how quickly the data.table package imports CSVs, Tom suggested fread, from that same package, as a very fast way to import large datasets. This page on the R-Bloggers website also compares the different methods of importing CSV files, and fread is clearly a long way ahead of the competition. However, the R-Bloggers page does not include an SQL comparison.

The dataset used in these comparisons is a 52MB CSV file of 44,246 rows and 267 columns. It contains historical rating and racing data from the Formbet service, covering a seven-month period beginning in January 2014. Unfortunately, this specific data cannot be shared publicly.

The R session is restarted and all global variables cleared between each test, with minimal additional applications running in parallel. The R scripts are executed within RStudio, which does create some processor overhead. A more rigorous comparison would execute each code snippet several times and average the results, as sketched below, but that is most likely overkill for this simple comparison.
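As a rough sketch of that more rigorous approach, base R can time several runs and average the elapsed figures; the five repetitions here are an arbitrary choice:

# Time several runs and average the elapsed figures. Five repetitions
# is an arbitrary choice for illustration.
elapsed <- replicate(5, system.time(read.csv("FBDatabase.csv"))["elapsed"])
mean(elapsed)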

Import Entire File

Three methods are used to load the entire data structure into R: read.csv, fread and an R SQL query. The SQL query assumes the original CSV has already been saved as an SQLite database, as discussed in a previous article.
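For completeness, a minimal sketch of that one-off conversion might look like the snippet below; the file and table names simply match the queries used later in this article, and the previous article covers the process in detail.

require("RSQLite")

# One-off conversion of the CSV into an SQLite database. The file and
# table names match the queries used later in this article.
con <- dbConnect(SQLite(), "formbet.sqlite")
csv <- read.csv("FBDatabase.csv", stringsAsFactors=FALSE)
dbWriteTable(con, "formbet_historic", csv)
dbDisconnect(con)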

read.csv

system.time(read.csv(file="FBDatabase.csv", sep=",", header=TRUE, stringsAsFactors=FALSE))
   user  system elapsed 
 16.156   0.076  16.226 

fread

require("data.table")

system.time(fread("FBDatabase.csv", sep = ',', header = TRUE, stringsAsFactors=FALSE))
   user  system elapsed 
  2.468   0.020   2.486

R SQL Query

require("RSQLite")

con <- dbConnect(SQLite(), "formbet.sqlite")

sql1 <- "SELECT * FROM formbet_historic"

system.time(results <- dbGetQuery(con, sql1))

dbDisconnect(con)
   user  system elapsed 
  7.704   0.120   7.813 

Clearly, fread is significantly quicker than both alternatives: roughly three times faster than the R SQL query and more than six times faster than read.csv. In turn, the R SQL query is around twice as fast as read.csv.

Import Specific Fields

These tests involve just importing eight specific fields from the dataset:

Date
Horse
Trainer
Jockey
Course
Time
Win
BSP

These fields represent what may be a valid set of columns if one wished to calculate, for example, a trainer or jockey strike rate at a specific track.

read.csv

It is possible to read only specific columns using the colClasses argument of read.csv, setting "NULL" for every column to be skipped. Unfortunately, the fields cannot be selected by name; their numerical positions within the CSV file must be known. With 267 columns to account for, read.csv is therefore not practical for importing only specific columns from a large CSV file.
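As a sketch of what that would involve, all 267 column classes must be supplied, with "NULL" marking each skipped column; the positions used here are purely illustrative, not the real locations of the eight fields.

# Skip columns by position: "NULL" drops a column, NA keeps it with
# the default type conversion. The eight positions below are
# illustrative only.
classes <- rep("NULL", 267)
classes[c(1, 5, 12, 20, 33, 40, 150, 200)] <- NA
partial <- read.csv("FBDatabase.csv", colClasses=classes, stringsAsFactors=FALSE)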

fread

require("data.table")

system.time(originalcsv <- fread("FBDatabase.csv", sep = ',', header = TRUE, stringsAsFactors=FALSE, select=c("Date", "Horse", "Trainer", "Jockey", "Course", "Time", "Win", "BSP")))
   user  system elapsed 
  0.324   0.008   0.330 

R SQL Query

require("RSQLite")

con <- dbConnect(SQLite(), "formbet.sqlite")

sql1 <- "SELECT Date, Horse, Trainer, Jockey, Course, Time, Win, BSP FROM formbet_historic"

system.time(results <- dbGetQuery(con, sql1))

dbDisconnect(con)
   user  system elapsed 
  0.376   0.032   0.407 

Again, fread returned results faster than the R SQL query, although the difference is now much smaller. Perhaps more interesting is that fread returned the eight requested columns roughly seven times faster than it imported the entire file.

When to use fread

fread is a very quick method of importing large CSV datasets, and far superior to the more usual read.csv. One-off tasks, such as importing sectional times from a single race, and repetitive tasks using fresh data on each occasion, such as returning selections or shortlisting horses from a daily ratings file, are very well suited to fread. Run with verbose = TRUE, fread also reports some interesting additional detail about how it reads the file.
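As an illustration of such a daily task, the snippet below shortlists horses from a hypothetical ratings file; the file name, column names and the rating threshold are all assumptions made for the example.

require("data.table")

# Hypothetical daily task: shortlist horses from a ratings file. The
# file name, column names and threshold are assumptions for
# illustration only.
ratings <- fread("ratings.csv", verbose=TRUE)
shortlist <- ratings[Rating >= 90, .(Date, Course, Time, Horse, Rating)]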

When to use an R SQL Query

If the requirements of returning data are slightly more complex, the R SQL query may be more practical. For example, returning the same eight fields, but this time between two specific dates:

require("RSQLite")

con <- dbConnect(SQLite(), "formbet.sqlite")

sql1 <- "SELECT Date, Horse, Trainer, Jockey, Course, Time, Win, BSP FROM formbet_historic WHERE Date BETWEEN '2014-02-01' AND '2014-03-31'"

system.time(results <- dbGetQuery(con, sql1))

dbDisconnect(con)
   user  system elapsed 
  0.112   0.036   0.146

The elapsed time in this case is much shorter, as only a small subset of the data is returned. fread offers no equivalent way to filter rows by value during the import; the closest alternative is to read the required columns and then subset the result in R, as sketched below.
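This sketch assumes, like the SQL query above, that Date is stored as a "YYYY-MM-DD" string, so plain string comparison works.

require("data.table")

# Read the eight columns, then filter by date in R. Date is assumed,
# as in the SQL query above, to be a "YYYY-MM-DD" string.
dt <- fread("FBDatabase.csv", select=c("Date", "Horse", "Trainer", "Jockey", "Course", "Time", "Win", "BSP"))
febmar <- dt[Date >= "2014-02-01" & Date <= "2014-03-31"]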

If the data is to be re-used over a longer period of time, with multiple analysis tasks taking place, the logic and structure of an SQL database may also have advantages.

It is worth noting that very complex SQL queries can make for slow results. For example, while it is certainly possible to calculate strike rates or financial returns with complex SQL queries from within R, it is not the fastest route. A better approach is to return the relevant fields via a simple SQL SELECT and perform the calculations in native R code.
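As a sketch of that approach, the snippet below pulls two raw fields with a simple SELECT and computes trainer strike rates in base R; it assumes the Win column holds 1 for a winner and 0 otherwise.

require("RSQLite")

# Pull the raw fields with a simple SELECT, then calculate trainer
# strike rates in R. Win is assumed to hold 1 for a winner and 0
# otherwise.
con <- dbConnect(SQLite(), "formbet.sqlite")
runs <- dbGetQuery(con, "SELECT Trainer, Win FROM formbet_historic")
dbDisconnect(con)

strike <- aggregate(Win ~ Trainer, data=runs, FUN=mean)
strike$Win <- round(100 * strike$Win, 1)
head(strike[order(-strike$Win), ])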

Conclusion

Both fread and an R SQL query are fast routes to importing large datasets into R, and both are far superior to read.csv. Which to use should be decided on a case-by-case basis.