Faster Import with R Redux

Some time ago I wrote a post comparing the speed of importing datasets based on their format. In that post I compared read.csv and fread, from the data.table package, for importing CSV files, as well as an SQL query from R to read the same data from an SQLite database.

The short conclusion from that earlier post was that fread is always much quicker than read.csv and also quicker than returning data from the SQLite database. The SQL query route was almost as quick as fread if the query was structured to return specific fields, rather than just SELECT * from db. The most interesting point was that importing the entire 52MB test CSV file with fread was quicker than subsetting specific columns first and then importing.

Enough of the past. Recently, I've once again had cause to investigate data loading times across different file types. In this case, the files were a CSV, SQLite database and also an Rds file.

An Rds file is an R-specific file format. A single object saved to a binary Rds file is serialised, retaining all of its structure and attributes. This avoids the stringsAsFactors = FALSE dance with CSVs, as well as various awkward SQLite data type issues (dates especially). The other great thing about Rds files is that they are compressed by default, which drastically reduces file sizes when compared to the other two formats. The downside is that the file then has to be decompressed when it is read back into R.
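To make that concrete, here's a minimal sketch of the Rds round trip; the data frame and file name are purely illustrative:

# A throwaway data frame, purely for illustration
df <- data.frame(
  lot   = 1:3,
  price = c(10000, 25000, 4000),
  sold  = as.Date(c("2015-10-01", "2015-10-01", "2015-10-02"))
)

saveRDS(df, "example.rds")      # compressed by default
df2 <- readRDS("example.rds")   # classes and column types come back intact

identical(df, df2)              # TRUE: the object is exactly as saved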

While creating the pinhooker R package, I was working with all three file types, each of which the package can create. With the full 152,000 rows of data, across 19 columns, imported, the approximate file sizes were as follows:

  • CSV: 34.9MB
  • SQLite: 35MB
  • Rds: 4MB

The Rds file is smaller by nearly a factor of ten. That's quite a size saving.
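For anyone wanting to check sizes like these from within R rather than a file manager, something along these lines would do (a quick sketch, assuming all three files sit in the working directory):

# Approximate on-disk sizes of the three files, in MB
files <- c("bloodstockSalesData.csv",
           "bloodstockSalesData.sqlite",
           "bloodstockSalesData.rds")
round(file.size(files) / 1024^2, 1)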

This is the main reason I chose Rds as the default file type produced by the pinhooker package, rather than a more common CSV. The pinhooker dataset will continue to grow, and therefore so will the file sizes. I'd much rather have an installable R package of around 4MB than require a download in excess of 35MB.

Nonetheless, it's still worth looking at comparative file loading times. Let's start with the CSV and the base R read.csv:

# Start the clock!
ptm <- proc.time()

# Load CSV file
bloodstockSalesData <- read.csv("bloodstockSalesData.csv", sep=",", as.is = TRUE)

# Stop the clock
proc.time() - ptm

user  system elapsed  
4.596   0.032   4.869  

Not terribly slow, but still a noticeable lag while waiting for the code to complete.
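As an aside, the same measurement can be taken a little more compactly with system.time(), which wraps proc.time() for you; a quick sketch:

# Equivalent timing of a single expression
system.time(
  bloodstockSalesData <- read.csv("bloodstockSalesData.csv",
                                  sep = ",", as.is = TRUE)
)

The explicit proc.time() bracketing used throughout this post has the advantage of also capturing the library() calls in the timings.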

Next, fread from the data.table package:

# Start the clock!
ptm <- proc.time()

# Load necessary libraries
library(data.table)

# Load CSV file
bloodstockSalesData <- fread("bloodstockSalesData.csv", sep=",")

# Stop the clock
proc.time() - ptm

user  system elapsed  
0.840   0.028   0.863  

Blisteringly quick, and almost half of that time is accounted for by loading the library. A second load of the CSV, with the data.table library already loaded, results in:

user  system elapsed  
0.520   0.008   0.525  
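Incidentally, fread also has a select argument to restrict the import to named columns, much like the targeted SQL query from the earlier post; the column names below are hypothetical:

# Load necessary libraries
library(data.table)

# Import only the columns of interest (hypothetical names)
bloodstockSalesData <- fread("bloodstockSalesData.csv", sep = ",",
                             select = c("lot", "price"))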

Now an SQL query to return data from the SQLite database:

# Start the clock!
ptm <- proc.time()

# Load necessary libraries
library(RSQLite)

# Load SQL data
con <- dbConnect(SQLite(), "bloodstockSalesData.sqlite")  
sql1 <- "SELECT * FROM bloodstockSalesData"
bloodstockSalesData <- dbGetQuery(con, sql1)  
dbDisconnect(con)

# Stop the clock
proc.time() - ptm

user  system elapsed  
2.284   0.320   8.865  

Substantially slower than even read.csv. Is there a significant difference if the RSQLite library is already loaded?

user  system elapsed  
0.932   0.040   0.968  

Most certainly there is. Still not as fast as fread, but once the RSQLite library has been loaded, any subsequent queries return data quite quickly.
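For example, once the library is loaded, a single connection can be reused across several queries, and returning only the fields of interest keeps each one light; the field names below are hypothetical:

# Load necessary libraries
library(RSQLite)

# Reuse one connection for multiple, more targeted queries
con <- dbConnect(SQLite(), "bloodstockSalesData.sqlite")
sales2015 <- dbGetQuery(con, "SELECT lot, price FROM bloodstockSalesData WHERE saleYear = 2015")
sales2014 <- dbGetQuery(con, "SELECT lot, price FROM bloodstockSalesData WHERE saleYear = 2014")
dbDisconnect(con)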

Let's look at loading an Rds file.

# Start the clock!
ptm <- proc.time()

# Load Rds data
bloodstockSalesData <- readRDS("bloodstockSalesData.rds")

# Stop the clock
proc.time() - ptm

user  system elapsed  
1.324   0.020   1.409  

Not quite as fast as fread, or as an SQL query with the library already loaded, but much quicker than read.csv. Although readRDS doesn't match the outright speed of fread, the vastly reduced file size is certainly a very fair trade-off.

Hadley Wickham's readr package is also worth mentioning, as it supports fast reading of both CSVs and Rds files. For the sake of completeness, its read_csv and read_rds functions were also tested.

readr's read_csv:

# Start the clock!
ptm <- proc.time()

# Load necessary libraries
library(readr)

# Load CSV data
bloodstockSalesData <- read_csv("bloodstockSalesData.csv")

# Stop the clock
proc.time() - ptm

user  system elapsed  
1.456   0.020   1.474 

And again, with library already loaded:

user  system elapsed  
0.944   0.012   0.948  

This time read_rds:

# Start the clock!
ptm <- proc.time()

# Load necessary libraries
library(readr)

# Load Rds data
bloodstockSalesData <- read_rds("bloodstockSalesData.rds")

# Stop the clock
proc.time() - ptm

user  system elapsed  
1.828   0.024   1.848 

And lastly, with readr already loaded:

user  system elapsed  
1.396   0.004   1.398  

Unfortunately, the readr package doesn't seem to offer an advantage in raw loading speed for this sample set of files, although it does have other nice aspects, such as the warnings it returns when CSV parsing fails.
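For example, any rows that fail to parse can then be inspected with readr's problems() function; a minimal sketch:

# Load necessary libraries
library(readr)

# Read the CSV, then inspect any parsing failures
bloodstockSalesData <- read_csv("bloodstockSalesData.csv")
problems(bloodstockSalesData)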

It may also be worth averaging times across multiple loads of each file; there are small run-to-run variations, though not large enough to change the overall outcome in this scenario.
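A simple sketch of how that could be done, repeating the readRDS load five times and averaging the elapsed times:

# Average elapsed time over five repeated loads
elapsed <- replicate(5, system.time(
  readRDS("bloodstockSalesData.rds")
)["elapsed"])
mean(elapsed)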

In conclusion, fread still wins the outright speed sprint when it comes to loading CSV data in R. However, with its radically reduced file size, exact preservation of the saved object, and still very quick loading times, the Rds file is an excellent overall choice.

sessionInfo()

R version 3.2.3 (2015-12-10)  
Platform: x86_64-pc-linux-gnu (64-bit)  
Running under: Debian GNU/Linux stretch/sid

locale:  
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:  
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:  
[1] RSQLite_1.0.0    DBI_0.3.1        readr_0.2.2      data.table_1.9.6

loaded via a namespace (and not attached):  
[1] tools_3.2.3  Rcpp_0.12.3  chron_2.3-47