R Data Frames - dplyr vs sqldf

In my previous post I looked at the speed of importing data using five different methods. The fastest of these was fread from the data.table package, which outperformed base R read.csv, base R readRDS, read_csv from Hadley Wickham's readr and also importing the same data from an SQLite database with the RSQLite package.

Today I've just learned about the sqldf package, which provides support for working directly with data frames, using SQL syntax. In my view, SQL syntax is really simple to learn and understand. Personally, it is much closer to parsing natural language, than working with some of R's native syntax.

sqldf leverages SQLite SELECT calls to manipulate data frames. I found a very interesting and quite descriptive article on R-bloggers, although initially published almost two years ago.

While SQL syntax may be easy enough to understand, how does it really hold up in terms of performance? Unfortunately, the short answer is, not very well.

My test data set is 231,710 observations of 54 variables, stored in a 15.1MB Rds file. The data is, as you might expect, horse racing related. Using dplyr as a reference, a simple subset of the column TRAINER was selected.

library("dplyr")

# Load the data file as a data frame
fbHistoricResultsSQL <- as.data.frame(readRDS("files/fbHistoricResults.rds"))

# Start the timer
ptm <- proc.time()

# Subset the data
trainerlist <- dplyr::select(fbHistoricResultsSQL, TRAINER)

# Stop the timer
proc.time() - ptm  

Results:

   user  system elapsed 
  0.000   0.000   0.003 

This time using sqldf to do the same thing:

library("sqldf")

# Load the data file as a data frame
fbHistoricResultsSQL <- as.data.frame(readRDS("files/fbHistoricResults.rds"))

# Start the timer
ptm <- proc.time()

# Subset the data
trainerlistsql <- sqldf("SELECT TRAINER FROM fbHistoricResultsSQL")

# Stop the timer
proc.time() - ptm  

I executed the sqldf version ten times, and this was the fastest return:

   user  system elapsed 
  2.781   0.008   2.785 

Not very encouraging, especially as dplyr is already my go to package for this type of work.

Next, something a little more complex. This time, just selecting one Trainer's records from the entire dataset.

library("dplyr")

# Load the data file as a data frame
fbHistoricResultsSQL <- as.data.frame(readRDS("files/fbHistoricResults.rds"))

# Start the timer
ptm <- proc.time()

# Subset the data for Jeremy Gask's runners only
trainerlist <- dplyr::select(fbHistoricResultsSQL, TRAINER)  
trainerlist <- dplyr::filter(trainerlist, grepl("J Gask", TRAINER))

# Stop the timer
proc.time() - ptm  

dplyr results for the slowest of ten runs:

   user  system elapsed 
  0.112   0.000   0.110 

And now sqldf doing the same thing:

library("sqldf")

# Load the data file as a data frame
fbHistoricResultsSQL <- as.data.frame(readRDS("files/fbHistoricResults.rds"))

# Start the timer
ptm <- proc.time()

# Subset the data for Jeremy Gask's runners only
trainerlistsql <- sqldf("SELECT TRAINER FROM fbHistoricResultsSQL WHERE TRAINER LIKE 'J GASK'")

# Stop the timer
proc.time() - ptm  

Results for the fastest of ten runs:

   user  system elapsed 
  2.716   0.000   2.717 

The results for dplyr and sqldf are slightly closer together for the more complex query, but it's obvious dplyr is faster. I don't think there's a need to dig too much deeper.

If you're new to R, you know SQL syntax and performance isn't very important, then sqldf has something to offer. It reduces the learning curve.

However, if you want faster code, use something like the very common and widely accepted dplyr package to manipulate data frames. Also consider the slightly more complex data.table package for potentially even better results on data tables.

sessionInfo()

R version 3.2.2 (2015-08-14)  
Platform: i486-pc-linux-gnu (32-bit)  
Running under: Debian GNU/Linux 8 (jessie)

locale:  
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:  
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:  
[1] sqldf_0.4-10  RSQLite_1.0.0 DBI_0.3.1     gsubfn_0.6-6  proto_0.3-10 

loaded via a namespace (and not attached):  
[1] tools_3.2.2  chron_2.3-47