Compare R vs Clojure CSV Import

Various reasons have led me to the world of Clojure. Learning a new programming language is, for me, not something to be undertaken lightly, and it took much research and hefty deliberation before I settled on Clojure. A few key reasons include:

  • I wanted to learn a functional language, and I thought my R experience would help.
  • I was keen for a language with decent web frameworks and there appear to be some good ClojureScript options.
  • I wanted a language with an active community and plenty of examples on StackOverflow. Clojure developers certainly are passionate!
  • I'm probably a Type B personality.

Enough background information. One key task I wish to use Clojure for is data analysis. The first step is therefore to import some CSV data into the current namespace. In R this is pretty straightforward, either with the base library functions or some of the contributed, faster libraries.

I've covered importing CSV data with R previously. As a quick recap, here are some simple benchmarks on a 35MB file of thoroughbred bloodstock data, with approximately 154,000 individual records.

library(data.table)      # provides fread
library(readr)           # provides read_csv
library(microbenchmark)

# Time a single run of each reader against the same file
microbenchmark(times = 1,
  freadCSV <- fread("bloodstockSalesData.csv", sep = ","),
  readrCSV <- read_csv("bloodstockSalesData.csv"),
  baseCSV  <- read.csv("bloodstockSalesData.csv", sep = ",", as.is = TRUE)
)

Simplified Output:
Units: milliseconds

  • fread: 519.2403
  • readr: 1073.6215
  • Base read.csv: 3991.0043

As expected, data.table's fread is quickest.

There is an interesting collection of tools in Clojure packaged as Incanter. This collection of libraries bills itself as a "Clojure-based, R-like platform for statistical computing and graphics." That seems like an obvious place to start with importing a decent-sized CSV.

This is not the post to go too deeply into the structure of Clojure and how to set everything up. Very briefly: I used Leiningen for dependency management and Atom as the editor, with the proto-repl plugin.

Each time the REPL is started, a Java Virtual Machine (JVM) is launched in the background, because Clojure compiles to JVM bytecode. Two timing tests were therefore run: the first includes launching the REPL, and the second was made with the REPL already running.

(use 'incanter.io)
(time
  (def incanter-csv-test (read-dataset "bloodstockSalesData.csv" :header true)))

;; First run as part of REPL start
Elapsed time: 23349.566369 msecs

;; Second execution with REPL already running
Elapsed time: 22193.087678 msecs  

The second run was fractionally quicker, but at over 22 seconds, loading this CSV with Incanter is more than five times slower than even base R's read.csv.

Is there a better way? After some research I found a couple of other libraries. Of these, clojure-csv seemed the better option, although its documentation is a little sparse (it assumes people know what they're doing!).

(require '[clojure-csv.core :as csv])
(time
  (def clojure-csv-test
      (csv/parse-csv
        (clojure.java.io/reader "bloodstockSalesData.csv"))))

;; First run as part of REPL start
Elapsed time: 4.214464 msecs

;; Second execution with REPL already running
Elapsed time: 0.627183 msecs  

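Wow! Impressively fast on both runs. Was the data really imported that quickly? Probably not all of it: as far as I can tell, parse-csv returns a lazy sequence, so these timings most likely cover setting the sequence up rather than parsing all 154,000 or so records. Checking the first row of clojure-csv-test does at least show that the data is reachable: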

(first clojure-csv-test)
=> ["" "Lot" "Name" "Foaled" "Sex" "Type" "Colour" "Sire" "Dam" "Consignor" "Stabling" "Purchaser" "coveringSire" "Catalogue" "Price" "Auctioneer" "Country" "Currency" "saleDate" "Sale"]
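Because laziness affects what a timing like this actually measures, here is a minimal sketch of the effect using only core Clojure. Everything in it is made up for illustration: sample-lines is an in-memory stand-in for the CSV file, parse-lines is a naive stand-in for a lazy parser (it is not clojure-csv and does not handle quoted fields), and rows-parsed is just a counter so we can see how much work has really been done.

```clojure
(require '[clojure.string :as str])

;; Hypothetical in-memory stand-in for the CSV file:
;; one header line plus 100,000 data lines.
(def sample-lines
  (cons "Lot,Name,Price"
        (map #(str % ",Horse" % "," (* 1000 %)) (range 1 100001))))

;; Counter recording how many lines have actually been parsed.
(def rows-parsed (atom 0))

;; Naive stand-in parser: splits each line on commas (no quoting support).
;; map is lazy, so calling this does no parsing by itself.
(defn parse-lines [lines]
  (map (fn [line]
         (swap! rows-parsed inc)
         (str/split line #","))
       lines))

;; Timing the def only measures building the lazy sequence.
(time (def lazy-rows (parse-lines sample-lines)))
(def parsed-after-def @rows-parsed)

;; Asking for the first row parses only the start of the data.
(def header (first lazy-rows))
(def parsed-after-first @rows-parsed)

;; doall walks the whole sequence, so this timing covers every line.
(reset! rows-parsed 0)
(time (def all-rows (doall (parse-lines sample-lines))))
(def parsed-after-doall @rows-parsed)

(println "Parsed after def:  " parsed-after-def)
(println "Parsed after first:" parsed-after-first)
(println "Parsed after doall:" parsed-after-doall)
```

If parse-csv behaves the same way, a fairer benchmark would presumably wrap the call in doall (ideally inside with-open, so the reader is closed afterwards) and time that instead. I have not re-run the numbers above on that basis.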

As a very early adventure in Clojure, it's pleasing to have gone beyond the obvious with Incanter and found a promising alternative. If that speed holds up once every row has actually been read, it bodes well for future steps of data analysis and visualisation.