Create an SQLite Database with R

Building an SQLite database using an R script is a reasonably straightforward process. However, depending on the dataset being used, it can be tedious and time-consuming to perform correctly.

Why build an SQL database? When dealing with large amounts of data, it is often quicker to run an SQL query that returns just the columns of interest than to load a massive spreadsheet or CSV file containing the entire dataset. In horse racing terms, this is particularly true when working with several years of past results.

To begin, a CSV file is preferred. While it is possible to read other files, such as Excel spreadsheets, directly into R, large files running to several megabytes can bog R down with memory issues. The example dataset used in this article is very simple and small in size; it is an aid to the demonstration rather than a real-world example.

The dataset contains the following columns of completely spurious data:

Date
Horse
OffTime
Course
Trainer
Jockey
Result
SP

The example CSV file and completed R script are available for download from the links provided at the end of this article.

The first step in the R script is to load the required SQLite library. If it is not installed, install it first (in RStudio, select Install Packages from the Tools menu).

require("RSQLite")

Next, set the working directory to where the CSV file is located. This is a useful step to avoid frustration later when nothing seems to work because the file to import cannot be found, or the output file is written to an unknown location.

setwd("~/set/directory/path/to/suit")
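
A quick existence check before reading can save time tracking down path problems. This short guard assumes the same file name used below:

# Stop early with a clear message if the CSV is not in the working directory
if (!file.exists("example.csv")) {
  stop("example.csv not found in: ", getwd())
}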

Load the CSV file:

csvfilename <- "example.csv"

originalcsv <- read.csv(file=csvfilename, sep=",", header=TRUE, fileEncoding="latin1", stringsAsFactors=FALSE)

It is not essential to store the CSV filename in a separate variable, but doing so can aid clarity when scripts become complex or multiple files are in use.

The read.csv options set the CSV separator to a comma (assuming this is true for the file being used), indicate that the file includes column headers, set the encoding of the file and, lastly, instruct R to read strings as characters rather than factors. This last part is important to include, otherwise changing data types and writing to the SQL database later will be more complex.
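
To illustrate why factors complicate things: if a numeric column has been read as a factor, as.numeric() returns the internal level codes rather than the original values, so an extra conversion through character is needed. This step is not needed here precisely because stringsAsFactors=FALSE was set, but as a sketch of the workaround:

# Only needed if a column was read as a factor: as.numeric() on a factor
# returns the internal level codes, so convert via character first
originalcsv$SP <- as.numeric(as.character(originalcsv$SP))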

In some cases there may be a need to rewrite column names. This will be the case if the original column names include items such as percentage symbols or currency symbols, or begin with numbers. If the CSV data has a large number of columns, this can become a chore. In the simple example used here this step is not required, but the task would be performed with the following line:

colnames(originalcsv) <- c("Date", "Horse", "RaceTime", "Course", "Trainer", "Jockey", "Result", "StartPrice")

OffTime has been changed to RaceTime and SP changed to StartPrice.

There are other methods within R to rename individual columns one at a time, but sometimes it works best to simply define all column names to ensure clarity.
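
For example, a single column can be renamed by matching its existing name:

# Rename single columns by matching their existing names
colnames(originalcsv)[colnames(originalcsv) == "OffTime"] <- "RaceTime"
colnames(originalcsv)[colnames(originalcsv) == "SP"] <- "StartPrice"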

Remember that head(originalcsv) can always be used to view the current data format.

The next step is very important to ensure that subsequent queries of the resulting database return the expected results. Sometimes R reads CSV fields in as unexpected types; the most common case is a numeric field stored as a character. There are many reasons this can happen, and diagnosing the root cause can be difficult. However, to ensure that any mathematical calculations carried out on the data once extracted from the database work as expected, setting the relevant data types up front is good practice. SQLite is also very particular about how dates are formatted, so the Date field is specifically converted to plain text prior to writing to the database.

To view the data types of each column within the R dataframe, use the str(originalcsv) command.

originalcsv$Date <- as.Date(originalcsv$Date, format="%d/%m/%Y")  # parse the dd/mm/yyyy dates
originalcsv$Date <- as.character(originalcsv$Date)  # store as ISO yyyy-mm-dd text for SQLite
originalcsv$StartPrice <- as.numeric(originalcsv$StartPrice)  # ensure prices are numeric
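
As a guard against silent failures, it is worth remembering that as.Date() and as.numeric() return NA for values they cannot parse. A quick check:

# Warn if any values failed to convert (they become NA)
if (any(is.na(originalcsv$Date)) || any(is.na(originalcsv$StartPrice))) {
  warning("Some Date or StartPrice values failed to convert; check the input data")
}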

Now, write the data to the new database. If the database does not already exist, one will be created automatically. If a database of the same name does already exist, R will attempt to append the new rows to it. The Date column was already converted to text above, so the dataframe can be written directly. The new database is called example.sqlite and the single table within it is example_data.

con <- dbConnect(SQLite(), "example.sqlite")

dbWriteTable(con, name="example_data", value=originalcsv, row.names=FALSE, append=TRUE)

dbDisconnect(con)
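
As a first confirmation that the write succeeded, RSQLite can list the tables held in the file:

# Reconnect briefly and list the tables in the new database
con <- dbConnect(SQLite(), "example.sqlite")
dbListTables(con)  # should include "example_data"
dbDisconnect(con)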

The database is now created. However, it is worthwhile to run a sanity check to ensure everything worked as expected.

Open a new database connection and perform a query to return all data from the Horse column.

con <- dbConnect(SQLite(), "example.sqlite")

sql1 <- "SELECT example_data.Horse FROM example_data"

results <- dbGetQuery(con, sql1) 

dbDisconnect(con)

Now, count the rows from the returned database query and also the rows from the original CSV. If all is well, the two numbers should match.

dbrowcount <- nrow(results)  # rows returned from the database

csvrowcount <- nrow(originalcsv)

dbrowcount
csvrowcount
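
Alternatively, the counting can be pushed into SQLite itself, which avoids pulling an entire column back into R just to count it. A brief sketch:

# Count the rows inside SQLite rather than in R
con <- dbConnect(SQLite(), "example.sqlite")
dbGetQuery(con, "SELECT COUNT(*) AS rowcount FROM example_data")
dbDisconnect(con)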

This very simple script has now created a working SQLite database, which can be used for further queries at potentially quicker speeds than working with large CSV files directly within R.
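
As an illustration of such a query, the sketch below returns two columns for a single course; the course name here is purely a placeholder, not a value from the example data:

# A narrower query: two columns, filtered using a parameterised placeholder
con <- dbConnect(SQLite(), "example.sqlite")
sql2 <- "SELECT Horse, StartPrice FROM example_data WHERE Course = ?"
dbGetQuery(con, sql2, params = list("SomeCourse"))
dbDisconnect(con)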

The complete R code and example.csv file used in this article are available on GitHub.