If Marie Kondo did SQL…

…she’d be using dbplyr

In a number of previous posts I have mentioned the dbplyr package as a really cool way to handle SQL databases in R. dbplyr is a very clever SQL translator which is getting more and more powerful with every update. It allows you to treat your database table like it is an R dataframe object and to manipulate it using simple, clean tidyverse commands. It’s so much nicer and more intuitive that handling all those messy SQL strings in your R console.

I thought I would use a quick example to illustrate how dbplyr works, and I hope you can pick this example up and play further with it. Before you know it, you’ll never want to write in SQL again!

Using the chinook database

chinook is a SQLite database which can be downloaded here, and it’s really easy to use for practice. Once you’ve unzipped it to a folder, get your R session open and we will use the following packages:

library(tidyverse)
library(dbplyr)
library(DBI)
library(RSQLite)

First we will establish the connection with the SQLite database. Although this database is simply a locally saved file, you would use a similar connection into your Oracle or MSSQL databases or whatever.

chinook <- DBI::dbConnect(
drv = RSQLite::SQLite(),
dbname = "chinook.db"
)

We now have an active connection into the chinook database. We can browse the tables in there using RSQLite::dbListTables(chinook) which tells that there are the following tables inside:

"albums"          "artists"         "customers"       "employees"       "genres"         "invoice_items"   "invoices"        "media_types"     "playlist_track"  "playlists"      "sqlite_sequence" "sqlite_stat1"    "tracks"

We are going to play around with a few of these tables using dbplyr — in particular the following:

  • employees — details of company employees
  • customers — table of customer details
  • invoices — table of invoice details attached to customer IDs
  • invoice_items — table of details of units ordered and prices in each invoice ID

We can easily set these tables up as database objects using dplyr as follows:

employees <- dplyr::tbl(chinook, "employees")
customers <- dplyr::tbl(chinook, "customers")
invoices <- dplyr::tbl(chinook, "invoices")
invoice_items <- dplyr::tbl(chinook, "invoice_items")

Each of these tables are now SQL database objects in your R session which you can manipulate in the same way as a dataframe. Note that you can see an extract of the data in each table by simply typing the table name in your R console:

Querying using dbplyr

Now let’s try to do a really simple query, like how many employees do we have grouped by the year we hired them. If we were to write this as a straight SQL query, it would look like this:

qry <- "SELECT HireYear, COUNT(EmployeeId) FROM
(SELECT SUBSTR(HireDate, 1, 4) AS HireYear, EmployeeId
FROM employees)
GROUP BY HireYear"

Now if we run DBI::dbGetQuery(chinook, qry) we get this returned:

   HireYear count(EmployeeId)
1 2002 3
2 2003 3
3 2004 2

But we can also do this using nice tidyverse code as follows:

employees_by_hire_year <- employees %>% 
dplyr::mutate(HireYear = substr(HireDate, 1, 4)) %>%
dplyr::group_by(HireYear) %>%
summarize(Total = n())

Since this is a very small table, you can see it by simply typing its name in the console and it does indeed return the same table as the one just above. If it was larger, you’d have to collect it from the database — more on that later.

How does this magic seem to happen? Well, because employees is recognized by R as a SQL database object, R invokes dbplyr in the background to translate the piped tidyverse code into SQL. You can see the translated SQL using the dbplyr::sql_render() function. So for example dbplyr::sql_render(employees_by_hire_year) returns this:

<SQL> SELECT `HireYear`, COUNT() AS `Total`
FROM (SELECT `EmployeeId`, `LastName`, `FirstName`, `Title`, `ReportsTo`, `BirthDate`, `HireDate`, `Address`, `City`, `State`, `Country`, `PostalCode`, `Phone`, `Fax`, `Email`, SUBSTR(`HireDate`, 1, 4) AS `HireYear`
FROM `employees`)
GROUP BY `HireYear`

Making complicated queries simple

The example above is pretty simple, and it’s not until your queries become quite complex that you really see the benefits of using dbplyr.

Let’s say that you want to work out how much customers spent per year on their orders. This means that you will have to extract the year from the Invoice Date, then join invoice to invoice_items using InvoiceId, then group by year and then calculate the total unit price times quantity ordered in that year. Thanks to dbplyr, we can just use our beautiful tidyverse grammar to do this:

amount_per_year <- invoices %>% 
dplyr::mutate(InvoiceYear = substr(InvoiceDate, 1, 4) %>%
as.integer()) %>%
dplyr::inner_join(invoice_items, by = "InvoiceId") %>%
dplyr::mutate(ItemTotal = UnitPrice * Quantity) %>%
dplyr::group_by(InvoiceYear) %>%
dplyr::summarise(Total = sum(ItemTotal))

This is so much nicer and easier to work with than its SQL equivalent. Note that in its current form amount_per_year is still only a database object and the data has not yet been formally extracted from the database. To formally extract the data you can use dplyr‘s collect() function, as follows:

amount_per_year %>% dplyr::collect()

which will return:

# A tibble: 5 x 2
InvoiceYear Total
<int> <dbl>
1 2009 449.
2 2010 481.
3 2011 470.
4 2012 478.
5 2013 451.

If you were working on a remote database as opposed to a local SQLite database, the added advantage is that dbplyr ensures that all the manipulation is done on the database and you simply collect the results, rather than having to draw all that data down into your session.

Practice exercises

See if you can get dbplyr working for you on the chinook dataset by trying these practice exercises:

  1. How many customers are there by Country?
  2. How many invoices were issued in the first quarter of 2009, grouped by the country of the customer?
  3. What were the names of the top ten tracks sold to customers in the USA? (You’ll also need the tracks table for this one).

Appendix

Here’s the SQL translation of that last dbplyr query I wrote:

Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter.