…she’d be using dbplyr
In a number of previous posts I have mentioned the
dbplyr package as a really cool way to handle SQL databases in R.
dbplyr is a very clever SQL translator which is getting more and more powerful with every update. It allows you to treat your database table like it is an R dataframe object and to manipulate it using simple, clean tidyverse commands. It’s so much nicer and more intuitive that handling all those messy SQL strings in your R console.
I thought I would use a quick example to illustrate how
dbplyr works, and I hope you can pick this example up and play further with it. Before you know it, you’ll never want to write in SQL again!
Using the chinook database
chinook is a SQLite database which can be downloaded here, and it’s really easy to use for practice. Once you’ve unzipped it to a folder, get your R session open and we will use the following packages:
First we will establish the connection with the SQLite database. Although this database is simply a locally saved file, you would use a similar connection into your Oracle or MSSQL databases or whatever.
chinook <- DBI::dbConnect(
drv = RSQLite::SQLite(),
dbname = "chinook.db"
We now have an active connection into the
chinook database. We can browse the tables in there using
RSQLite::dbListTables(chinook) which tells that there are the following tables inside:
"albums" "artists" "customers" "employees" "genres" "invoice_items" "invoices" "media_types" "playlist_track" "playlists" "sqlite_sequence" "sqlite_stat1" "tracks"
We are going to play around with a few of these tables using
dbplyr — in particular the following:
employees— details of company employees
customers— table of customer details
invoices— table of invoice details attached to customer IDs
invoice_items— table of details of units ordered and prices in each invoice ID
We can easily set these tables up as database objects using
dplyr as follows:
employees <- dplyr::tbl(chinook, "employees")
customers <- dplyr::tbl(chinook, "customers")
invoices <- dplyr::tbl(chinook, "invoices")
invoice_items <- dplyr::tbl(chinook, "invoice_items")
Each of these tables are now SQL database objects in your R session which you can manipulate in the same way as a dataframe. Note that you can see an extract of the data in each table by simply typing the table name in your R console:
Querying using dbplyr
Now let’s try to do a really simple query, like how many employees do we have grouped by the year we hired them. If we were to write this as a straight SQL query, it would look like this:
qry <- "SELECT HireYear, COUNT(EmployeeId) FROM
(SELECT SUBSTR(HireDate, 1, 4) AS HireYear, EmployeeId
GROUP BY HireYear"
Now if we run
DBI::dbGetQuery(chinook, qry) we get this returned:
1 2002 3
2 2003 3
3 2004 2
But we can also do this using nice tidyverse code as follows:
employees_by_hire_year <- employees %>%
dplyr::mutate(HireYear = substr(HireDate, 1, 4)) %>%
summarize(Total = n())
Since this is a very small table, you can see it by simply typing its name in the console and it does indeed return the same table as the one just above. If it was larger, you’d have to collect it from the database — more on that later.
How does this magic seem to happen? Well, because
employees is recognized by R as a SQL database object, R invokes
dbplyr in the background to translate the piped tidyverse code into SQL. You can see the translated SQL using the
dbplyr::sql_render() function. So for example
dbplyr::sql_render(employees_by_hire_year) returns this:
<SQL> SELECT `HireYear`, COUNT() AS `Total`
FROM (SELECT `EmployeeId`, `LastName`, `FirstName`, `Title`, `ReportsTo`, `BirthDate`, `HireDate`, `Address`, `City`, `State`, `Country`, `PostalCode`, `Phone`, `Fax`, `Email`, SUBSTR(`HireDate`, 1, 4) AS `HireYear`
GROUP BY `HireYear`
Making complicated queries simple
The example above is pretty simple, and it’s not until your queries become quite complex that you really see the benefits of using
Let’s say that you want to work out how much customers spent per year on their orders. This means that you will have to extract the year from the Invoice Date, then join
InvoiceId, then group by year and then calculate the total unit price times quantity ordered in that year. Thanks to
dbplyr, we can just use our beautiful tidyverse grammar to do this:
amount_per_year <- invoices %>%
dplyr::mutate(InvoiceYear = substr(InvoiceDate, 1, 4) %>%
dplyr::inner_join(invoice_items, by = "InvoiceId") %>%
dplyr::mutate(ItemTotal = UnitPrice * Quantity) %>%
dplyr::summarise(Total = sum(ItemTotal))
This is so much nicer and easier to work with than its SQL equivalent. Note that in its current form
amount_per_year is still only a database object and the data has not yet been formally extracted from the database. To formally extract the data you can use
collect() function, as follows:
amount_per_year %>% dplyr::collect()
which will return:
# A tibble: 5 x 2
1 2009 449.
2 2010 481.
3 2011 470.
4 2012 478.
5 2013 451.
If you were working on a remote database as opposed to a local SQLite database, the added advantage is that
dbplyr ensures that all the manipulation is done on the database and you simply collect the results, rather than having to draw all that data down into your session.
See if you can get
dbplyr working for you on the
chinook dataset by trying these practice exercises:
- How many customers are there by Country?
- How many invoices were issued in the first quarter of 2009, grouped by the country of the customer?
- What were the names of the top ten tracks sold to customers in the USA? (You’ll also need the
trackstable for this one).
Here’s the SQL translation of that last
dbplyr query I wrote:
Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter.