Lesser Known R Features

A small collection of various lesser known R tricks and features.

built in constants

R has a small number of built-in numeric constants, including Inf and pi. But there are also several useful character vectors of commonly used names and abbreviations.

Letters:

letters

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

Month names:

month.name

##  [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"   
##  [9] "September" "October"   "November"  "December"
month.abb

##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Information about the United States:

state.name

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"       "California"     "Colorado"      
##  [7] "Connecticut"    "Delaware"       "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"         "Kentucky"       "Louisiana"     
## [19] "Maine"          "Maryland"       "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"         "New Hampshire"  "New Jersey"    
## [31] "New Mexico"     "New York"       "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina" "South Dakota"   "Tennessee"     
## [43] "Texas"          "Utah"           "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
state.abb

##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA"
## [22] "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
## [43] "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"

Also available: state.region, state.division, state.area and state.center.
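As a small sketch of how these vectors fit together: they are parallel, so element i of each one describes the same state. The names by_region and largest below are only illustrative.

```r
# The state.* vectors are parallel: element i of each describes the same state.
by_region <- split(state.abb, state.region)   # abbreviations grouped by region
lengths(by_region)                            # number of states in each region

# Name of the state with the largest area
largest <- state.name[which.max(state.area)]
largest
```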

initiating a matrix

Creating a placeholder matrix that later gets filled in is a recurring task. Below are several ways to create differently pre-filled 3×3 matrices.

matrix(1, 3, 3)

##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1
## [3,]    1    1    1
mat.or.vec(3, 3)

##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    0
diag(3)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
.col(c(3,3))

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    2    3
## [3,]    1    2    3
.row(c(3,3))

##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2
## [3,]    3    3    3
1:3 %o% 1:3

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    4    6
## [3,]    3    6    9
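One more variant that may be worth considering is a placeholder filled with NA, so that cells which were never written stay visibly missing. The sketch below is only an illustration; the name res, the dimnames and the fill rule are made up.

```r
# Placeholder filled with NA, so cells never written to remain visibly missing
res <- matrix(NA_real_, nrow = 3, ncol = 3,
              dimnames = list(paste0("r", 1:3), paste0("c", 1:3)))

# Fill the cells one by one (the fill rule is arbitrary)
for (i in 1:3) {
  for (j in 1:3) {
    res[i, j] <- i * 10 + j
  }
}
res
```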

matrix element names

Elements inside a matrix can have their own names.

Start with a simple matrix that has column names and row names:

x <- matrix(1:9, nrow=3, dimnames=list(c("X1", "X2", "X3"), c("Y1", "Y2", "Y3")))
x

##    Y1 Y2 Y3
## X1  1  4  7
## X2  2  5  8
## X3  3  6  9

Selecting rows and columns is done using the standard subsetting notation:

x["X1",]

## Y1 Y2 Y3 
##  1  4  7
x[,"Y1"]

## X1 X2 X3 
##  1  2  3

But each element inside the matrix can have its own name too:

names(x) <- paste0("e", 1:9)
x

##    Y1 Y2 Y3
## X1  1  4  7
## X2  2  5  8
## X3  3  6  9
## attr(,"names")
## [1] "e1" "e2" "e3" "e4" "e5" "e6" "e7" "e8" "e9"

And then names can be used for selecting elements from the matrix.

x["e3"]

## e3 
##  3
x[c("e1", "e5", "e9")]

## e1 e5 e9 
##  1  5  9

array index format

Indices from a matrix can be obtained in a <row, column> form.

Starting with a simple matrix:

x <- matrix(1:6, nrow=2)
x

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Indices of elements higher than 3 can be obtained in a standard way:

which(x > 3)

## [1] 4 5 6

However, there is another possible format: a two-dimensional table of rows and columns.

which(x > 3, arr.ind=TRUE)

##      row col
## [1,]   2   2
## [2,]   1   3
## [3,]   2   3

And this special format can also be used to select elements from a matrix.

inds <- rbind(c(1,2), c(2,1))
inds

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    1
x[inds]

## [1] 3 2
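The relation between the two index formats can be sketched with arrayInd(), which converts linear indices into <row, column> pairs for given dimensions. The variable names lin, rc and back are only illustrative.

```r
x <- matrix(1:6, nrow = 2)

lin <- which(x > 3)              # linear (column-major) indices
rc  <- arrayInd(lin, dim(x))     # the same positions as <row, column> pairs

# Back from <row, column> pairs to linear indices
back <- (rc[, 2] - 1) * nrow(x) + rc[, 1]

identical(x[rc], x[lin])         # both formats select the same elements
```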

This can be especially useful in scenarios like converting graph representations between adjacency matrix and adjacency list of vertices.

With a weighted graph:

Am <- matrix(round(runif(25)), nrow=5) * round(rnorm(25), 2)
Am

##      [,1]  [,2]  [,3]  [,4] [,5]
## [1,] 1.03  0.00 -1.33  0.30    0
## [2,] 0.00  0.00 -1.05 -0.93    0
## [3,] 0.00  0.00  0.00  0.00    0
## [4,] 0.01  0.87  0.00  0.00    0
## [5,] 0.00 -0.01  0.00 -0.47    0

A list of all edges:

Al <- cbind(which(Am!=0, arr.ind=TRUE), w=Am[Am!=0])
Al

##       row col     w
##  [1,]   1   1  1.03
##  [2,]   4   1  0.01
##  [3,]   4   2  0.87
##  [4,]   5   2 -0.01
##  [5,]   1   3 -1.33
##  [6,]   2   3 -1.05
##  [7,]   1   4  0.30
##  [8,]   2   4 -0.93
##  [9,]   5   4 -0.47

Returning to matrix again:

Am <- matrix(0, nrow=5, ncol=5)
Am[Al[,1:2]] <- Al[,3]
Am

##      [,1]  [,2]  [,3]  [,4] [,5]
## [1,] 1.03  0.00 -1.33  0.30    0
## [2,] 0.00  0.00 -1.05 -0.93    0
## [3,] 0.00  0.00  0.00  0.00    0
## [4,] 0.01  0.87  0.00  0.00    0
## [5,] 0.00 -0.01  0.00 -0.47    0

elements in a nested list

A vector of indices can be used to select nested elements of a list.

A small list for demonstration:

a <- list(list(2,3), list(4,5))
a

## [[1]]
## [[1]][[1]]
## [1] 2
## 
## [[1]][[2]]
## [1] 3
## 
## 
## [[2]]
## [[2]][[1]]
## [1] 4
## 
## [[2]][[2]]
## [1] 5

Selecting the second element from the second list can be done by using a single subset operation:

a[[c(2,2)]]

## [1] 5
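As a brief sketch, recursive indexing behaves like a chain of [[ ]] calls, and it accepts names as well as positions. The list cfg is a made-up example.

```r
a <- list(list(2, 3), list(4, 5))

# Recursive indexing is shorthand for chained [[ ]] calls
identical(a[[c(2, 2)]], a[[2]][[2]])

# It also works with names; 'cfg' is a made-up list for illustration
cfg <- list(plot = list(col = "red", lwd = 2))
cfg[[c("plot", "lwd")]]
```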

matrix of lists

A matrix can contain elements of various classes.

For example, here is a matrix of lists:

matrix(c(list(1), list(2), list('a'), list(c('b','c'))), ncol=2)

##      [,1] [,2]       
## [1,] 1    "a"        
## [2,] 2    Character,2

Or a matrix of data.frames:

mat <- matrix(list(iris, mtcars, USArrests, chickwts), ncol=2)
mat

##      [,1]    [,2]  
## [1,] List,5  List,4
## [2,] List,11 List,2

Let’s see the first 6 rows of the last data.frame within this matrix:

head(mat[[2,2]])

##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
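A short sketch of how subsetting behaves on such a matrix: single brackets return a list, double brackets return the stored object. The names cell, value and n_rows are only illustrative.

```r
mat <- matrix(list(iris, mtcars, USArrests, chickwts), ncol = 2)

cell  <- mat[2, 2]     # single bracket: still a list of length 1
value <- mat[[2, 2]]   # double bracket: the data.frame itself

# Apply a function to every cell and restore the matrix shape
n_rows <- sapply(mat, nrow)
dim(n_rows) <- dim(mat)
n_rows
```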

means of rows and columns

Taking means of rows or columns of a matrix is an often repeated operation:

mat <- matrix(round(rnorm(50), 2), ncol=10)
mat

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
## [1,] -2.21 -1.10  0.24 -0.91 -1.30 -0.15 -0.89  2.53  0.93  0.62
## [2,]  1.02  1.25 -1.51 -1.01  0.03  0.10  0.52  0.35 -0.68  1.04
## [3,]  0.90  0.73  2.16  0.52 -0.28 -0.36  0.55 -0.52 -1.52 -0.65
## [4,]  1.30 -0.56  0.28  0.63  0.81  0.29 -0.02 -0.17 -2.38 -0.93
## [5,]  0.23 -0.55 -0.30  0.06  0.02 -0.78  0.25 -1.31 -1.76  0.76
colMeans(mat)

##  [1]  0.248 -0.046  0.174 -0.142 -0.144 -0.180  0.082  0.176 -1.082  0.168

But R also has handy functions for repeating this operation on a flattened matrix, given that the dimensions are known.

vec <- as.numeric(mat)
vec

##  [1] -2.21  1.02  0.90  1.30  0.23 -1.10  1.25  0.73 -0.56 -0.55  0.24 -1.51  2.16  0.28 -0.30 -0.91 -1.01
## [18]  0.52  0.63  0.06 -1.30  0.03 -0.28  0.81  0.02 -0.15  0.10 -0.36  0.29 -0.78 -0.89  0.52  0.55 -0.02
## [35]  0.25  2.53  0.35 -0.52 -0.17 -1.31  0.93 -0.68 -1.52 -2.38 -1.76  0.62  1.04 -0.65 -0.93  0.76
.colMeans(vec, m=5, n=10)

##  [1]  0.248 -0.046  0.174 -0.142 -0.144 -0.180  0.082  0.176 -1.082  0.168

Equivalents also exist: .rowMeans, .colSums and .rowSums.
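A quick check, under an arbitrary seed, that the dotted versions agree with the regular ones when given the correct dimensions (m rows, n columns):

```r
set.seed(1)                         # arbitrary seed, only for reproducibility
m <- matrix(rnorm(50), nrow = 5)    # 5 rows, 10 columns
v <- as.numeric(m)                  # flattened, column-major

# m = number of rows, n = number of columns of the original matrix
all.equal(.rowMeans(v, m = 5, n = 10), rowMeans(m))
all.equal(.colSums(v, m = 5, n = 10), colSums(m))
```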

split / unsplit

split() and unsplit() are a somewhat convenient way to do split-apply-combine tasks in base R.

Split iris dataset by Species:

dats <- split(iris, iris$Species)

Scale Sepal.Length to have mean 0 and variance 1 within each species separately.

dats <- lapply(dats, transform, Sepal.Length.Scaled=scale(Sepal.Length))

And recombine:

dats <- unsplit(dats, iris$Species)
head(dats)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.Scaled
## 1          5.1         3.5          1.4         0.2  setosa          0.26667447
## 2          4.9         3.0          1.4         0.2  setosa         -0.30071802
## 3          4.7         3.2          1.3         0.2  setosa         -0.86811050
## 4          4.6         3.1          1.5         0.2  setosa         -1.15180675
## 5          5.0         3.6          1.4         0.2  setosa         -0.01702177
## 6          5.4         3.9          1.7         0.4  setosa          1.11776320

However, there is also an alternative way, using the split<- replacement function:

split(iris$Sepal.Length, iris$Species) <- tapply(iris$Sepal.Length, iris$Species, scale)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   0.26667447         3.5          1.4         0.2  setosa
## 2  -0.30071802         3.0          1.4         0.2  setosa
## 3  -0.86811050         3.2          1.3         0.2  setosa
## 4  -1.15180675         3.1          1.5         0.2  setosa
## 5  -0.01702177         3.6          1.4         0.2  setosa
## 6   1.11776320         3.9          1.7         0.4  setosa

Or for all the columns in one go:

split(iris[,1:4], iris$Species) <- Map(scale, split(iris[,1:4], iris$Species))
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   0.26667447   0.1899414   -0.3570112  -0.4364923  setosa
## 2  -0.30071802  -1.1290958   -0.3570112  -0.4364923  setosa
## 3  -0.86811050  -0.6014810   -0.9328358  -0.4364923  setosa
## 4  -1.15180675  -0.8652884    0.2188133  -0.4364923  setosa
## 5  -0.01702177   0.4537488   -0.3570112  -0.4364923  setosa
## 6   1.11776320   1.2451711    1.3704625   1.4613004  setosa
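Note that the assignments above overwrite iris in the current session. A minimal sketch of the same idea on a copy, followed by a check of the per-species mean and standard deviation, might look like this (dat and groups are illustrative names):

```r
# Work on a copy so that the built-in data set stays untouched
dat <- datasets::iris

groups <- split(dat$Sepal.Length, dat$Species)
split(dat$Sepal.Length, dat$Species) <- lapply(groups, function(v) as.numeric(scale(v)))

# Within each species the mean should now be ~0 and the standard deviation 1
tapply(dat$Sepal.Length, dat$Species, mean)
tapply(dat$Sepal.Length, dat$Species, sd)
```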

approximate pattern matching

grep() is an often used function to search for strings matching a specified pattern.

grep("^New", state.name, value=TRUE)

## [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"

And, perhaps lesser known, agrep() allows approximate matching with mistakes.

agrep("New Jork", state.name, value=TRUE)

## [1] "New York"
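agrep() is based on a generalized edit distance. As a rough illustration, adist() from the utils package computes such distances directly, which makes it possible to see how far the misspelled query is from each candidate. The variable d is just an illustrative name.

```r
# Edit distance between the misspelled query and every state name
d <- adist("New Jork", state.name)   # 1 x 50 matrix of distances

state.name[which.min(d)]             # closest state name
min(d)                               # number of edits needed
```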

repeating expressions

Repeatedly taking the average of 10 random numbers (here 10 times) can be done with a for loop:

res <- numeric(10)
for(i in 1:10) {
  res[i] <- mean(rnorm(10))
}

res

##  [1]  0.24478981  0.35531128 -0.26747828  0.23896692 -0.48851018 -0.07703305  0.09979184  0.04338666
##  [9]  0.08408048  0.20875063

And, perhaps more elegantly, with an sapply() call:

sapply(1:10, function(x) mean(rnorm(10)))

##  [1] -0.27061107 -0.44295415  0.27849580 -0.24784995 -0.31960419 -0.20858685  0.05398348 -0.78761535
##  [9]  0.06744736  0.19869348

However R also has a dedicated function, just for a task like this:

replicate(10, mean(rnorm(10)))

##  [1]  0.143974137 -0.220335083  0.003203138 -0.158991423 -0.454941456 -0.130968802  0.296338363 -0.170892634
##  [9] -0.131411247 -0.012410875
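A short sketch of how replicate() shapes its result when the repeated expression returns more than one number. The seed is arbitrary and only makes the run reproducible.

```r
set.seed(42)   # arbitrary seed, only to make the run reproducible

# Each repetition returns two numbers, so the result is a 2 x 5 matrix
r <- replicate(5, range(rnorm(10)))
dim(r)

# simplify = FALSE keeps the results in a list instead
l <- replicate(5, range(rnorm(10)), simplify = FALSE)
length(l)
```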

obtaining all combinations

Starting with 6 letters, how many different 2-letter combinations can be obtained, if order does not matter and without repeats?

choose(6, 2)

## [1] 15

Here they are:

combn(letters[1:6], 2)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
## [1,] "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"  "b"  "c"   "c"   "c"   "d"   "d"   "e"  
## [2,] "b"  "c"  "d"  "e"  "f"  "c"  "d"  "e"  "f"  "d"   "e"   "f"   "e"   "f"   "f"

Applying a function to each combination:

combn(letters[1:6], 2, FUN=function(x) paste(x, collapse="+"))

##  [1] "a+b" "a+c" "a+d" "a+e" "a+f" "b+c" "b+d" "b+e" "b+f" "c+d" "c+e" "c+f" "d+e" "d+f" "e+f"

And if the order does matter and repeats are allowed (showing first 10):

head(expand.grid(letters[1:6], letters[1:6]), 10)

##    Var1 Var2
## 1     a    a
## 2     b    a
## 3     c    a
## 4     d    a
## 5     e    a
## 6     f    a
## 7     a    b
## 8     b    b
## 9     c    b
## 10    d    b
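For the remaining case, where order matters but repeats are not allowed, one possible sketch is to filter the expand.grid() result. The column names first and second are made up for this example.

```r
all_pairs <- expand.grid(first = letters[1:6], second = letters[1:6],
                         stringsAsFactors = FALSE)

# Drop the pairs where the same letter appears twice
ordered_pairs <- all_pairs[all_pairs$first != all_pairs$second, ]

nrow(all_pairs)       # 6 * 6 ordered pairs with repeats
nrow(ordered_pairs)   # 6 * 5 ordered pairs without repeats
```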

changing values to NA

In a vector of 10 numbers:

x <- sample(10)
x

##  [1] 10  7  3  1  4  2  5  6  9  8

The typical way to change all values above 5 to NA is:

x[x>5] <- NA

However, the is.na<- replacement function provides a rarely used alternative:

is.na(x) <- x > 5
x

##  [1] NA NA  3  1  4  2  5 NA NA NA
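The right-hand side does not have to be a logical vector. As a small sketch, positions and names can be used as well, since the value is used as an index into x:

```r
x <- c(10, 7, 3, 1, 4)

# is.na(x) <- idx is carried out as x <- `is.na<-`(x, idx)
is.na(x) <- c(1, 2)      # positions instead of a logical vector
x

y <- c(a = 1, b = 2, c = 3)
is.na(y) <- "b"          # names work as well
y
```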

assigning operators

The possibility of creating custom infix operators with the %...% syntax is well known. Here is an example of an operator that is the opposite of %in%:

`%out%` <- function(x, y) !(x %in% y)

And a list of letters without vowels.

LETTERS[LETTERS %out% c("A", "E", "I", "O", "U")]

##  [1] "B" "C" "D" "F" "G" "H" "J" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "V" "W" "X" "Y" "Z"

It is also possible to create a custom replacement (assignment) function, similar to names(x)<-. As an example, here is a function that replaces the first element of a vector.

`first<-` <- function(x, value) c(value, x[-1])

It then can be used to replace the first element of any vector.

x <- 1:10
first(x) <- 0
x

##  [1]  0  2  3  4  5  6  7  8  9 10
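What happens behind the scenes can be sketched as follows: the assignment form is rewritten into an explicit call of the replacement function, whose last argument has to be named value, and the result is assigned back to the variable.

```r
`first<-` <- function(x, value) c(value, x[-1])

x <- 1:10
first(x) <- 0                     # syntactic sugar ...

y <- 1:10
y <- `first<-`(y, value = 0)      # ... for this explicit call

identical(x, y)
```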

However, a more surprising construct is a combination of the two. Here is an example of a function that replaces the elements falling outside of a specified set.

`%out%<-` <- function(x, y, value) {x[!(x %in% y)] <- value; x}

And here it is in action:

x <- 1:10
x %out% c(4,5,6,7) <- 0
x

##  [1] 0 0 0 4 5 6 7 0 0 0

Maybe even more surprising is that this can be used on standard operators (those without %...%). Below is a function that modifies the first argument of a product so that the product is equal to the given value.

`*<-` <- function(x, y, value) x*value/(x*y)
x <- 5
y <- 2

x * y

## [1] 10

Here is a line that modifies x so that the product of x and y is equal to 1.

x * y <- 1

x

## [1] 0.5
x * y

## [1] 1

And here is an even bigger contraption – assignment from both sides:

`<-<-` <- function(x, y, value) x <- paste0(y, "_", value)
"start" -> x <- "end"
x

## [1] "start_end"

multiple linear regressions

A somewhat hidden feature of lm() is that it accepts Y in a matrix format and fits a separate regression for each column. This is also a lot faster than calling lm() once per column.

As an example, each numeric variable in the iris dataset is regressed against Species, which estimates the coefficients of 4 separate linear models. Because the iris columns were scaled within each species in the split / unsplit section above, all coefficients come out numerically zero.

lm(data.matrix(iris[,-5]) ~ iris$Species)

## 
## Call:
## lm(formula = data.matrix(iris[, -5]) ~ iris$Species)
## 
## Coefficients:
##                         Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
## (Intercept)             -8.346e-17     2.555e-16    3.243e-16     2.853e-16 
## iris$Speciesversicolor   1.316e-16    -5.809e-16    1.191e-16    -7.439e-16 
## iris$Speciesvirginica   -4.441e-17    -7.772e-16    1.998e-16     3.775e-16
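A small sketch confirming that the matrix form gives the same coefficients as separate fits. It uses mtcars with two arbitrarily chosen responses, so the numbers are not affected by the earlier changes to iris.

```r
# Two responses fitted in a single call (mtcars used only as an example)
fit_both <- lm(cbind(mpg, hp) ~ wt, data = mtcars)

# The same two models fitted one at a time
fit_mpg <- lm(mpg ~ wt, data = mtcars)
fit_hp  <- lm(hp ~ wt, data = mtcars)

coef(fit_both)                                      # one column per response
all.equal(coef(fit_both)[, "mpg"], coef(fit_mpg))
all.equal(coef(fit_both)[, "hp"],  coef(fit_hp))
```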

color palette

R has over 650 named colors. Here are 20 random colors from that list:

head(sample(colors()), 20)

##  [1] "steelblue4"      "grey26"          "dodgerblue2"     "grey4"           "plum3"          
##  [6] "slateblue4"      "seashell3"       "darkgrey"        "grey42"          "azure2"         
## [11] "grey45"          "mediumorchid3"   "lightgoldenrod1" "skyblue4"        "azure"          
## [16] "darkolivegreen1" "orange1"         "gray14"          "indianred3"      "lightgray"

palette() allows changing the colors represented by numbers.

palette(c("cornflowerblue", "orange", "limegreen", "pink", "purple", "lightslategrey"))
pie(table(chickwts$feed), col=1:6)

To restore the colors:

palette("default")
pie(table(chickwts$feed), col=1:6)
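If the palette in effect is not the default one, a possible alternative is to remember it first and put it back afterwards. A minimal sketch, where the names old and two are illustrative:

```r
old <- palette()                          # remember the current palette
palette(c("cornflowerblue", "orange"))    # numbers 1 and 2 now map to these
two <- palette()

palette(old)                              # put the previous palette back
identical(palette(), old)
```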

color interpolation

Sometimes it is necessary to color a numeric variable by its value. For this purpose colorRamp can create a function that will interpolate a given set of colors to the [0,1] interval.

pal <- colorRamp(c("blue", "green", "orange", "red"))

Then we can obtain a color corresponding to any number between 0 and 1:

pal(0.5)

##       [,1] [,2] [,3]
## [1,] 127.5  210    0

Sadly we also need to transform it to an acceptable format first:

rgb(pal(0.5), max=255)

## [1] "#7FD200"

And here it is used to color the points by horsepower:

# first - transform hp to a range 0-1
hp01 <- (mtcars$hp - min(mtcars$hp)) / diff(range(mtcars$hp))

plot(mtcars$hp, mtcars$mpg, pch=19, col=rgb(pal(hp01), max=255))
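The conversion step can be wrapped into a small helper. The function name ramp_hex below is hypothetical. When only a fixed number of evenly spaced colors is needed, colorRampPalette() returns hex strings directly.

```r
pal <- colorRamp(c("blue", "green", "orange", "red"))

# Hypothetical helper: map numbers in [0, 1] straight to hex color strings
ramp_hex <- function(x) rgb(pal(x), maxColorValue = 255)

ends <- ramp_hex(c(0, 1))                      # the two end points of the ramp
ends

# colorRampPalette() returns n evenly spaced colors as hex strings
cols <- colorRampPalette(c("blue", "red"))(5)
cols
```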

screens

Sometimes it is convenient to place a plot within a plot. One way to achieve this is with split.screen():

figs <- rbind(c(0.0, 1.0, 0.0, 1.0),
              c(0.3, 0.5, 0.6, 0.8)
              )
screenIDs <- split.screen(figs)

screen(screenIDs[1])
barplot(1:10, col="lightslategrey")

screen(screenIDs[2])
par(mar=c(0,0,0,0))
pie(1:5)

hooks

Hooks are a mechanism for injecting a function after a certain action takes place. They are used sparingly within R. For this demonstration the plot.new hook will be used.

This hook allows the user to insert an action at the end of the plot.new() function. Here it will be used for adding a date stamp to every created plot.

setHook("plot.new", function() {mtext(Sys.Date(), 3, adj=1, xpd=TRUE)}, "append")

Now all plots should have a date:

par(mfrow=c(1,2))

plot(density(iris$Sepal.Width), lwd=2, col="lightslategrey", main="density")
pie(table(mtcars$gear))
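A registered hook stays active for the rest of the session. The sketch below shows one way to inspect and clear it. Note that replacing with NULL drops every function registered under that hook name, not only the one added here. The function stamp is a harmless stand-in used only for illustration.

```r
# A harmless hook function, used here only to inspect the registry
stamp <- function() message("plot.new() was called")

setHook("plot.new", stamp, "append")
n_set <- length(getHook("plot.new"))       # number of hooks now registered

# Replacing with NULL drops every hook registered under this name
setHook("plot.new", NULL, "replace")
n_cleared <- length(getHook("plot.new"))
```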

the dollar operator

The dollar operator $ is used to select elements from a list by name. However, it is a generic function and its methods can be redefined.

Here is a rewriting of the $ operator to select rows, instead of columns, from data.frames:

`$.data.frame` <- function(x, name) {x[rownames(x)==name,]}

And in action:

USArrests$Utah
##      Murder Assault UrbanPop Rape
## Utah    3.2     120       80 22.9
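Redefining the method for data.frame affects every data.frame in the session, so ordinary column access such as iris$Species stops working. A more contained sketch, assuming a made-up subclass called rowframe, limits the new behavior to objects that opt in:

```r
# 'rowframe' is a made-up class name used only for this sketch
`$.rowframe` <- function(x, name) x[rownames(x) == name, ]

rf <- USArrests
class(rf) <- c("rowframe", class(rf))

utah <- rf$Utah          # selects the row named "Utah"
utah[["Murder"]]

# Ordinary data.frames are unaffected (assuming $.data.frame was not redefined)
USArrests$Murder[1:3]
```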

Auto-completion after pressing tab can also be added by rewriting the .DollarNames method:

.DollarNames.data.frame <- function(x, pattern="") {
  grep(pattern, rownames(x), value=TRUE)
}

Then after pressing tab on USArrests$A:

> USArrests$A
USArrests$Alabama   USArrests$Alaska    USArrests$Arizona   USArrests$Arkansas

To add more weirdness, tab auto-completion can be made to auto-correct row name mistakes:

.DollarNames.data.frame <- function(x, pattern="") {
  agrep(pattern, rownames(x), value=TRUE, max.distance=0.25)
}

Then after pressing tab on USArrests$Kali:

> USArrests$Kali
USArrests$California