Applying Yourself

Exercise

The `apply` family

R has a group of functions that all work in a similar fashion. Each of these functions takes a function as an argument and applies that function to each of the elements of one of its other arguments.
These functions can replace most for-loops in R and they are faster too.
The model function for this family of functions that takes a function as an argument and applies it to another one of its arguments is: apply.
(Programmers are not that creative with names - nor should they be!)

Not only are apply family functions generally faster than loops, but they are also more “R-like.” That is to say, when you think in terms of these functions you are thinking more like an R programmer. Just like different human languages have different figures of speech, idioms, and modes of thinking that characterize them, so do computer languages.

As we have discussed previously, R lends itself to functional types of programming idioms, one of which is called mapping. Mapping means taking a function and “mapping” it or “applying” it to every sub-element of vector, list, or array. So the apply family is a very “functional” programming idiom.

So, let’s get started with the apply function. The apply function works on the rows or columns of an data.frame or array. We have not met an array before but they are essentially a vector that has 2 or more dimensions (e.g., a table has two dimensions) of all the same atomic data type.
We will create one that contains 10,000 coin flip experiments using 100 flips in each experiment.
The function rbinom generates random binomially distributed data.

# generates 100 fair coin flips
rbinom(100, 1, 0.5)

##   [1] 0 1 1 0 0 1 1 1 0 1 1 1 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 1 0 0 0 1 1 0 0
##  [36] 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 1 1 0 1 0 0 0 0 1
##  [71] 1 1 0 1 0 1 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0 0 0 1 1 1

# do it 10,000 times!
experiments <- replicate(10000, rbinom(100, 1, 0.5))
# first 10 rows and first 10 columns
experiments[1:10, 1:10]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    1    0    0    0    1    0    1    0     0
##  [2,]    0    1    1    0    0    1    0    0    1     1
##  [3,]    0    0    0    0    1    0    0    0    0     1
##  [4,]    1    1    1    1    0    0    0    1    0     1
##  [5,]    0    0    0    0    1    0    1    0    0     1
##  [6,]    1    0    1    0    0    1    0    1    1     1
##  [7,]    1    0    0    0    0    1    0    0    1     1
##  [8,]    1    1    0    0    1    1    1    1    1     1
##  [9,]    1    0    1    0    0    0    0    1    1     1
## [10,]    1    1    1    1    1    1    1    1    1     1

# what size is it?
NROW(experiments)

## [1] 100

NCOL(experiments)

## [1] 10000

dim(experiments)

## [1]   100 10000

So we have 10,000 experiments of 100 coin flips each. Each expermient is in one column (because there are 100 rows). If we wanted to plot a histogram of how many heads were flipped (calling 1 as heads and 0 as tails) then we could use the apply function to add each column up before calling hist:

# the 2 represents the columns (remember they come "second" when
# talking about rows vs. columns); you would use 1 instead if you
# wanted to apply the function to the rows
heads <- apply(experiments, 2, sum)
length(heads)

## [1] 10000

hist(heads)

mean(heads)

## [1] 49.9342

var(heads)

## [1] 24.90696

`lapply`

You probably recall from your basic statistics class that as the sample size increases our confidence in the mean increases. That means that the standard deviation / variance around the estimate of the mean decreases. Let’s test that in R. We could do the following, but it is a lot of typing to do almost the same thing:

sd(replicate(1000, mean(rnorm(10,   100, 25)))) # 1000 experiments of n=10

## [1] 8.23617

sd(replicate(1000, mean(rnorm(100,  100, 25)))) # 1000 experiments of n=100

## [1] 2.445228

sd(replicate(1000, mean(rnorm(1000, 100, 25)))) # 1000 experiments of n=1000

## [1] 0.7706157

What if we converted the second part of the experession to a function like this?

rsd <- function(n) sd(replicate(1000, mean(rnorm(n, 100, 25))))

Then, we could use the lapply function to complete these three lines of code in one:

lapply(c(10, 100, 1000), rsd)

## [[1]]
## [1] 7.967816
## 
## [[2]]
## [1] 2.404179
## 
## [[3]]
## [1] 0.7704122

It can get annoying to name little utility functions like rmean so it turns out that in R you can just use what is called lambda functions, i.e., unnamed functions. So, you could do the above like this:

lapply(c(10, 100, 1000), function(n) sd(replicate(1000, mean(rnorm(n, 100, 25)))))

## [[1]]
## [1] 7.582292
## 
## [[2]]
## [1] 2.4946
## 
## [[3]]
## [1] 0.768337

That way you do not have to name the other function. It is your call when you name the function and use a lambda function.
Most advanced programmers would do the latter in this case because it is short enough to comprehend easily, but either is absolutely acceptable.

Now, let’s return not only the variance but the mean. Maybe here the function is complicated enough that it is worth naming separately for clarity:

rmv <- function(n) {
  r <- replicate(1000, mean(rnorm(n, 100, 25)))
  list(mean = mean(r), var = var(r))
}
lapply(c(10, 100, 1000), rmv)

## [[1]]
## [[1]]$mean
## [1] 100.2671
## 
## [[1]]$var
## [1] 60.95239
## 
## 
## [[2]]
## [[2]]$mean
## [1] 99.99799
## 
## [[2]]$var
## [1] 6.910136
## 
## 
## [[3]]
## [[3]]$mean
## [1] 100.0094
## 
## [[3]]$var
## [1] 0.682358

Even more compact - and adding one more power of 10?

lapply(10^(1:4), rmv)

## [[1]]
## [[1]]$mean
## [1] 100.2189
## 
## [[1]]$var
## [1] 62.61557
## 
## 
## [[2]]
## [[2]]$mean
## [1] 100.0215
## 
## [[2]]$var
## [1] 6.039986
## 
## 
## [[3]]
## [[3]]$mean
## [1] 100.0235
## 
## [[3]]$var
## [1] 0.6179419
## 
## 
## [[4]]
## [[4]]$mean
## [1] 100.0015
## 
## [[4]]$var
## [1] 0.06450738

This is less error prone, requires less typing, and is much more clear than cutting and pasting a lot of code.

When you return a single value you can use sapply to return a vector instead of a list:

sapply(10^(1:3), rsd)

## [1] 8.0070338 2.5318048 0.7712518

or an array when you return more than one:

sapply(10^(1:3), rmv)

##      [,1]     [,2]     [,3]     
## mean 99.48364 100.0085 99.99947 
## var  58.12069 6.21246  0.6050428

Are you feeling ragged yet?

You will quite frequently run into datasets like the following one listing the ocular photographs of patients, each with a variable number of photographs:

library(adm)
head(photoqual)

##   ptid eye order quality sex age
## 1    1  OD     1       3   F  97
## 2    1  OS     2       2   F  97
## 3    1  OS     3       2   F  97
## 4    1  OS     4       2   F  97
## 5    2  OS     1       4   F  70
## 6    2  OS     2       3   F  70

It may be tempting to run the summary function on the dataset, but you cannot do that in this case. You would get a weighted average of age by the number of photographs that the patient had taken. Not the correct thing to report! Plus, the observations are not independent so your inferential statistics would be all messed up.

The tapply function comes to the rescue because it allows a function to be applied to a ragged array, i.e., an unbalanced or an inconsistent number of observations in each set/array. So, what if you wanted to count how many photographs each patient had on each eye? Here’s how you would do it:

tapply(photoqual$eye, photoqual$ptid,function(x) summary(x))

## $`1`
## OD OS 
##  1  3 
## 
## $`2`
## OD OS 
##  1  4 
## 
## $`3`
## OD OS 
##  3  0 
## 
## $`4`
## OD OS 
##  6  3 
## 
## $`5`
## OD OS 
##  2  0 
## 
## $`6`
## OD OS 
##  4  3 
## 
## $`7`
## OD OS 
##  3  4 
## 
## $`8`
## OD OS 
##  0  4 
## 
## $`9`
## OD OS 
##  3  3 
## 
## $`10`
## OD OS 
##  1  2

Pretty amazing, eh? I think this is even more amazing, it uses the do.call function which takes a function name and applies it to a list (the second argument):

do.call(rbind, tapply(photoqual$eye, photoqual$ptid, function(x) summary(x)))

##    OD OS
## 1   1  3
## 2   1  4
## 3   3  0
## 4   6  3
## 5   2  0
## 6   4  3
## 7   3  4
## 8   0  4
## 9   3  3
## 10  1  2

do.call is useful when you are trying to apply a function like rbind that takes ... as an argument (like sum but not like mean - take a look at the help for each).

By the way, another function called aggregate can do the same thing as the do.call + tapply combination in this case (and is probably better in this instance because you probably want the patient identifer associated with the answer):

aggregate(photoqual$eye, photoqual['ptid'], function(x) summary(x))

##    ptid x.OD x.OS
## 1     1    1    3
## 2     2    1    4
## 3     3    3    0
## 4     4    6    3
## 5     5    2    0
## 6     6    4    3
## 7     7    3    4
## 8     8    0    4
## 9     9    3    3
## 10   10    1    2

However, tapply is a good way to capture a single variable for all patients in a dataset like this. For example, here is the correct summary of the age for these patients:

tapply(photoqual$age, photoqual$ptid,function(x) head(x,1))

##  1  2  3  4  5  6  7  8  9 10 
## 97 70 31 81 30 81 44 49 66 41

summary(tapply(photoqual$age, photoqual$ptid, function(x) head(x,1)))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   41.75   57.50   59.00   78.25   97.00

You are not satisfied with just one argument?!?

There are several other apply family functions. Several are beyond the scope of this class, but as you advance in your R programming, I want you to know they are there and to give you a brief taste of what is possible with one of them. The major apply functions that we are “skipping” are: mapply, rapply, vapply, and eapply. I would recommend that if you want to learn more that you proceed in that order. Here is a taste of mapply which uses each of the arguments after the function in order and walks through both simultaneously similar to a previous example:

mapply(rep, 1:4, 4:1)

## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4

# not the same as:
rep(1:4, 4:1)

##  [1] 1 1 1 1 2 2 2 3 3 4

# but the same as:
list(
  rep(1, 4),
  rep(2, 3),
  rep(3, 2),
  rep(4, 1)
)

## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4

Useful? You betcha! Is that obvious to you now? Probably not… but be on the lookout. The apply family can save you enormous amounts of time and can make you feel like you have a magic wand for data.

Advanced tables

Markdown tables are fairly limited. However you can create them programmatically as follows:

library(knitr)
tab <- do.call(rbind, tapply(photoqual$eye, photoqual$ptid, function(x) summary(x)))
kable(tab)

OD	OS
1	3
1	4
3	0
6	3
2	0
4	3
3	4
0	4
3	3
1	2

What you cannot see is that the block of R markdown is specified as {r, results = "asis"} instead of just {r}. You only need to call the knitr library once in a given file.

However, if you want to span columns or rows then you will need HTML, the language of the World Wide Web. Teaching you a lot about HTML is beyond the scope of this class, but essentially R Markdown allows raw HTML if you specify it directly or if your R code returns valid HTML in an {r, results = "asis"} block:

cat("This is <b>bold</b> and this is <i>italics</i>.")

This is bold and this is italics.

cat is a function very similar to print but is even more “raw” than print essentially just spitting out exactly what you give it. You can separate multiple things to print out by commas.

Here is an excellent tutorial on HTML tables: http://www.w3schools.com/html/html_tables.asp. Ignore CSS for now (that’s how you do special formatting) and focus on the main table tags: table (overall table), tr (table row), th (table heading), td (table data).
With those tags and their attributes, you can build fancy tables:

cat('<table>
<tr><th>Animal</th><th colspan="2">Size</th></tr>
<tr><td></td><td>Small</td><td>Big</td></tr>
<tr><td>Dogs</td><td>5</td><td>10</td></tr>
<tr><td>Cats</td><td>20</td><td>3</td></tr>
</table>')

Animal	Size
	Small	Big
Dogs	5	10
Cats	20	3

That is a lot of typing. So I have written a html function in the adm package to make it “easier” at least for the reptative row elements:

cat("<table>",
    '<tr><th>Animal</th><th colspan="2">Size</th></tr>',
    html(
      list(
          html(list("Animal", "Small", "Big"), "td"),
          html(list("Dogs", 5, 10), "td"),
          html(list("Cats", 20, 3), "td")
      ), 
      "tr"
    ),
  "</table>"
)

Animal	Size
Animal	Small	Big
Dogs	5	10
Cats	20	3

Maybe that doesn’t look any easier, but that’s where the apply family comes to the rescue for bigger data:

animals <- data.frame(Animal = c("Dogs", "Cats", "Birds", "Elephants"),
                      Small = c(5, 10, 15, 0),
                      Big = c(10, 3, 10, 20))
animals

##      Animal Small Big
## 1      Dogs     5  10
## 2      Cats    10   3
## 3     Birds    15  10
## 4 Elephants     0  20

cat("<table>",
    '<tr><th>Animal</th><th colspan="2">Size</th></tr>',
    html(html(names(animals), "td"), "tr"),
    sapply(apply(animals, 1, html, "td"), html, "tr"),
    "</table>"
)

Animal	Size
Animal	Small	Big
Dogs	5	10
Cats	10	3
Birds	15	10
Elephants	0	20

An additional text function that you will likely find helpful is sprintf. It has a little bit of a complicated syntax, but it is very powerful for formatting your data as text:

cat("<table>",
    '<tr><th>Animal</th><th colspan="4">Size</th></tr>',
    '<tr><td></td><td>Big: n</td><td>(%)</td><td>Small: n</td><td>(%)</td><tr>',
    sapply(
      apply(animals, 1,
            function(row) {
              html(
                list(row[1],
                  row[2],
                  sprintf("(%.1f)", as.numeric(row[2])*100/sum(as.numeric(row[2:3]))),
                  row[3],
                  sprintf("(%.1f)", as.numeric(row[3])*100/sum(as.numeric(row[2:3])))),
                "td"
              )
            }),
      html,
      "tr"
      ),
    "</table>"
)

Animal	Size
	Big: n	(%)	Small: n	(%)
Dogs	5	(33.3)	10	(66.7)
Cats	10	(76.9)	3	(23.1)
Birds	15	(60.0)	10	(40.0)
Elephants	0	(0.0)	20	(100.0)

Read the help for sprintf for more detail.