apply
familyR has a group of functions that all work in a similar fashion. Each of these functions takes a function as an argument and applies that function to each of the elements of one of its other arguments.
These functions can replace most for
-loops in R and they are faster too.
The model function for this family of functions that takes a function as an argument and applies it to another one of its arguments is: apply
.
(Programmers are not that creative with names - nor should they be!)
Not only are apply
family functions generally faster than loops, but they are also more “R-like.” That is to say, when you think in terms of these functions you are thinking more like an R programmer. Just like different human languages have different figures of speech, idioms, and modes of thinking that characterize them, so do computer languages.
As we have discussed previously, R lends itself to functional types of programming idioms, one of which is called mapping. Mapping means taking a function and “mapping” it or “applying” it to every sub-element of vector, list, or array. So the apply
family is a very “functional” programming idiom.
So, let’s get started with the apply
function. The apply
function works on the rows or columns of an data.frame
or array
. We have not met an array
before but they are essentially a vector
that has 2 or more dimensions (e.g., a table has two dimensions) of all the same atomic data type.
We will create one that contains 10,000 coin flip experiments using 100 flips in each experiment.
The function rbinom
generates random binomially distributed data.
# generates 100 fair coin flips
rbinom(100, 1, 0.5)
## [1] 0 1 1 0 0 1 1 1 0 1 1 1 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 1 0 0 0 1 1 0 0
## [36] 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 1 1 0 1 0 0 0 0 1
## [71] 1 1 0 1 0 1 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0 0 0 1 1 1
# do it 10,000 times!
experiments <- replicate(10000, rbinom(100, 1, 0.5))
# first 10 rows and first 10 columns
experiments[1:10, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 1 0 0 0 1 0 1 0 0
## [2,] 0 1 1 0 0 1 0 0 1 1
## [3,] 0 0 0 0 1 0 0 0 0 1
## [4,] 1 1 1 1 0 0 0 1 0 1
## [5,] 0 0 0 0 1 0 1 0 0 1
## [6,] 1 0 1 0 0 1 0 1 1 1
## [7,] 1 0 0 0 0 1 0 0 1 1
## [8,] 1 1 0 0 1 1 1 1 1 1
## [9,] 1 0 1 0 0 0 0 1 1 1
## [10,] 1 1 1 1 1 1 1 1 1 1
# what size is it?
NROW(experiments)
## [1] 100
NCOL(experiments)
## [1] 10000
dim(experiments)
## [1] 100 10000
So we have 10,000 experiments of 100 coin flips each. Each expermient is in one column (because there are 100 rows). If we wanted to plot a histogram of how many heads were flipped (calling 1 as heads and 0 as tails) then we could use the apply function to add each column up before calling hist
:
# the 2 represents the columns (remember they come "second" when
# talking about rows vs. columns); you would use 1 instead if you
# wanted to apply the function to the rows
heads <- apply(experiments, 2, sum)
length(heads)
## [1] 10000
hist(heads)
mean(heads)
## [1] 49.9342
var(heads)
## [1] 24.90696
lapply
You probably recall from your basic statistics class that as the sample size increases our confidence in the mean increases. That means that the standard deviation / variance around the estimate of the mean decreases. Let’s test that in R. We could do the following, but it is a lot of typing to do almost the same thing:
sd(replicate(1000, mean(rnorm(10, 100, 25)))) # 1000 experiments of n=10
## [1] 8.23617
sd(replicate(1000, mean(rnorm(100, 100, 25)))) # 1000 experiments of n=100
## [1] 2.445228
sd(replicate(1000, mean(rnorm(1000, 100, 25)))) # 1000 experiments of n=1000
## [1] 0.7706157
What if we converted the second part of the experession to a function like this?
rsd <- function(n) sd(replicate(1000, mean(rnorm(n, 100, 25))))
Then, we could use the lapply
function to complete these three lines of code in one:
lapply(c(10, 100, 1000), rsd)
## [[1]]
## [1] 7.967816
##
## [[2]]
## [1] 2.404179
##
## [[3]]
## [1] 0.7704122
It can get annoying to name little utility functions like rmean
so it turns out that in R you can just use what is called lambda functions, i.e., unnamed functions. So, you could do the above like this:
lapply(c(10, 100, 1000), function(n) sd(replicate(1000, mean(rnorm(n, 100, 25)))))
## [[1]]
## [1] 7.582292
##
## [[2]]
## [1] 2.4946
##
## [[3]]
## [1] 0.768337
That way you do not have to name the other function. It is your call when you name the function and use a lambda function.
Most advanced programmers would do the latter in this case because it is short enough to comprehend easily, but either is absolutely acceptable.
Now, let’s return not only the variance but the mean. Maybe here the function is complicated enough that it is worth naming separately for clarity:
rmv <- function(n) {
r <- replicate(1000, mean(rnorm(n, 100, 25)))
list(mean = mean(r), var = var(r))
}
lapply(c(10, 100, 1000), rmv)
## [[1]]
## [[1]]$mean
## [1] 100.2671
##
## [[1]]$var
## [1] 60.95239
##
##
## [[2]]
## [[2]]$mean
## [1] 99.99799
##
## [[2]]$var
## [1] 6.910136
##
##
## [[3]]
## [[3]]$mean
## [1] 100.0094
##
## [[3]]$var
## [1] 0.682358
Even more compact - and adding one more power of 10?
lapply(10^(1:4), rmv)
## [[1]]
## [[1]]$mean
## [1] 100.2189
##
## [[1]]$var
## [1] 62.61557
##
##
## [[2]]
## [[2]]$mean
## [1] 100.0215
##
## [[2]]$var
## [1] 6.039986
##
##
## [[3]]
## [[3]]$mean
## [1] 100.0235
##
## [[3]]$var
## [1] 0.6179419
##
##
## [[4]]
## [[4]]$mean
## [1] 100.0015
##
## [[4]]$var
## [1] 0.06450738
This is less error prone, requires less typing, and is much more clear than cutting and pasting a lot of code.
When you return a single value you can use sapply
to return a vector
instead of a list
:
sapply(10^(1:3), rsd)
## [1] 8.0070338 2.5318048 0.7712518
or an array
when you return more than one:
sapply(10^(1:3), rmv)
## [,1] [,2] [,3]
## mean 99.48364 100.0085 99.99947
## var 58.12069 6.21246 0.6050428
You will quite frequently run into datasets like the following one listing the ocular photographs of patients, each with a variable number of photographs:
library(adm)
head(photoqual)
## ptid eye order quality sex age
## 1 1 OD 1 3 F 97
## 2 1 OS 2 2 F 97
## 3 1 OS 3 2 F 97
## 4 1 OS 4 2 F 97
## 5 2 OS 1 4 F 70
## 6 2 OS 2 3 F 70
It may be tempting to run the summary
function on the dataset, but you cannot do that in this case. You would get a weighted average of age by the number of photographs that the patient had taken. Not the correct thing to report! Plus, the observations are not independent so your inferential statistics would be all messed up.
The tapply
function comes to the rescue because it allows a function to be applied to a ragged array, i.e., an unbalanced or an inconsistent number of observations in each set/array. So, what if you wanted to count how many photographs each patient had on each eye? Here’s how you would do it:
tapply(photoqual$eye, photoqual$ptid,function(x) summary(x))
## $`1`
## OD OS
## 1 3
##
## $`2`
## OD OS
## 1 4
##
## $`3`
## OD OS
## 3 0
##
## $`4`
## OD OS
## 6 3
##
## $`5`
## OD OS
## 2 0
##
## $`6`
## OD OS
## 4 3
##
## $`7`
## OD OS
## 3 4
##
## $`8`
## OD OS
## 0 4
##
## $`9`
## OD OS
## 3 3
##
## $`10`
## OD OS
## 1 2
Pretty amazing, eh? I think this is even more amazing, it uses the do.call
function which takes a function name and applies it to a list (the second argument):
do.call(rbind, tapply(photoqual$eye, photoqual$ptid, function(x) summary(x)))
## OD OS
## 1 1 3
## 2 1 4
## 3 3 0
## 4 6 3
## 5 2 0
## 6 4 3
## 7 3 4
## 8 0 4
## 9 3 3
## 10 1 2
do.call
is useful when you are trying to apply
a function like rbind
that takes ...
as an argument (like sum
but not like mean
- take a look at the help for each).
By the way, another function called aggregate
can do the same thing as the do.call
+ tapply
combination in this case (and is probably better in this instance because you probably want the patient identifer associated with the answer):
aggregate(photoqual$eye, photoqual['ptid'], function(x) summary(x))
## ptid x.OD x.OS
## 1 1 1 3
## 2 2 1 4
## 3 3 3 0
## 4 4 6 3
## 5 5 2 0
## 6 6 4 3
## 7 7 3 4
## 8 8 0 4
## 9 9 3 3
## 10 10 1 2
However, tapply
is a good way to capture a single variable for all patients in a dataset like this. For example, here is the correct summary
of the age for these patients:
tapply(photoqual$age, photoqual$ptid,function(x) head(x,1))
## 1 2 3 4 5 6 7 8 9 10
## 97 70 31 81 30 81 44 49 66 41
summary(tapply(photoqual$age, photoqual$ptid, function(x) head(x,1)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 41.75 57.50 59.00 78.25 97.00
There are several other apply
family functions. Several are beyond the scope of this class, but as you advance in your R programming, I want you to know they are there and to give you a brief taste of what is possible with one of them. The major apply functions that we are “skipping” are: mapply
, rapply
, vapply
, and eapply
. I would recommend that if you want to learn more that you proceed in that order. Here is a taste of mapply
which uses each of the arguments after the function in order and walks through both simultaneously similar to a previous example:
mapply(rep, 1:4, 4:1)
## [[1]]
## [1] 1 1 1 1
##
## [[2]]
## [1] 2 2 2
##
## [[3]]
## [1] 3 3
##
## [[4]]
## [1] 4
# not the same as:
rep(1:4, 4:1)
## [1] 1 1 1 1 2 2 2 3 3 4
# but the same as:
list(
rep(1, 4),
rep(2, 3),
rep(3, 2),
rep(4, 1)
)
## [[1]]
## [1] 1 1 1 1
##
## [[2]]
## [1] 2 2 2
##
## [[3]]
## [1] 3 3
##
## [[4]]
## [1] 4
Useful? You betcha! Is that obvious to you now? Probably not… but be on the lookout. The apply family can save you enormous amounts of time and can make you feel like you have a magic wand for data.
Markdown tables are fairly limited. However you can create them programmatically as follows:
library(knitr)
tab <- do.call(rbind, tapply(photoqual$eye, photoqual$ptid, function(x) summary(x)))
kable(tab)
OD | OS |
---|---|
1 | 3 |
1 | 4 |
3 | 0 |
6 | 3 |
2 | 0 |
4 | 3 |
3 | 4 |
0 | 4 |
3 | 3 |
1 | 2 |
What you cannot see is that the block of R markdown is specified as {r, results = "asis"}
instead of just {r}
. You only need to call the knitr
library once in a given file.
However, if you want to span columns or rows then you will need HTML, the language of the World Wide Web. Teaching you a lot about HTML is beyond the scope of this class, but essentially R Markdown allows raw HTML if you specify it directly or if your R code returns valid HTML in an {r, results = "asis"}
block:
cat("This is <b>bold</b> and this is <i>italics</i>.")
This is bold and this is italics.
cat
is a function very similar to print
but is even more “raw” than print essentially just spitting out exactly what you give it. You can separate multiple things to print out by commas.
Here is an excellent tutorial on HTML tables: http://www.w3schools.com/html/html_tables.asp. Ignore CSS for now (that’s how you do special formatting) and focus on the main table tags: table
(overall table), tr
(table row), th
(table heading), td
(table data).
With those tags and their attributes, you can build fancy tables:
cat('<table>
<tr><th>Animal</th><th colspan="2">Size</th></tr>
<tr><td></td><td>Small</td><td>Big</td></tr>
<tr><td>Dogs</td><td>5</td><td>10</td></tr>
<tr><td>Cats</td><td>20</td><td>3</td></tr>
</table>')
Animal | Size | |
---|---|---|
Small | Big | |
Dogs | 5 | 10 |
Cats | 20 | 3 |
That is a lot of typing. So I have written a html
function in the adm
package to make it “easier” at least for the reptative row elements:
cat("<table>",
'<tr><th>Animal</th><th colspan="2">Size</th></tr>',
html(
list(
html(list("Animal", "Small", "Big"), "td"),
html(list("Dogs", 5, 10), "td"),
html(list("Cats", 20, 3), "td")
),
"tr"
),
"</table>"
)
Animal | Size | |
---|---|---|
Animal | Small | Big |
Dogs | 5 | 10 |
Cats | 20 | 3 |
Maybe that doesn’t look any easier, but that’s where the apply
family comes to the rescue for bigger data:
animals <- data.frame(Animal = c("Dogs", "Cats", "Birds", "Elephants"),
Small = c(5, 10, 15, 0),
Big = c(10, 3, 10, 20))
animals
## Animal Small Big
## 1 Dogs 5 10
## 2 Cats 10 3
## 3 Birds 15 10
## 4 Elephants 0 20
cat("<table>",
'<tr><th>Animal</th><th colspan="2">Size</th></tr>',
html(html(names(animals), "td"), "tr"),
sapply(apply(animals, 1, html, "td"), html, "tr"),
"</table>"
)
Animal | Size | |
---|---|---|
Animal | Small | Big |
Dogs | 5 | 10 |
Cats | 10 | 3 |
Birds | 15 | 10 |
Elephants | 0 | 20 |
An additional text function that you will likely find helpful is sprintf
. It has a little bit of a complicated syntax, but it is very powerful for formatting your data as text:
cat("<table>",
'<tr><th>Animal</th><th colspan="4">Size</th></tr>',
'<tr><td></td><td>Big: n</td><td>(%)</td><td>Small: n</td><td>(%)</td><tr>',
sapply(
apply(animals, 1,
function(row) {
html(
list(row[1],
row[2],
sprintf("(%.1f)", as.numeric(row[2])*100/sum(as.numeric(row[2:3]))),
row[3],
sprintf("(%.1f)", as.numeric(row[3])*100/sum(as.numeric(row[2:3])))),
"td"
)
}),
html,
"tr"
),
"</table>"
)
Animal | Size | |||
---|---|---|---|---|
Big: n | (%) | Small: n | (%) | |
Dogs | 5 | (33.3) | 10 | (66.7) |
Cats | 10 | (76.9) | 3 | (23.1) |
Birds | 15 | (60.0) | 10 | (40.0) |
Elephants | 0 | (0.0) | 20 | (100.0) |
Read the help for sprintf
for more detail.
What do you expect the average number of heads is if you run apply
on the rows rather than the columns of experiments is? Check it.
How could you use do.call
and rbind
to simplify the shape changes we have been doing to esoph
?
Why do we need as.numeric
in the last part of the section above?
Use tapply
to determine the mean quality each patient had in each eye. That is calculate the mean of the quality for the first patient’s right eye, the first patient’s left eye, etc.
Look at the help for tapply
and realize that you can use more than one factor as the INDEX
. Note that the help says it has to be a list
. How do you make a list
?
Why are some values missing in your result?