In the first part of this exercise we learned about vector and factor, which are two building blocks of a data.frame. As you recall, a vector requires that each element be of the same atomic data type. Now we meet the first composite data type that can contain multiple atomic types as elements simultaneously: list. A specialized list called data.frame will be used to store our datasets and will allow us to operate on them as spreadsheet-like objects.
A list allows you to mix ‘n’ match atomic data types. You build a list with the function list().
l <- list("A", 2, TRUE)
is.character(l[[1]]) # see the subtle difference between l ("ell") and 1 ("one")?
## [1] TRUE
is.numeric(l[[2]])
## [1] TRUE
is.logical(l[[3]])
## [1] TRUE
is.list(l)
## [1] TRUE
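To see why a list is needed for mixed types, it may help to compare it with what c() does to the same values. This is only a minimal sketch; the names v and mixed are made up for illustration.

```r
# c() builds a vector, so every element is coerced to one common type
v <- c("A", 2, TRUE)
class(v)               # all three elements have become character strings

# list() keeps each element's own type
mixed <- list("A", 2, TRUE)
sapply(mixed, class)   # one class per element
```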
Did you notice the [[]] syntax used to index a list? Each element of a list does not have to be a single atomic value; it can also be a vector:
l <- list("A", c(1,3,2,4), TRUE)
l # notice the hints that R gives you about how to access a given element like it did for vectors?
## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 3 2 4
##
## [[3]]
## [1] TRUE
l[[2]]
## [1] 1 3 2 4
You can also name the elements, and using the $ syntax you can access those elements by name:
l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)
l2 # again notice R's hints about accessing elements
## $a
## [1] "A"
##
## $b
## [1] 1 3 2 4
##
## $c
## [1] TRUE
l2$c
## [1] TRUE
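The $ syntax expects the name to be typed literally. As a hedged aside, the [[]] syntax also accepts a name given as a string, which is handy when the name is stored in a variable. The list patient and the variable field below are hypothetical and exist only for this sketch.

```r
# 'patient' and 'field' are made-up names for this illustration
patient <- list(id = "P01", visits = c(2, 5, 9), enrolled = TRUE)

patient$visits         # name typed literally after $
patient[["visits"]]    # same element, name given as a string

field <- "enrolled"    # [[ ]] also accepts a name stored in a variable
patient[[field]]
```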
With a vector you can reassign a specific element like this:
x <- c(1, 3, 2, 4)
x[2] <- 2
x
## [1] 1 2 2 4
You can also assign to parts of a list using the list syntax (see Explore and Extend below).
A data.frame is a specialized list where all the elements of the list have equal length. It is perfect for representing data where there are several values (in columns) per observation (in rows). You create a data.frame with the data.frame() function:
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))
df
##   age sex sorethroat
## 1   3   m         no
## 2   2   f         no
## 3   3   m         no
## 4   3   f         no
## 5   1   m        yes
## 6   2   f        yes
## 7   4   m        yes
## 8   4   f        yes
df$sex
## [1] m f m f m f m f
## Levels: f m
summary(df)
##       age       sex   sorethroat
##  Min.   :1.00   f:4   no :4
##  1st Qu.:2.00   m:4   yes:4
##  Median :3.00
##  Mean   :2.75
##  3rd Qu.:3.25
##  Max.   :4.00
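The equal-length requirement can be checked directly. The following is a small sketch, not part of the original exercise; the names ok and res and the columns id and score are invented for it.

```r
# Columns of equal length combine without complaint
ok <- data.frame(id = 1:3, score = c(90, 85, 77))
dim(ok)   # rows, then columns

# Lengths 3 and 2 cannot be lined up, so data.frame() stops with an error.
# (A length that divides evenly, such as 1, would be recycled instead.)
res <- tryCatch(
  data.frame(id = 1:3, score = c(90, 85)),
  error = function(e) conditionMessage(e)
)
res       # the error message, captured as a string
```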
We will usually import a data.frame from an external file, like an Excel spreadsheet, but there are times when you want to build a data.frame from scratch, especially as you attempt to restructure data.
You can check what variables are in a data.frame like this (it also works for a list):
names(df)
## [1] "age" "sex" "sorethroat"
We can reference an exact row and column pair in a data.frame (rows first, columns second; this is the usual convention in mathematics and programming):
df[2, 3]
## [1] no
## Levels: no yes
If we leave out the row or column specification, we get the whole row or column:
df[2, ] # entire second row
##   age sex sorethroat
## 2   2   f         no
df[, 3] # entire third column
## [1] no no no no yes yes yes yes
## Levels: no yes
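Notice that a single column comes back as a plain factor rather than as a data.frame. If you want to keep the data.frame structure, one option is the drop = FALSE argument. The sketch below rebuilds df so that it runs on its own; col_vec and col_df are names chosen for this illustration.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))

col_vec <- df[, 3]                 # simplified to a factor
col_df  <- df[, 3, drop = FALSE]   # stays a one-column data.frame

class(col_vec)
class(col_df)
```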
You can use the name of the column too:
df[2, "sorethroat"] # same as df[2, 3]
## [1] no
## Levels: no yes
df[, "sorethroat"] # same as df[, 3]
## [1] no no no no yes yes yes yes
## Levels: no yes
Finally, you can use the subset() function to find rows that match certain criteria:
subset(df, age == 2)
##   age sex sorethroat
## 2   2   f         no
## 6   2   f        yes
subset(df, age == 2 & sorethroat == "yes")
##   age sex sorethroat
## 6   2   f        yes
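The same rows can also be selected with the [row, column] syntax from above by putting a logical condition in the row position. This is a sketch of that alternative, with df rebuilt so the block runs on its own; by_subset and by_index are illustrative names.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))

# subset() lets you write column names bare
by_subset <- subset(df, age == 2 & sorethroat == "yes")

# With [ , ] each column must be spelled out as df$...
by_index <- df[df$age == 2 & df$sorethroat == "yes", ]

all.equal(by_subset, by_index)   # compare the two results
```

One difference worth knowing: subset() drops rows where the condition is NA, while plain logical indexing returns a row of NAs for them.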
The result is a new data.frame that you can subset and operate upon:
subset(df, sorethroat == "yes")$age
## [1] 1 2 4 4
median(subset(df, sorethroat == "yes")$age)
## [1] 3
Demonstrate two ways to extract the numeric element 2 from the second element of l2 above. (Hint: combine the syntax from this exercise with that of the last one, and don’t forget to look for the subtle difference between l (“ell”) and 1 (“one”), which I’m doing to keep you on your toes: that’s not 12 but l2.)
What data type is l2[1] vs. l2[[1]] vs. l2$a? What happens when you do l2[c(1,2)] vs. l2[[c(1,2)]]? (Notice the difference between this and how we reference rows and columns of a data.frame.) You know that you can extract elements from a vector using []. So now, can you extract a numeric vector equivalent to c(3,4) from l2 by generalizing the ideas from this question and the last one?
Demonstrate two ways to reassign the second element of l2 to a value of 900. Notice that there is also a difference between O (“oh”) and 0 (“zero”).
Prove that a data.frame is a list. Prove that an arbitrary list like l2 is not a data.frame (guess what the function is to test if something is a data.frame).
In df, replace the 5th observation’s sex with "f". Now, try to replace the 5th observation’s sore throat with "maybe". Can you make sense of the warning message? Can you create a solution, with the help of the function as.character(), such that at the end df still contains a factor called sorethroat? (By the way, there are as.*() versions, such as as.numeric() and as.logical(), for all the types we’ve studied.) as.character() is used like this:
as.character(df$sorethroat)
## [1] "no" "no" "no" "no" "yes" "yes" "yes" "yes"
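A few of the other as.*() conversions, sketched on made-up values rather than on df. The factor f below is hypothetical and only serves to show a common pitfall.

```r
as.numeric("3.5")          # character -> numeric
as.integer(TRUE)           # logical   -> integer
as.logical("TRUE")         # character -> logical
as.character(c(1, 2, 3))   # numeric   -> character

# Caution: as.numeric() on a factor returns the internal level codes,
# not the labels, so convert to character first
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # level codes
as.numeric(as.character(f))   # the values the labels spell out
```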
R includes several datasets. You can see what is included by running the data() command. For this evaluation, we will use the esoph dataset. You can load it like this: data(esoph). After that, you can refer to it with just esoph. If you are curious about the data, run ?esoph, which will pull up the help file for the data. This file is in a case-control format where each row represents multiple cases and controls, in variables ncases and ncontrols. So, if you are asked how many observations there are, you’d add the number of cases and controls. Include the R code for each problem. Do not work the answers out only by hand.
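As a sketch of what "adding the number of cases and controls" looks like for the whole dataset (the variable names total_cases, total_controls and total_obs are invented here, and the printed total is left out because the values in esoph may differ between R versions):

```r
data(esoph)

total_cases    <- sum(esoph$ncases)      # all cases across every row
total_controls <- sum(esoph$ncontrols)   # all controls across every row
total_obs      <- total_cases + total_controls
total_obs
```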
Display a summary of esoph.
summary(esoph)
##    agegp          alcgp         tobgp        ncases         ncontrols
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   :11.08
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00
##  75+  :11                                Max.   :17.000   Max.   :60.00
Display the fourth row of esoph.
esoph[4, ]
##   agegp     alcgp tobgp ncases ncontrols
## 4 25-34 0-39g/day   30+      0         5
List the variable names in esoph.
names(esoph)
## [1] "agegp" "alcgp" "tobgp" "ncases" "ncontrols"
How many cases are there with a tobacco consumption (tobgp) of 30+ gm/day? You’ll find the command sum() helpful. Use it like median() above.
sum(subset(esoph, tobgp == "30+")$ncases)
## [1] 31
How many observations are there where the age group (agegp) is 25-34 years and tobacco consumption is 10-19 gm/day?
sum(subset(esoph, agegp == "25-34" & tobgp == "10-19")[, c("ncases", "ncontrols")])
## [1] 20
# one of many other ways
esoph.ss <- subset(esoph, agegp == "25-34" & tobgp == "10-19")
sum(esoph.ss$ncases) + sum(esoph.ss$ncontrols)
## [1] 20