Rcel, Part 2
Exposition
Introduction
In the first part of this exercise we learned about vector and factor, two composite data types, as we build toward a data.frame. As you recall, a vector requires every element to be of the same atomic data type.
Now, we meet the first composite R data type that can contain different atomic types as elements simultaneously: list. A specialized list called data.frame will be used to store our datasets and will allow us to operate on our datasets as spreadsheet-like objects.
Lists
A list allows you to mix ‘n’ match atomic data types. You build a list with the function list, like this:
l <- list("A", 2, TRUE)
See the subtle difference between l (“ell”) and 1 (“one”). Notice that there is also a difference between O (“oh”) and 0 (“zero”). This is important so keep an eye out for it throughout this lesson.
Try creating l with the expression above now.
Type l <- list("A", 2, TRUE).
l <- list("A", 2, TRUE)
To extract an element from a list, use double square brackets. For example, l[[1]] will extract the first element of l.
Try it now.
l[[1]]
Now test if the first element of l really is a character.
Use is.character, l, and [[1]] somehow.
is.character(l[[1]])
Now test if the second element of l really is numeric.
Use is.numeric, l, and [[2]] somehow.
is.numeric(l[[2]])
Now check if the third element of l really is logical.
Use is.logical, l, and [[3]] somehow.
is.logical(l[[3]])
Finally, what function do you think checks if something really is a list? Try it now on l.
is.list?
is.list(l)
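The difference between double and single square brackets on a list can be sketched as follows. This is only an illustration; the names first_value and first_slice are made up for this example and are not part of the lesson.

```r
l <- list("A", 2, TRUE)

first_value <- l[[1]]  # double brackets: the element itself
first_slice <- l[1]    # single brackets: a list of length one

is.character(first_value)  # TRUE
is.list(first_slice)       # TRUE
is.character(first_slice)  # FALSE, because it is still a list

# Check the class of every element in one go
sapply(l, class)
```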
Each element does not have to be a single atomic value. You can use longer vectors.
Reassign l to be list("A", c(1,3,2,4), TRUE).
Type l <- list("A", c(1,3,2,4), TRUE)
l <- list("A", c(1,3,2,4), TRUE)
Now examine the value of l and notice how R gives you hints about how to access the elements.
Nothing more than just typing l (“ell”).
l
Now examine the second element of l.
Don’t forget the double bracket!
l[[2]]
You can also name the elements in a list like this:
l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)
Try it now.
Type l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)
l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)
Now examine the value of l2 and notice how R gives you hints about how to access the elements.
Type l2, that is ell two (l2) not twelve (12).
l2
So, those hints tell you to use a dollar sign to access the element by name. For example, to access the logical value in l2, type l2$c.
Type l2$c
l2$c
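For comparison, here is a short sketch of the different ways to reach the same named element. It assumes l2 is defined exactly as above; the double-bracket forms follow the same rule as for unnamed lists.

```r
l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)

names(l2)   # the element names: "a" "b" "c"
l2$b        # access by name with the dollar sign
l2[["b"]]   # the same element, name given as a character string
l2[[2]]     # the same element, by position
```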
With a vector, recall that you can assign to a specific element. Let’s create a vector to try this with. Type x <- c(1, 3, 2, 4).
Type x <- c(1, 3, 2, 4)
x <- c(1, 3, 2, 4)
If we wanted to reassign the second element to be 100, what would we do?
Type x[2] <- 100
x[2] <- 100
Now examine x to check the result.
Type x
x
Now, change the third element of x to 5.
Type x[3] <- 5
x[3] <- 5
You can also assign to parts of a list. Let’s remember what l looks like first. Type l.
Type l (“ell”)
l
So if you wanted to assign to the 3rd element of the vector that is the 2nd element of l, what would you do? Start by telling me what the second element of l is.
Type l[[2]] (don’t forget the double brackets)
l[[2]]
l[[2]] is a vector. I want the third element. I use single square brackets to get an element. If I tell you that you can treat l[[2]] just like the name for that vector, what would you type to get the third element?
Type l[[2]][3]. If you didn’t get it, let’s talk more about it in class.
l[[2]][3]
Now, change that same value to 42.
Use l[[2]][3], the assignment operator (<-), and 42.
l[[2]][3] <- 42
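Put together, the nested assignment can be sketched like this. The second half applies the same pattern to the named list l2 from earlier, which is an extension not asked for in the steps above.

```r
l <- list("A", c(1, 3, 2, 4), TRUE)

l[[2]][3] <- 42   # third element of the vector stored in position 2
l[[2]]            # the vector is now 1, 3, 42, 4

# The same pattern works when the element is reached by name
l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)
l2$b[3] <- 42
l2$b
```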
Dataframes
Now, to the last major data type that we are going to learn about in this course, the data.frame. A data.frame is a specialized list where all the elements of the list have equal length. It is perfect for representing data where there are several values (in columns) per observation (in rows).
You create a data.frame with the data.frame function. Type:
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no", "yes",
                                       "yes", "yes", "yes")
                                     )
                 )
Here, the places where I put the line endings are not strictly required; they are there to make it clear where the parentheses open and close.
Just type it out carefully. I know it is a pain, but it will make you appreciate the other ways we will create these in the future!
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4), sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")), sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))
Now type df to take a look at your new data.frame!
Type df
df
OK, do you remember how to access the element of a list by name? The $, right? So how could you get just the age column out of df?
Does df$age make sense?
df$age
You can also use summary on data.frames. Try it now on df.
Does summary(df) make sense?
summary(df)
You can check what variables are in a data.frame like this (it also works for a list): names(df). Try it now.
Type names(df)
names(df)
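Because a data.frame is a specialized list, the list tools from earlier apply to it directly. The following sketch rebuilds df as defined above and inspects it as a list of equal-length columns.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no",
                                       "yes", "yes", "yes", "yes")))

is.list(df)        # TRUE: a data.frame is a list underneath
length(df)         # 3, one list element per column
nrow(df)           # 8, the common length of every column
sapply(df, class)  # the class of each column
```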
There are several ways to access elements in a data.frame and each will be useful somewhere as we get better at programming. We can access a single value by row and column. Rows first, columns second. Try df[2, 3].
Type df[2, 3]
df[2, 3]
If we leave out the column specification, you get the whole row: df[2, ]. Try it.
Type df[2, ]
df[2, ]
If we leave out the row specification, you get the whole column: df[, 3]. Try it.
Type df[, 3]
df[, 3]
You can also use the column name instead of the index when you are accessing elements or columns. An example would be: df[, "sorethroat"]. Try it.
Type df[, "sorethroat"]
df[, "sorethroat"]
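To see that these access styles reach the same data, here is a brief sketch using df as defined above. The identical checks are an addition for illustration.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no",
                                       "yes", "yes", "yes", "yes")))

# Three ways to reach the sorethroat value for the second subject
df[2, 3]
df[2, "sorethroat"]
df$sorethroat[2]

# Three ways to reach the whole sorethroat column
identical(df[, 3], df$sorethroat)             # TRUE
identical(df[, "sorethroat"], df$sorethroat)  # TRUE
```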
Finally, you can use the subset command to find rows that match certain criteria. Try subset(df, age == 2) to find the observations where the age of the subject is 2.
Type subset(df, age == 2)
subset(df, age == 2)
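As a rough sketch of what subset is doing, the same rows can be selected with a logical vector in the row position of the square brackets. The names by_subset and by_index are made up for this comparison.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no",
                                       "yes", "yes", "yes", "yes")))

df$age == 2   # a logical vector, TRUE where the age is 2

by_subset <- subset(df, age == 2)
by_index  <- df[df$age == 2, ]   # rows where the test is TRUE, all columns

all.equal(by_subset, by_index)   # both approaches select rows 2 and 6
```

Inside subset you can write age on its own; with square brackets you have to spell out df$age.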
Experimentation
Now try to combine your knowledge of the &, |, and ! operators (you may only need one of those) to find those subjects who are both age 2 and have a sore throat using the subset function.
The wording implies and (&). Don’t forget the quotes around "yes" for sorethroat, because you are comparing against a character value, not a variable name.
subset(df, age == 2 & sorethroat == "yes")
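To make the choice of operator concrete, the sketch below runs the same two conditions through & and |, and negates one with !. The variable names both, either, and no_sorethroat are made up for this illustration.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no",
                                       "yes", "yes", "yes", "yes")))

both          <- subset(df, age == 2 & sorethroat == "yes")  # both conditions hold
either        <- subset(df, age == 2 | sorethroat == "yes")  # at least one holds
no_sorethroat <- subset(df, !(sorethroat == "yes"))          # condition negated

nrow(both)           # 1
nrow(either)         # 5
nrow(no_sorethroat)  # 4
```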
What comes out of the subset function is a data.frame that you can further subset using the operators you already know. For example, try subset(df, sorethroat == "yes")$age.
Type subset(df, sorethroat == "yes")$age
subset(df, sorethroat == "yes")$age
And we can apply functions to that like summary, e.g., summary(subset(df, sorethroat == "yes")$age). Try it.
Type summary(subset(df, sorethroat == "yes")$age)
summary(subset(df, sorethroat == "yes")$age)
R has a lot of built in data.frames. One we’ll use a lot in this course is the esoph dataset. You load it with data(esoph). Do that now.
Type data(esoph)
data(esoph)
Now that it is loaded (under the name esoph) use summary to examine it.
Type summary(esoph)
summary(esoph)
Extract esoph’s 4th row.
Type esoph[4, ]
esoph[4, ]
List the variable names in esoph.
Type names(esoph)
names(esoph)
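Since agegp and tobgp are factors, it can help to look at their category labels before filtering on them. This sketch uses levels, the base R function that lists a factor's categories.

```r
data(esoph)

levels(esoph$agegp)  # the age group labels, e.g. "25-34"
levels(esoph$tobgp)  # the tobacco group labels, e.g. "30+"
```

These labels are the character values to compare against inside subset.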
esoph is in a case-control format where each row represents multiple cases and controls, in variables named ncases and ncontrols. So, if you are asked how many observations there are you’d add the number of cases and controls.
So, write code that will give the number of cases which have tobacco consumption (tobgp) of 30+ gm/day. You’ll find the function sum helpful, which takes a vector and adds it up. You’ll use it in a pattern just like you did with summary above on a column of a subset.
subset(esoph, tobgp == "30+") is the kernel of what you need. Then, that is a little data.frame from which you need the variable ncases and you then need to add it up - how do you put it all together?
A couple of examples that work are sum(subset(esoph, tobgp == "30+")$ncases) and sum(esoph$ncases[esoph$tobgp == "30+"]).
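The two approaches can be checked against each other, as in the sketch below. The names heavy_smokers, cases_a, and cases_b are made up for this illustration.

```r
data(esoph)

# Approach 1: build the smaller data.frame, then pull out and sum one column
heavy_smokers <- subset(esoph, tobgp == "30+")
cases_a <- sum(heavy_smokers$ncases)

# Approach 2: index the ncases column directly with a logical vector
cases_b <- sum(esoph$ncases[esoph$tobgp == "30+"])

cases_a == cases_b  # TRUE: both count the same cases
```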
OK, here is your most challenging problem yet. What is the number of observations (ncases + ncontrols) where both the age group (agegp) is 25-34 years and tobacco consumption is 10-19 gm/day?
Don’t overthink it. It is a straightforward extension of the last problem. One solution is to calculate each of the two pieces separately, even though it requires repetition, and simply add them together with +.
Some things that work are sum(subset(esoph, agegp == "25-34" & tobgp == "10-19")$ncases) + sum(subset(esoph, agegp == "25-34" & tobgp == "10-19")$ncontrols), sum(subset(esoph, agegp == "25-34" & tobgp == "10-19")[, c("ncases", "ncontrols")]), and sum(esoph$ncases[esoph$agegp == "25-34" & esoph$tobgp == "10-19"], esoph$ncontrols[esoph$agegp == "25-34" & esoph$tobgp == "10-19"]).
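One way to avoid the repetition is to store the subset in a variable first. This is only a sketch; the names young_light and total are made up for this example.

```r
data(esoph)

# Keep the matching rows once, then reuse them
young_light <- subset(esoph, agegp == "25-34" & tobgp == "10-19")

# Observations = cases + controls
total <- sum(young_light$ncases) + sum(young_light$ncontrols)
total
```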
Evaluation
Submit Your Assignment
Submit your assignment below.