Rcel, Part 2

Author: Beau B. Bruce, MD, PhD

Affiliation: Emory University

Exposition

Introduction

In the first part of this exercise we learned about vector and factor, two composite data types, as we build toward a data.frame. As you recall, a vector requires each element to be of the same atomic data type.

Now, we meet the first composite R data type that can contain different atomic types as elements simultaneously: list. A specialized list called data.frame will be used to store our datasets and will allow us to operate on our datasets as spreadsheet-like objects.

Lists

Starting with list, a list allows you to mix ‘n’ match atomic data types. You build a list with the function list, like this:

l <- list("A", 2, TRUE)

See the subtle difference between l (“ell”) and 1 (“one”). Notice that there is also a difference between O (“oh”) and 0 (“zero”). This is important so keep an eye out for it throughout this lesson.

Try creating l with the expression above now.

To extract an element from a list, use double square brackets. For example, l[[1]] will extract the first element of l.

Try it now.
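Here is a small sketch of how double square brackets differ from single square brackets on a list. The list pets is a made-up example (it is not part of the exercise), so running this will not change your l.

```r
# 'pets' is a hypothetical list used only for illustration
pets <- list("cat", 3, FALSE)

first_value <- pets[[1]]  # double brackets: the element itself
first_piece <- pets[1]    # single brackets: a list of length 1 holding that element

class(first_value)  # "character"
class(first_piece)  # "list"
```

Double brackets are usually what you want when you need to work with the value stored inside the list.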

Now test if the first element of l really is a character.

Now test if the second element of l really is numeric.

Now check if the third element of l really is logical.
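If you want to compare your approach, the sketch below runs the same kind of type checks on a made-up list called pets (a hypothetical name, not part of the exercise). It assumes you tested the elements with the is.* family of functions.

```r
# 'pets' is a hypothetical list used only for illustration
pets <- list("cat", 3, FALSE)

is.character(pets[[1]])  # TRUE
is.numeric(pets[[2]])    # TRUE
is.logical(pets[[3]])    # TRUE

# A check can also come back FALSE: the first element is not numeric
is.numeric(pets[[1]])    # FALSE
```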

Finally, what function do you think checks if something really is a list? Try it now on l.

Each element does not have to be a single atomic value. You can use longer vectors.

Reassign l to be list("A", c(1,3,2,4), TRUE).

Now examine the value of l and notice how R gives you hints about how to access the elements.

Now examine the second element of l.
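As a rough illustration of a list that holds a longer vector, here is a sketch with a made-up list named scores (hypothetical, used only here). Printing it shows the same [[1]], [[2]], [[3]] labels you saw for l.

```r
# 'scores' is a hypothetical list used only for illustration
scores <- list("B", c(10, 30, 20), FALSE)

print(scores)          # the [[1]], [[2]], [[3]] labels show how to reach each element

second <- scores[[2]]  # the whole numeric vector stored in position 2
length(second)         # 3
length(scores)         # 3 (the list has three elements, whatever their lengths)
```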

You can also name the elements in a list like this:

l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)

Try it now.

Now examine the value of l2 and notice how R gives you hints about how to access the elements.

So, those hints tell you to use a dollar sign to access the element by name. For example, to access the logical value in l2, type l2$c.
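The sketch below shows name-based access on a made-up list, scores2 (a hypothetical name). It also shows that double square brackets with a quoted name should give the same result as the dollar sign.

```r
# 'scores2' is a hypothetical named list used only for illustration
scores2 <- list(grade = "B", points = c(10, 30, 20), passed = FALSE)

names(scores2)        # "grade" "points" "passed"
scores2$points        # access by name with the dollar sign
scores2[["points"]]   # the same element, using the name inside double brackets
scores2[[2]]          # the same element again, by position
```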

Recall that with a vector you can assign to a specific element. Let’s create a vector to try this with. Type x <- c(1, 3, 2, 4).

If we wanted to reassign the second element to be 100, what would we do?

Now examine x to check the result.

Now, change the third element of x to 5.
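For comparison, here is a minimal sketch of element assignment on a different vector, y (a made-up name, so your x is left alone).

```r
# 'y' is a hypothetical vector used only for illustration
y <- c(10, 30, 20, 40)

y[1] <- 99   # replace the first element; the other elements are untouched
y            # 99 30 20 40
```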

You can also assign to parts of a list. Let’s remember what l looks like first. Type l.

So if I wanted to assign to the 3rd element of the vector that is the 2nd element of l, what would I do? Start by showing me what the second element of l is.

l[[2]] is a vector, and I want its third element. Recall that single square brackets get an element of a vector. If I tell you that you can treat l[[2]] just like the name for that vector, what would you type to get the third element?

Now, change that same value to 42.
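The same pattern can be sketched on a made-up list, scores (hypothetical, not part of the exercise): double brackets pick the vector out of the list, and single brackets then pick an element of that vector.

```r
# 'scores' is a hypothetical list used only for illustration
scores <- list("B", c(10, 30, 20), FALSE)

scores[[2]][1]        # 10: first element of the vector in position 2
scores[[2]][1] <- 0   # assign to that element in place
scores[[2]]           # 0 30 20
scores[[1]]           # "B": the other list elements are unchanged
```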

Dataframes

Now, to the last major data type that we are going to learn about in this course, the data.frame. A data.frame is a specialized list where all the elements of the list have equal length. It is perfect for representing data where there are several values (in columns) per observation (in rows).

You create a data.frame with the data.frame function. Type:

df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4), 
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")), 
                 sorethroat = factor(c("no", "no", "no", "no", "yes", 
                                       "yes", "yes", "yes")
                                    )
                )

The line breaks here are not strictly required; they are there to make it clearer where the parentheses open and close. The note below shows the same expression on a single line.

Note

df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4), sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")), sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))

Now type df to take a look at your new data.frame!
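To see what "a specialized list where all the elements have equal length" means in practice, here is a sketch using a tiny made-up data.frame called toy (hypothetical, so your df is untouched). The use of try() is just one way to show the error without stopping the script.

```r
# 'toy' is a hypothetical data.frame used only for illustration
toy <- data.frame(id = c(1, 2, 3),
                  smoker = factor(c("yes", "no", "yes")))

is.list(toy)        # TRUE: a data.frame really is a list
is.data.frame(toy)  # TRUE
nrow(toy)           # 3 observations (rows)
ncol(toy)           # 2 variables (columns)

# Columns whose lengths do not fit together (here 3 and 2) are rejected
bad <- try(data.frame(id = c(1, 2, 3), smoker = c("yes", "no")), silent = TRUE)
inherits(bad, "try-error")  # TRUE
```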

OK, do you remember how to access an element of a list by name? With the $, right? So how could you get just the age column out of df?

You can also use summary on data.frames. Try it now on df.

You can check what variables are in a data.frame like this (it also works for a list): names(df). Try it now.
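Here is a short sketch of names and summary on a made-up data.frame, toy (a hypothetical name). The printed summary is left for you to inspect: numeric columns get quartiles and a mean, and factor columns get counts per level.

```r
# 'toy' is a hypothetical data.frame used only for illustration
toy <- data.frame(id = c(1, 2, 3),
                  smoker = factor(c("yes", "no", "yes")))

names(toy)          # "id" "smoker"
summary(toy)        # one summary per column
levels(toy$smoker)  # "no" "yes" (factor levels are sorted alphabetically by default)
```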

There are several ways to access elements in a data.frame and each will be useful somewhere as we get better at programming. We can access a single value by row and column. Rows first, columns second. Try df[2, 3].

If we leave out the column specification, you get the whole row: df[2, ]. Try it.

If we leave out the row specification, you get the whole column: df[, 3]. Try it.

You can also use the column name instead of the index when you are accessing elements or columns. An example would be: df[, "sorethroat"]. Try it.
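The indexing forms above can be put side by side in one sketch. It uses a made-up data.frame called toy (hypothetical), so you can compare the results without changing df.

```r
# 'toy' is a hypothetical data.frame used only for illustration
toy <- data.frame(id = c(1, 2, 3),
                  smoker = factor(c("yes", "no", "yes")))

toy[2, 1]          # single value: row 2, column 1, which is 2
toy[2, ]           # whole second row, returned as a one-row data.frame
toy[, 2]           # whole second column, returned as a factor
toy[, "smoker"]    # the same column, selected by name
toy[2, "smoker"]   # row 2 of the smoker column: no
```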

Finally, you can use the subset command to find rows that match certain criteria. Try subset(df, age == 2) to find the observations where the age of the subject is 2.
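It may help to see what subset is working with. In this sketch (toy is a made-up data.frame), the condition is a logical vector with one TRUE or FALSE per row, and subset keeps the rows where it is TRUE.

```r
# 'toy' is a hypothetical data.frame used only for illustration
toy <- data.frame(id = c(1, 2, 3, 4),
                  age = c(5, 7, 5, 6))

toy$age == 5                    # TRUE FALSE TRUE FALSE

fives <- subset(toy, age == 5)  # inside subset, column names can be used directly
nrow(fives)                     # 2
fives$id                        # 1 3
```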

Experimentation

Now try to combine your knowledge of the &, |, and ! operators (you may only need one of those) to find those subjects who are both age 2 and have a sore throat using the subset function.

What comes out of the subset function is a data.frame that you can further subset using the operators you already know. For example, try subset(df, sorethroat == "yes")$age.

And we can apply functions to that like summary, e.g., summary(subset(df, sorethroat == "yes")$age). Try it.
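The subset-then-column pattern works with other functions too. This is a sketch on a made-up data.frame, toy, with a hypothetical cough variable; mean is used here only as another example of a function that takes a vector.

```r
# 'toy' and its 'cough' column are hypothetical, used only for illustration
toy <- data.frame(age = c(5, 7, 5, 6),
                  cough = factor(c("yes", "no", "no", "yes")))

cough_ages <- subset(toy, cough == "yes")$age  # ages of the rows with cough "yes"
cough_ages         # 5 6
mean(cough_ages)   # 5.5
```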

R has a lot of built-in data.frames. One we’ll use a lot in this course is the esoph dataset. You load it with data(esoph). Do that now.

Now that it is loaded (under the name esoph) use summary to examine it.
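Alongside summary, it can be useful to check the size of esoph and the exact labels of its grouping variables, since comparisons with == must match those labels exactly. A brief sketch, assuming the esoph dataset that ships with R:

```r
data(esoph)

nrow(esoph)          # 88 rows
head(esoph, 3)       # first three rows
levels(esoph$agegp)  # age group labels, e.g. "25-34"
levels(esoph$tobgp)  # tobacco group labels, e.g. "10-19" and "30+"
```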

Extract esoph’s 4th row.

List the variable names in esoph.

esoph is in a case-control format where each row represents multiple cases and controls, counted in the variables ncases and ncontrols. So, if you are asked how many observations there are, you’d add the number of cases and controls.

So, write code that will give the number of cases which have tobacco consumption (tobgp) of 30+ gm/day. You’ll find the function sum helpful, which takes a vector and adds it up. You’ll use it in a pattern just like you did with summary above on a column of a subset.
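To show the sum pattern without giving the answer away, here is a sketch on a made-up case-control table, cc. Its exposure variable and all of its counts are invented for illustration.

```r
# 'cc' is a hypothetical case-control table; all values are made up
cc <- data.frame(exposure  = factor(c("low", "low", "high", "high")),
                 ncases    = c(1, 2, 5, 4),
                 ncontrols = c(10, 12, 8, 6))

sum(cc$ncases)  # 12: all cases

# Cases in the rows where exposure is "high"
high_cases <- sum(subset(cc, exposure == "high")$ncases)        # 5 + 4 = 9

# Observations (cases plus controls) in those same rows
high_total <- high_cases +
  sum(subset(cc, exposure == "high")$ncontrols)                 # 9 + 14 = 23
```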

OK, here is your most challenging problem yet. What is the number of observations (ncases + ncontrols) where both the age group (agegp) is 25-34 years and tobacco consumption is 10-19 gm/day?

Note

Some things that work are:

sum(subset(esoph, agegp == "25-34" & tobgp == "10-19")$ncases) + sum(subset(esoph, agegp == "25-34" & tobgp == "10-19")$ncontrols)

sum(subset(esoph, agegp == "25-34" & tobgp == "10-19")[, c("ncases", "ncontrols")])

sum(esoph$ncases[esoph$agegp == "25-34" & esoph$tobgp == "10-19"], esoph$ncontrols[esoph$agegp == "25-34" & esoph$tobgp == "10-19"])
