Rcel, part 2

Exercise

In the first part of this exercise we learned about vector and factor, two building blocks of a data.frame. As you recall, a vector requires every element to be of the same atomic data type. Now we meet the first composite data type, one that can hold elements of different atomic types at the same time: list. A specialized list called data.frame will store our datasets and let us work with them as spreadsheet-like objects.

Lists

A list allows you to mix ‘n’ match atomic data types. You build a list with the function list().

l <- list("A", 2, TRUE)
is.character(l[[1]])  # see the subtle difference between l ("ell") and 1 ("one")?

## [1] TRUE

is.numeric(l[[2]])

## [1] TRUE

is.logical(l[[3]])

## [1] TRUE

is.list(l)

## [1] TRUE
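To see why a list is needed for mixed types, it may help to compare it with a vector built from the same values. This is only a small sketch; the names mixed_vector and mixed_list are made up for illustration and are not part of the exercise. c() converts everything to one type (character here), while list() keeps each element's own type.

```r
# Illustrative names, not used elsewhere in the exercise.
mixed_vector <- c("A", 2, TRUE)     # a vector: every element becomes character
mixed_list   <- list("A", 2, TRUE)  # a list: each element keeps its own type

class(mixed_vector)        # one type for the whole vector
mixed_vector               # the number and the logical are now text

class(mixed_list[[2]])     # still numeric
class(mixed_list[[3]])     # still logical
```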

Did you notice the [[]] syntax used to index a list? An element of a list does not have to be a single atomic value; it can also be a vector:

l <- list("A", c(1,3,2,4), TRUE)
l # notice the hints that R gives you about how to access a given element like it did for vectors?

## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 3 2 4
## 
## [[3]]
## [1] TRUE

l[[2]]

## [1] 1 3 2 4

You can also name the elements, and using the $ syntax you can access those elements using that name:

l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)
l2 # again notice R's hints about accessing elements

## $a
## [1] "A"
## 
## $b
## [1] 1 3 2 4
## 
## $c
## [1] TRUE

l2$c

## [1] TRUE
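A name can also go inside [[ ]], which is handy when the name is stored in a variable. The sketch below uses a made-up list called patient (not part of the exercise) so that it does not give away the questions at the end.

```r
# Hypothetical list for illustration only.
patient <- list(name = "A", scores = c(1, 3, 2, 4), active = TRUE)

patient$scores        # access by name with $
patient[["scores"]]   # the same element, with the name as a string

field <- "scores"     # the name held in a variable
patient[[field]]      # [[ ]] looks up the value of field
patient$field         # $ looks for an element literally called "field": NULL
```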

Extending assignment

With a vector you can reassign a specific element like this:

x <- c(1, 3, 2, 4)
x[2] <- 2
x

## [1] 1 2 2 4

You can also assign to parts of a list using the list syntax (see Explore and Extend below).

Data frames

A data.frame is a specialized list where all the elements of the list have equal length. It is perfect for representing data where there are several values (in columns) per observation (in rows). You create a data.frame with the data.frame() function:

df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4), 
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")), 
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))
df

##   age sex sorethroat
## 1   3   m         no
## 2   2   f         no
## 3   3   m         no
## 4   3   f         no
## 5   1   m        yes
## 6   2   f        yes
## 7   4   m        yes
## 8   4   f        yes

df$sex

## [1] m f m f m f m f
## Levels: f m

summary(df)

##       age       sex   sorethroat
##  Min.   :1.00   f:4   no :4     
##  1st Qu.:2.00   m:4   yes:4     
##  Median :3.00                   
##  Mean   :2.75                   
##  3rd Qu.:3.25                   
##  Max.   :4.00
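The equal-length requirement can be checked directly. The following is a rough sketch with a made-up data.frame called visits (an assumption for illustration, not part of the exercise data). It counts the elements in each column and then shows that data.frame() refuses columns of length 3 and 2. tryCatch() is used only so the error message is captured rather than stopping the script.

```r
# Hypothetical data for illustration only.
visits <- data.frame(id = c(1, 2, 3),
                     clinic = factor(c("north", "south", "north")))

# Every column has the same number of elements.
column_lengths <- sapply(visits, length)
column_lengths

# Columns of length 3 and 2 cannot be combined; capture the error text.
result <- tryCatch(
  data.frame(id = c(1, 2, 3), clinic = c("north", "south")),
  error = function(e) conditionMessage(e)
)
result
```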

We will usually import a data.frame from an external file, like an Excel spreadsheet, but there are times when you want to build a data.frame from scratch, especially as you attempt to restructure data.
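As a rough illustration of importing, here is a sketch that assumes the spreadsheet has been saved as a CSV file, since base R reads CSV directly with read.csv(). The file contents and the names csv_path and imported are invented for this example; a temporary file stands in for a real one.

```r
# Write a small CSV to a temporary file so the example is self-contained.
# The contents are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age,sex,sorethroat",
             "3,m,no",
             "2,f,no",
             "1,m,yes"),
           csv_path)

# stringsAsFactors = TRUE turns text columns into factors. Its default
# changed to FALSE in R 4.0.0, so it is spelled out here.
imported <- read.csv(csv_path, stringsAsFactors = TRUE)
str(imported)

unlink(csv_path)  # remove the temporary file
```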

You can check what variables are in a data.frame like this (it also works for list):

names(df)

## [1] "age"        "sex"        "sorethroat"
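For comparison, here is the same call on a plain list. The list called record is made up for illustration; a list built without names simply has none to report.

```r
# Hypothetical lists for illustration only.
record <- list(id = 7, label = "x")
names(record)          # the element names, as a character vector

names(list(7, "x"))    # an unnamed list has no names: NULL
```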

We can reference an exact row and column pair in a data.frame (rows first, columns second - this is the usual convention in mathematics and programming):

df[2, 3]

## [1] no
## Levels: no yes

If we leave out the row or column specification, we get the whole row or column:

df[2, ] # entire second row

##   age sex sorethroat
## 2   2   f         no

df[, 3] # entire third column

## [1] no  no  no  no  yes yes yes yes
## Levels: no yes

You can use the name of the column too:

df[2, "sorethroat"]  # same as df[2, 3]

## [1] no
## Levels: no yes

df[, "sorethroat"]   # same as df[, 3]

## [1] no  no  no  no  yes yes yes yes
## Levels: no yes
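The row and column positions also accept vectors, so you can pull out several rows or columns at once. A minimal sketch, rebuilding df so it runs on its own; the names first_two and last_four are just for illustration.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))

# Rows 1 and 2, and only the columns age and sex.
first_two <- df[c(1, 2), c("age", "sex")]
first_two

# Rows 5 through 8, all columns.
last_four <- df[5:8, ]
last_four
```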

Finally, you can use the subset command to find rows that match certain criteria:

subset(df, age == 2)

##   age sex sorethroat
## 2   2   f         no
## 6   2   f        yes

subset(df, age == 2 & sorethroat == "yes")

##   age sex sorethroat
## 6   2   f        yes

The result of subset() is a new data.frame that you can index and operate upon:

subset(df, sorethroat == "yes")$age

## [1] 1 2 4 4

median(subset(df, sorethroat == "yes")$age)

## [1] 3
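The subset() calls above can also be written with the [row, column] syntax by putting a logical vector in the row position. This is a sketch of that idea, not a replacement for subset(); the names keep, by_index and by_subset are illustrative.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))

# One TRUE or FALSE per row: TRUE where both conditions hold.
keep <- df$age == 2 & df$sorethroat == "yes"
keep

# Rows where keep is TRUE, all columns.
by_index <- df[keep, ]
by_index

# The same selection with subset(), for comparison.
by_subset <- subset(df, age == 2 & sorethroat == "yes")
by_subset
```

Inside subset() you write age rather than df$age because subset() looks the column names up in df for you.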

Explore and Extend

Demonstrate two ways to extract the numeric element 2 from the second element of l2 above. (Hint: combine the syntax from this exercise with that of the last one, and don’t forget the subtle difference between l (“ell”) and 1 (“one”), which I’m using to keep you on your toes: that’s not 12 but l2.)
What data type is l2[1] vs. l2[[1]] vs. l2$a? What happens when you do l2[c(1,2)] vs. l2[[c(1,2)]]? (Notice the difference between this and how we reference rows and columns of a data.frame.) You know that you can extract elements from a vector using []. So now, can you extract a numeric vector equivalent to c(3,4) from l2 by generalizing the ideas from this question and the last one?
Demonstrate two ways to reassign the second element of l2 to a value of 900. Notice that there is also a difference between O (“oh”) and 0 (“zero”).
Prove that a data.frame is a list. Prove that an arbitrary list like l2 is not a data.frame (guess what the function is to test if something is a data.frame).
In df, replace the 5th observation’s sex with "f". Now, try to replace the 5th observation’s sore throat with "maybe". Can you make sense of the message R prints? With the help of the function as.character, can you create a solution such that at the end df still contains a factor called sorethroat? (By the way, there are as. versions of all the types we’ve studied.) It is used like this:

as.character(df$sorethroat)

## [1] "no"  "no"  "no"  "no"  "yes" "yes" "yes" "yes"
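A few of the other as. functions, shown on stand-alone values so the exercise above is left for you. The factor called severity and its values are made up for illustration.

```r
as.numeric("3.5")    # text to number
as.character(42)     # number to text
as.logical("TRUE")   # text to logical
as.integer(TRUE)     # logical to integer

# Hypothetical values for illustration only.
severity <- as.factor(c("low", "high", "low"))
levels(severity)        # levels are sorted alphabetically by default
as.character(severity)  # back to plain text, in the original order
```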

Evaluate

R includes several datasets. You can see what is included by running the data() command. For this evaluation, we will use the esoph dataset. You can load it like this: data(esoph). After that, you can refer to it with just esoph. If you are curious about the data, run ?esoph, which will pull up the help file for the data. The data are in a case-control format where each row represents multiple cases and controls, counted in the variables ncases and ncontrols. So, if you are asked how many observations there are, you’d add the number of cases and controls. Include the R code for each problem; do not work the answers out only by hand.
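To make the distinction between rows and observations concrete, here is a short sketch. It only shows that the number of rows counts groups, while the people are counted inside ncases and ncontrols. The names n_groups and total_cases are illustrative.

```r
data(esoph)

# Each row is one combination of age, alcohol and tobacco group.
n_groups <- nrow(esoph)
n_groups

# The people in each group are counted in these two columns.
head(esoph[, c("ncases", "ncontrols")], 3)

# Adding up a count column gives a number of people, not a number of rows.
total_cases <- sum(esoph$ncases)
total_cases
```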

Display a simple summary of esoph.

summary(esoph)

##    agegp          alcgp         tobgp        ncases         ncontrols    
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   :11.08  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00  
##  75+  :11                                Max.   :17.000   Max.   :60.00

Extract the 4th row of esoph.

esoph[4, ]

##   agegp     alcgp tobgp ncases ncontrols
## 4 25-34 0-39g/day   30+      0         5

List the names of esoph.

names(esoph)

## [1] "agegp"     "alcgp"     "tobgp"     "ncases"    "ncontrols"

What is the number of cases which have tobacco consumption (tobgp) of 30+ gm/day? You’ll find the command sum() helpful. Use it like median() above.

sum(subset(esoph, tobgp == "30+")$ncases)

## [1] 31

What is the number of observations where both the age group (agegp) is 25-34 years and tobacco consumption is 10-19 gm/day?

sum(subset(esoph, agegp == "25-34" & tobgp == "10-19")[, c("ncases", "ncontrols")])

## [1] 20

# one of many other ways
esoph.ss <- subset(esoph, agegp == "25-34" & tobgp == "10-19")
sum(esoph.ss$ncases) + sum(esoph.ss$ncontrols)

## [1] 20