Exercise

In the first part of this exercise we learned about vectors and factors, two building blocks of a data.frame. Recall that every element of a vector must have the same atomic data type. Now we meet the first composite data type, one that can hold elements of different atomic types at once: the list. A specialized list called a data.frame will store our datasets and let us work with them as spreadsheet-like objects.

Lists

A list allows you to mix ‘n’ match atomic data types. You build a list with the function list().

l <- list("A", 2, TRUE)
is.character(l[[1]])  # see the subtle difference between l ("ell") and 1 ("one")?
## [1] TRUE
is.numeric(l[[2]])
## [1] TRUE
is.logical(l[[3]])
## [1] TRUE
is.list(l)
## [1] TRUE

Did you notice the [[]] syntax used to index a list? An element of a list does not have to be a single atomic value; it can also be a vector:

l <- list("A", c(1,3,2,4), TRUE)
l # notice the hints that R gives you about how to access a given element like it did for vectors?
## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 3 2 4
## 
## [[3]]
## [1] TRUE
l[[2]]
## [1] 1 3 2 4
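The difference between double and single brackets on a list is easy to miss, so here is a small sketch of it. It reuses the list l from above; the names inner and wrapped are introduced only for this illustration.

```r
l <- list("A", c(1, 3, 2, 4), TRUE)

# Double brackets extract the element itself (here, a numeric vector)
inner <- l[[2]]
is.numeric(inner)
## [1] TRUE

# Single brackets return a shorter list that still wraps the element
wrapped <- l[2]
is.list(wrapped)
## [1] TRUE
length(wrapped)
## [1] 1

# Indexes can be chained: the third value of the second element
l[[2]][3]
## [1] 2
```

In short, use [[]] when you want the contents and [] when you want a smaller list.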

You can also name the elements and then use the $ syntax to access them by name:

l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)
l2 # again notice R's hints about accessing elements
## $a
## [1] "A"
## 
## $b
## [1] 1 3 2 4
## 
## $c
## [1] TRUE
l2$c
## [1] TRUE

Extending assignment

With a vector you can reassign a specific element like this:

x <- c(1, 3, 2, 4)
x[2] <- 2
x
## [1] 1 2 2 4

You can also assign to parts of a list using the list syntax (see Explore and Extend below).
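As a minimal sketch of what assigning into a list can look like, the lines below rebuild the named list l2 from earlier and change it in four ways. The element name d and the replacement values are made up for this illustration.

```r
l2 <- list(a = "A", b = c(1, 3, 2, 4), c = TRUE)

l2[[1]] <- "Z"     # replace an element by position
l2$c <- FALSE      # replace an element by name
l2$b[2] <- 2       # replace one value inside a vector element
l2$d <- "new"      # assigning to an unused name appends an element

l2$b
## [1] 1 2 2 4
names(l2)
## [1] "a" "b" "c" "d"
```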

Data frames

A data.frame is a specialized list where all the elements of the list have equal length. It is perfect for representing data where there are several values (in columns) per observation (in rows). You create a data.frame with the data.frame() function:

df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4), 
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")), 
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))
df
##   age sex sorethroat
## 1   3   m         no
## 2   2   f         no
## 3   3   m         no
## 4   3   f         no
## 5   1   m        yes
## 6   2   f        yes
## 7   4   m        yes
## 8   4   f        yes
df$sex
## [1] m f m f m f m f
## Levels: f m
summary(df)
##       age       sex   sorethroat
##  Min.   :1.00   f:4   no :4     
##  1st Qu.:2.00   m:4   yes:4     
##  Median :3.00                   
##  Mean   :2.75                   
##  3rd Qu.:3.25                   
##  Max.   :4.00

We will usually import a data.frame from an external file, such as an Excel spreadsheet, but there are times when you want to build a data.frame from scratch, especially when you restructure data.

You can check what variables are in a data.frame like this (names() also works for a list):

names(df)
## [1] "age"        "sex"        "sorethroat"
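Because a data.frame is a specialized list, the list tools from earlier apply to it as well. The sketch below rebuilds the same df so that it runs on its own and checks a few of these properties.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))

is.list(df)    # a data.frame really is a list
## [1] TRUE
length(df)     # one list element per column
## [1] 3
df[["age"]]    # list-style indexing returns the column
## [1] 3 2 3 3 1 2 4 4
dim(df)        # number of rows, then number of columns
## [1] 8 3
```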

We can reference an exact row and column pair in a data.frame (rows first, columns second - this is the usual convention in mathematics and programming):

df[2, 3]
## [1] no
## Levels: no yes

If we leave out the row or column specification, we get the whole row or column:

df[2, ] # entire second row
##   age sex sorethroat
## 2   2   f         no
df[, 3] # entire third column
## [1] no  no  no  no  yes yes yes yes
## Levels: no yes

You can use the name of the column too:

df[2, "sorethroat"]  # same as df[2, 3]
## [1] no
## Levels: no yes
df[, "sorethroat"]   # same as df[, 3]
## [1] no  no  no  no  yes yes yes yes
## Levels: no yes
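Note that selecting a single column this way returns a plain vector or factor, not a data.frame. The sketch below shows the drop = FALSE argument, which keeps the result as a one-column data.frame, and a vector of names for picking several columns. The df is rebuilt so the example runs on its own; col_vec and col_df are names chosen for this illustration.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))

col_vec <- df[, "sorethroat"]                 # a factor
col_df  <- df[, "sorethroat", drop = FALSE]   # a one-column data.frame

is.factor(col_vec)
## [1] TRUE
is.data.frame(col_df)
## [1] TRUE
dim(col_df)
## [1] 8 1

# Several columns at once: pass a vector of column names
df[1:2, c("age", "sex")]
##   age sex
## 1   3   m
## 2   2   f
```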

Finally, you can use the subset() function to find rows that match certain criteria:

subset(df, age == 2)
##   age sex sorethroat
## 2   2   f         no
## 6   2   f        yes
subset(df, age == 2 & sorethroat == "yes")
##   age sex sorethroat
## 6   2   f        yes

The result of subset() is itself a new data.frame, so you can subset it again or operate on its columns:

subset(df, sorethroat == "yes")$age
## [1] 1 2 4 4
median(subset(df, sorethroat == "yes")$age)
## [1] 3
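The same rows can also be selected with a logical vector inside the [row, column] syntax shown earlier. The sketch below is one way to do this, not a replacement for subset(); the df is rebuilt so the example runs on its own, and keep is a name chosen for this illustration.

```r
df <- data.frame(age = c(3, 2, 3, 3, 1, 2, 4, 4),
                 sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
                 sorethroat = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")))

keep <- df$sorethroat == "yes"   # one TRUE/FALSE per row
keep
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

df[keep, "age"]                  # ages for the rows where keep is TRUE
## [1] 1 2 4 4
median(df[keep, "age"])
## [1] 3
```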

Explore and Extend

as.character(df$sorethroat)
## [1] "no"  "no"  "no"  "no"  "yes" "yes" "yes" "yes"

Evaluate

R includes several datasets. You can see what is included by running the data() command. For this evaluation, we will use the esoph dataset. You can load it with data(esoph). After that, you can refer to it simply as esoph. If you are curious about the data, run ?esoph to pull up its help file. The data are in a case-control format where each row represents multiple cases and controls, counted in the variables ncases and ncontrols. So, if you are asked how many observations there are, add the number of cases and the number of controls. Include the R code for each problem; do not work the answers out by hand only.

  1. Display a simple summary of esoph.

  2. Extract the 4th row of esoph.

  3. List the names of esoph.

  4. What is the number of cases which have tobacco consumption (tobgp) of 30+ gm/day? You’ll find the command sum() helpful. Use it like median() above.

  5. What is the number of observations where both the age group (agegp) is 25-34 years and tobacco consumption is 10-19 gm/day?