Exercise

Missing data

We’ve already seen the missing data value: NA. This special value allows R to handle missing data gracefully. However, as in other programming languages, you sometimes have to be careful about how you handle NA.

NA
## [1] NA
is.na(NA) # why is.na vs. is.NA?  poor choice IMO.
## [1] TRUE
is.na(2)
## [1] FALSE
myvec <- c(1, 2, NA, 3)
is.na(myvec)
## [1] FALSE FALSE  TRUE FALSE
mean(myvec)
## [1] NA

What happened? Why is the mean of myvec NA? The mean function does this intentionally: returning NA tells you that the vector has missing data, because R cannot know how you want to handle it. Under most circumstances you just want to remove the missing values and calculate the mean of what you have:

mean(myvec, na.rm = TRUE)
## [1] 2
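The na.rm argument is not specific to mean. As a minimal sketch (reusing myvec from above; the sum(is.na(...)) counting idiom is an addition, not something from the text so far), here is how NA propagates and how a few other base summaries handle it:

```r
myvec <- c(1, 2, NA, 3)

# NA propagates through arithmetic and comparisons
NA + 1          # NA
NA == NA        # NA, which is why is.na() is needed to test for it

# count the missing values (each TRUE counts as 1)
n_missing <- sum(is.na(myvec))

# other base summaries accept na.rm too
total <- sum(myvec, na.rm = TRUE)
middle <- median(myvec, na.rm = TRUE)
largest <- max(myvec, na.rm = TRUE)
```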

This is also important when dealing with data.frames. Let’s inject some missing data into esoph.

set.seed(596)
esona <- esoph
esona[sample(NROW(esona), 3), "ncases"] <- NA
esona[sample(NROW(esona), 3), "ncontrols"] <- NA
summary(esona)
##    agegp          alcgp         tobgp        ncases         ncontrols    
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.259   Mean   :10.62  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00  
##  75+  :11                                Max.   :17.000   Max.   :60.00  
##                                          NA's   :3        NA's   :3

Note that summary now reports three NA’s in each of ncases and ncontrols, and that it ignores them when computing the summary statistics.
However, if you try to take the mean of the ncases column without specifying na.rm you’ll get NA again, just like above:

mean(esona$ncases)
## [1] NA
mean(esona$ncases, na.rm = TRUE)
## [1] 2.258824
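If you only want the NA counts rather than the full summary, one possible approach is colSums on is.na. The sketch below uses a small made-up data frame called toy (hypothetical, not part of esoph) so that the numbers are easy to check by hand:

```r
# toy is a made-up data frame used only for illustration
toy <- data.frame(
  id = 1:5,
  ncases = c(0, NA, 2, 4, NA),
  ncontrols = c(40, 10, NA, 5, 27)
)

# is.na() on a data.frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
na_counts <- colSums(is.na(toy))
na_counts

# apply the same na.rm summary to several numeric columns at once
col_means <- sapply(toy[, c("ncases", "ncontrols")], mean, na.rm = TRUE)
col_means
```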

You may be interested in which cases are complete and how many of them there are:

complete.cases(esona)
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [34]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [45]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [56]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
sum(complete.cases(esona))
## [1] 82
# how many incomplete
sum(!complete.cases(esona))
## [1] 6

You can also extract the data.frame where every row has complete data (i.e., the complete cases):

esocc <- na.omit(esona)
NROW(esocc)
## [1] 82
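Sometimes you only care about completeness in particular columns. A minimal sketch, again on a made-up data frame toy (hypothetical), showing that indexing with complete.cases keeps the same rows as na.omit, and that you can restrict the check to a subset of columns:

```r
# toy is a made-up data frame used only for illustration
toy <- data.frame(
  id = 1:5,
  ncases = c(0, NA, 2, 4, NA),
  ncontrols = c(40, 10, NA, 5, 27)
)

# rows with no NA in any column; the same rows that na.omit() keeps
all_complete <- toy[complete.cases(toy), ]
same_rows <- identical(all_complete$id, na.omit(toy)$id)

# rows that are complete for ncases only, ignoring NAs elsewhere
ncases_complete <- toy[complete.cases(toy[, "ncases", drop = FALSE]), ]
ncases_complete$id
```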

When you have missing data and you are subsetting you have to be more careful. Let’s look at the following:

esona$ncases == 2
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE    NA
## [34] FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [45]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE  TRUE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

See the NA’s in there? So, when you select those rows using the techniques in the “Food Prep” module there is some serious weirdness:

esona[esona$ncases == 2, ]
##      agegp     alcgp    tobgp ncases ncontrols
## 28   35-44      120+ 0-9g/day      2         3
## 30   35-44      120+    20-29      2         4
## NA    <NA>      <NA>     <NA>     NA        NA
## NA.1  <NA>      <NA>     <NA>     NA        NA
## 42   45-54    80-119      30+      2         4
## 45   45-54      120+    20-29      2         3
## 47   55-64 0-39g/day 0-9g/day      2        NA
## NA.2  <NA>      <NA>     <NA>     NA        NA
## 65   65-74 0-39g/day    20-29      2         7
## 72   65-74    80-119    20-29      2         3
## 79     75+ 0-39g/day    10-19      2         6
## 81     75+     40-79 0-9g/day      2         5
## 87     75+      120+ 0-9g/day      2         2

Why? Look at these examples:

# a vector selection
x <- c("a", "b", "c")
x[c(NA, 2, 1)]
## [1] NA  "b" "a"
# back to the data frame
esona[c(NA, 2, 1), ]
##    agegp     alcgp    tobgp ncases ncontrols
## NA  <NA>      <NA>     <NA>     NA        NA
## 2  25-34 0-39g/day    10-19      0        10
## 1  25-34 0-39g/day 0-9g/day      0        40

So, when R finds an NA in a vector used as an index, it returns an empty version (in the data.frame case above, a row filled with NA). How do we work around that? Incorporate the is.na test:

# says select ncases that are equal to 2 and not NA
esona[esona$ncases == 2 & !is.na(esona$ncases),]
##    agegp     alcgp    tobgp ncases ncontrols
## 28 35-44      120+ 0-9g/day      2         3
## 30 35-44      120+    20-29      2         4
## 42 45-54    80-119      30+      2         4
## 45 45-54      120+    20-29      2         3
## 47 55-64 0-39g/day 0-9g/day      2        NA
## 65 65-74 0-39g/day    20-29      2         7
## 72 65-74    80-119    20-29      2         3
## 79   75+ 0-39g/day    10-19      2         6
## 81   75+     40-79 0-9g/day      2         5
## 87   75+      120+ 0-9g/day      2         2

That works because !is.na(esona$ncases) is FALSE wherever ncases is NA, and NA & FALSE evaluates to FALSE rather than NA, so those rows are not included in the selection. subset does not have this problem, but remember you cannot assign directly to a subset so you’ll need the above syntax sometimes.

subset(esona, ncases == 2)
##    agegp     alcgp    tobgp ncases ncontrols
## 28 35-44      120+ 0-9g/day      2         3
## 30 35-44      120+    20-29      2         4
## 42 45-54    80-119      30+      2         4
## 45 45-54      120+    20-29      2         3
## 47 55-64 0-39g/day 0-9g/day      2        NA
## 65 65-74 0-39g/day    20-29      2         7
## 72 65-74    80-119    20-29      2         3
## 79   75+ 0-39g/day    10-19      2         6
## 81   75+     40-79 0-9g/day      2         5
## 87   75+      120+ 0-9g/day      2         2

Why? Because subset internally uses syntax similar to what we used above; its help page says that missing values are taken as false!
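To make the logic concrete, here is a small sketch on a made-up data frame toy (hypothetical). It shows the two logical rules involved, the which() function as another way to drop NAs from a selection, and an assignment of the kind subset cannot do. Recoding 2 to 20 is arbitrary and only for illustration:

```r
# toy is a made-up data frame used only for illustration
toy <- data.frame(
  id = 1:5,
  ncases = c(2, NA, 2, 4, NA)
)

# the logical rules that make the !is.na() guard work
NA & FALSE   # FALSE
NA & TRUE    # NA

# which() returns only the positions that are TRUE, so NAs are dropped
hits <- which(toy$ncases == 2)
toy[hits, ]

# assignment needs an index, not subset(); here 2 is recoded to 20
toy$ncases[toy$ncases == 2 & !is.na(toy$ncases)] <- 20
toy$ncases
```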

Other special values

You met a lot of these in the very first module:

5/0
## [1] Inf
-5/0
## [1] -Inf
0/0
## [1] NaN

NaN means “not a number”, Inf means “positive infinity”, and -Inf means “negative infinity”. They are almost always seen because you are dividing by zero, usually unintentionally. You can test for these values which can help you find bugs in your programs.

is.nan(0/0)
## [1] TRUE
is.infinite(1/0)
## [1] TRUE
is.finite(1/0)
## [1] FALSE
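These tests also work element by element on vectors, which is where they help with bug hunting. A minimal sketch with made-up numerators and denominators (num and den are hypothetical names); note that is.na is TRUE for NaN as well as NA, while is.nan is TRUE only for NaN:

```r
# made-up numerators and denominators for illustration
num <- c(1, 0, 5, NA)
den <- c(2, 0, 0, 1)
ratio <- num / den
ratio                  # 0.5 NaN Inf NA

is.finite(ratio)       # TRUE only for ordinary numbers
is.nan(ratio)          # TRUE only for NaN
is.na(ratio)           # TRUE for both NaN and NA

# keep only the usable values
clean <- ratio[is.finite(ratio)]
```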

Finally, the NULL value is a marker for “nothing.” Not nothing as in something that ought to be there but is not (e.g., missing), but literally, something that does not or should not exist in the realm of R. Assigning NULL to a column is a good way to remove it from a data.frame:

esocc$ncases <- NULL
head(esocc)
##   agegp     alcgp    tobgp ncontrols
## 1 25-34 0-39g/day 0-9g/day        40
## 2 25-34 0-39g/day    10-19        10
## 3 25-34 0-39g/day    20-29         6
## 4 25-34 0-39g/day      30+         5
## 5 25-34     40-79 0-9g/day        27
## 6 25-34     40-79    10-19         7
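The difference between NULL and NA shows up clearly in lengths. A short sketch (the data frame toy is made up for illustration):

```r
length(NULL)            # 0: NULL holds nothing at all
length(NA)              # 1: NA is a value, just an unknown one
is.null(NULL)           # TRUE

# NULL vanishes when combined into a vector; NA keeps its slot
with_null <- c(1, NULL, 3)
with_na <- c(1, NA, 3)
length(with_null)       # 2
length(with_na)         # 3

# removing a column from a made-up data frame
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)
toy$b <- NULL
names(toy)
```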

Other useful set and subsetting functions

duplicated finds duplicated values. It also works to find duplicated rows in data.frames. unique will extract the values that are, well, unique!

x <- c(4, 8, 0, 1, 8, 9, 10, 4, 2)
duplicated(x)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
unique(x)
## [1]  4  8  0  1  9 10  2
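The data.frame case mentioned above might look like the following sketch. The data frame toy is made up for illustration; the first part just confirms how duplicated and unique relate on the vector x:

```r
x <- c(4, 8, 0, 1, 8, 9, 10, 4, 2)

# dropping the flagged elements gives the same result as unique()
same_as_unique <- identical(x[!duplicated(x)], unique(x))

# made-up data frame with one repeated row
toy <- data.frame(
  id = c(1, 2, 2, 3),
  grp = c("a", "b", "b", "a")
)
dup_rows <- duplicated(toy)   # FALSE FALSE TRUE FALSE
deduped <- toy[!dup_rows, ]   # keeps the first copy of each row
nrow(deduped)
```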

These functions allow you to make selections based on “set” operations: union joins two vectors together (without duplication), intersect finds the values common to two vectors, and setdiff finds the values in the first vector that are not in the second.

union(c("a", "b", "c"), c("b", "c", "d", "f"))
## [1] "a" "b" "c" "d" "f"
intersect(c("a", "b", "c"), c("b", "c", "d"))
## [1] "b" "c"
setdiff(c(1, 2, 3, 8, 9), c(2, 8))
## [1] 1 3 9
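Two things are worth seeing in code: setdiff depends on the order of its arguments, and the %in% operator (not introduced above, so treat it as an addition) gives you the logical vector behind these selections. A minimal sketch with made-up vectors a and b:

```r
a <- c(1, 2, 3, 8, 9)
b <- c(2, 8, 10)

a_not_b <- setdiff(a, b)    # in a but not in b
b_not_a <- setdiff(b, a)    # in b but not in a: a different answer

# %in% gives the logical vector behind these operations
in_b <- a %in% b
common <- a[in_b]           # same values as intersect(a, b)
only_a <- a[!in_b]          # same values as setdiff(a, b)
```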

Text

Pasting

We have seen paste0 which pastes a series of strings together without any other character between.

paste0(c("a", "b", "c", "d"), 1, c(1, 2))
## [1] "a11" "b12" "c11" "d12"

The more general function is paste. It takes two optional arguments: sep, the separator placed between the elements, and collapse, which, if set, pastes the entire resulting sequence of strings together into one string using that character.

paste(c("a", "b", "c"), 1, c(1, 2), sep = " ")
## [1] "a 1 1" "b 1 2" "c 1 1"
paste(c("a", "b", "c"), 1, c(1, 2), sep = " ", collapse = "-")
## [1] "a 1 1-b 1 2-c 1 1"
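A few more small cases may help separate what sep and collapse each do. This is only a sketch with made-up inputs:

```r
# a single value is recycled against the longer vector
labels <- paste("x", 1:3, sep = "_")
labels                       # "x_1" "x_2" "x_3"

# collapse joins the resulting vector into one string
joined <- paste(c("a", "b", "c"), collapse = "")
joined                       # "abc"

# sep is applied first, then collapse
both <- paste(c("a", "b"), 1:2, sep = "-", collapse = ", ")
both                         # "a-1, b-2"
```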

Substrings

Sometimes you need to select out a piece of a string. The substr function is helpful for fixed-length strings. Note that the [1:3] selects the first three elements of the names of esoph, while the other 1 and 3 tell substr to take the first through third characters of each string. They are NOT related.

names(esoph)[1:3]
## [1] "agegp" "alcgp" "tobgp"
substr(names(esoph)[1:3], 1, 3)
## [1] "age" "alc" "tob"
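When you want characters counted from the end of a string rather than the start, one option is to combine substr with nchar. A sketch, typing the three names in by hand so it runs on its own; the replacement form of substr at the end is an extra not covered above:

```r
vars <- c("agegp", "alcgp", "tobgp")

# nchar() gives each string's length, so the last two characters are:
n <- nchar(vars)
suffix <- substr(vars, n - 1, n)
suffix                         # "gp" "gp" "gp"

# substr() can also be assigned to, replacing characters in place
code <- "abcdef"
substr(code, 1, 3) <- "XYZ"
code                           # "XYZdef"
```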

Regular expressions / Searching

Regular expressions are a very powerful way of finding patterns in text and manipulating them. You could take an entire course on regular expressions so I will only be able to introduce you to some basic aspects.

Let’s start with a list of words:

words <- c("art", "bat", "bet", "bee", "bees", "beet", "believe", "beat", 
           "cat", "car", "cars", "can", "cart", "mississipi",
           "mart", "part", "see", "sat", "set")

We will use the grep function which takes a regular expression and character vector to search on. If you just use a sequence of letters or numbers you will find strings that match that pattern of letters and numbers:

grep("s", words)
## [1]  5 11 14 17 18 19
# to see the matched words
grep("s", words, value = TRUE)
## [1] "bees"       "cars"       "mississipi" "see"        "sat"       
## [6] "set"

Note how each contains an “s”. Here is another example:

grep("ee", words, value = TRUE)
## [1] "bee"  "bees" "beet" "see"

Nothing groundbreaking there… however, some characters have special meanings, such as “.”, which matches any single character. To find words that have an “a” followed by any character:

grep("a.", words, value = TRUE)
##  [1] "art"  "bat"  "beat" "cat"  "car"  "cars" "can"  "cart" "mart" "part"
## [11] "sat"
grep("s.t", words, value = TRUE)
## [1] "sat" "set"

You can put a group of characters in square brackets:

grep("[bs].t", words, value = TRUE)
## [1] "bat" "bet" "sat" "set"

The * character matches zero or more repetitions of the previous expression (character or group):

grep("beliea*ve", words, value = TRUE) 
## [1] "believe"

The + character matches one or more of the previous expression:

grep("beliea+ve", words, value = TRUE) # nothing matches
## character(0)
grep("be+t", words, value = TRUE)
## [1] "bet"  "beet"
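The difference between * and + is easiest to see side by side. A small sketch on made-up test strings (candidates is a hypothetical vector, chosen so that "bt" has zero occurrences of "e"):

```r
# made-up test strings
candidates <- c("bt", "bet", "beet", "bat")

star <- grepl("be*t", candidates)   # zero or more "e"
plus <- grepl("be+t", candidates)   # one or more "e"

candidates[star]   # "bt" "bet" "beet"
candidates[plus]   # "bet" "beet"
```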

Finally you can group things in parentheses:

grep("m(iss)+ipi", words, value = TRUE)
## [1] "mississipi"

I find regular expressions to be very helpful for selecting a series of variables in a data.frame:

grep("gp", names(esoph), value = TRUE)
## [1] "agegp" "alcgp" "tobgp"
head(esoph[, grep("gp", names(esoph), value = TRUE)])
##   agegp     alcgp    tobgp
## 1 25-34 0-39g/day 0-9g/day
## 2 25-34 0-39g/day    10-19
## 3 25-34 0-39g/day    20-29
## 4 25-34 0-39g/day      30+
## 5 25-34     40-79 0-9g/day
## 6 25-34     40-79    10-19

The grepl function is also useful, as it gives a logical vector indicating whether a match is found in each element, and may be useful in some settings where the index or value will not work:

grepl("gp", names(esoph))
## [1]  TRUE  TRUE  TRUE FALSE FALSE
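One such setting is combining a match with another condition, or negating it. A minimal sketch using a shortened, made-up version of the word list:

```r
# a shortened word list, typed in so the example runs on its own
words <- c("art", "bat", "bees", "cart", "see", "sat")

# combine a pattern match with another logical condition
short_a <- words[grepl("a", words) & nchar(words) == 3]
short_a      # "art" "bat" "sat"

# negate the match to keep everything that does not contain "s"
no_s <- words[!grepl("s", words)]
no_s         # "art" "bat" "cart"
```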

There is so much more you could learn about this, but even this basic introduction can take you very far. For more information, I recommend you check out this online tutorial: http://www.zytrax.com/tech/web/regex.htm.

Please note that you can use regular expressions to select out parts of text, rearrange, or manipulate various strings. See the sub and gsub functions for more information.
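As a starting point, here is a brief sketch of sub and gsub. The inputs are made up, and the \\1 and \\2 backreferences in the last line go a little beyond what is introduced above:

```r
# sub() replaces the first match only; gsub() replaces every match
first_only <- sub("s", "S", "mississipi")     # "miSsissipi"
every_one <- gsub("s", "S", "mississipi")     # "miSSiSSipi"

# strip a common suffix from variable names
vars <- c("agegp", "alcgp", "tobgp")
stems <- gsub("gp", "", vars)                 # "age" "alc" "tob"

# parentheses capture groups; \\1 and \\2 refer back to them
swapped <- sub("([a-z]+)-([a-z]+)", "\\2-\\1", "first-second")
swapped                                       # "second-first"
```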

Dates

Dates are often difficult to work with because there are many different formats and methods of converting to a system that the computer can work with.

In R, the default date format is YYYY-MM-DD, i.e., 4 digit year, hyphen, 1–2 digit month, hyphen, and 1–2 digit day. The as.Date() function allows you to take character strings and convert them to dates that the computer can understand.

as.Date("2011-01-12")
## [1] "2011-01-12"

You can also use other formats if you receive data from someone who typed it in differently by providing a format argument:

as.Date("01/12/2011", format="%m/%d/%Y")
## [1] "2011-01-12"
as.Date("12jan2011", "%d%b%Y")
## [1] "2011-01-12"
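The same % codes work in the other direction with format, and getting the codes wrong fails quietly rather than loudly. A sketch with made-up inputs; only numeric codes are used so the results do not depend on your locale:

```r
d <- as.Date("2011-01-12")

# format() turns a Date back into text using the same % codes
day_first <- format(d, "%d/%m/%Y")    # "12/01/2011"
year_only <- format(d, "%Y")          # "2011"

# with an explicit format, text that cannot be parsed becomes NA
bad <- as.Date("13/45/2011", format = "%m/%d/%Y")
is.na(bad)                            # TRUE

# day and month order matters: the same text, two different dates
us_style <- as.Date("01/12/2011", format = "%m/%d/%Y")   # 12 January 2011
eu_style <- as.Date("01/12/2011", format = "%d/%m/%Y")   # 1 December 2011
```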

For more information, read the help file for strptime. The difftime function allows you to calculate intervals between dates in various units:

difftime(as.Date("2011-1-14"),as.Date("2011-1-12"))
## Time difference of 2 days
difftime(as.Date("2011-1-14"),as.Date("2011-1-12"),units="hours")
## Time difference of 48 hours
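Dates also support ordinary arithmetic, which is often all you need. A minimal sketch (the variable names start, end, and later are made up for illustration):

```r
start <- as.Date("2011-01-12")
end <- as.Date("2011-01-14")

# subtracting two Dates gives a difftime in days
gap <- end - start

# as.numeric() strips the units when you need a plain number
gap_days <- as.numeric(gap)                                     # 2
gap_weeks <- as.numeric(difftime(end, start, units = "weeks"))  # 2/7

# adding a number to a Date moves it forward by that many days
later <- start + 30                                             # "2011-02-11"
```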

Explore and Extend

Evaluate

Pick a data set of your own choosing (perhaps this is an opportunity to test drive the data you want to use for your final project?). Use one of the text or date functions to do something you feel is useful to your data. Explain why and demonstrate it in a knitted .Rmd file.