We’ve already seen the missing data value: NA. This special value allows R to handle missing data gracefully. However, just like in other programming languages, you sometimes have to be careful how you handle NA.
NA
## [1] NA
is.na(NA) # why is.na vs. is.NA? poor choice IMO.
## [1] TRUE
is.na(2)
## [1] FALSE
myvec <- c(1, 2, NA, 3)
is.na(myvec)
## [1] FALSE FALSE TRUE FALSE
mean(myvec)
## [1] NA
What happened? Why is the mean of myvec NA? The mean function does this intentionally: the vector is missing data, and at some level it is unclear what you want to do about that, so by default the result is NA to make sure you notice. Under most circumstances you just want to remove the missing values and calculate the mean on what you have:
mean(myvec, na.rm = TRUE)
## [1] 2
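The mean function is not the only one that behaves this way; other summary functions such as sum react to missing values the same way and take the same na.rm argument (a quick check on the same vector):
sum(myvec)
## [1] NA
sum(myvec, na.rm = TRUE)
## [1] 6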
This is also important when dealing with data.frames. Let’s inject some missing data into esoph.
set.seed(596)
esona <- esoph
esona[sample(NROW(esona), 3), "ncases"] <- NA
esona[sample(NROW(esona), 3), "ncontrols"] <- NA
summary(esona)
## agegp alcgp tobgp ncases ncontrols
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 1.00
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 3.00
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 6.00
## 55-64:16 120+ :21 30+ :20 Mean : 2.259 Mean :10.62
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00
## 75+ :11 Max. :17.000 Max. :60.00
## NA's :3 NA's :3
Note that the summary function now tabulates three NA’s in each of those columns, and it ignores them when computing the summary statistics. However, if you try to take the mean of the ncases column without specifying na.rm, you’ll get NA again, just like above:
mean(esona$ncases)
## [1] NA
mean(esona$ncases, na.rm = TRUE)
## [1] 2.258824
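If you want a quick count of the missing values in each column (rather than reading them off the summary output), one approach is to combine is.na, which also works on an entire data.frame, with colSums:
colSums(is.na(esona))
## agegp alcgp tobgp ncases ncontrols 
##     0     0     0      3         3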
You may be interested in which rows are complete cases and how many of them there are:
complete.cases(esona)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [34] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
sum(complete.cases(esona))
## [1] 82
# how many incomplete
sum(!complete.cases(esona))
## [1] 6
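If you also want to know which rows those are, wrap the test in which to get the row numbers directly:
which(!complete.cases(esona))
## [1]  9 33 35 47 52 61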
You can also extract the data.frame where every row has complete data (i.e., the complete cases):
esocc <- na.omit(esona)
NROW(esocc)
## [1] 82
When you have missing data and you are subsetting, you have to be more careful. Let’s look at the following:
esona$ncases == 2
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE NA
## [34] FALSE NA FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [45] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE TRUE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
See the NA’s in there? So, when you select those rows using the techniques from the “Food Prep” module, there is some serious weirdness:
esona[esona$ncases == 2, ]
## agegp alcgp tobgp ncases ncontrols
## 28 35-44 120+ 0-9g/day 2 3
## 30 35-44 120+ 20-29 2 4
## NA <NA> <NA> <NA> NA NA
## NA.1 <NA> <NA> <NA> NA NA
## 42 45-54 80-119 30+ 2 4
## 45 45-54 120+ 20-29 2 3
## 47 55-64 0-39g/day 0-9g/day 2 NA
## NA.2 <NA> <NA> <NA> NA NA
## 65 65-74 0-39g/day 20-29 2 7
## 72 65-74 80-119 20-29 2 3
## 79 75+ 0-39g/day 10-19 2 6
## 81 75+ 40-79 0-9g/day 2 5
## 87 75+ 120+ 0-9g/day 2 2
Why? Look at these examples:
# a vector selection
x <- c("a", "b", "c")
x[c(NA, 2, 1)]
## [1] NA "b" "a"
# back to the data frame
esona[c(NA, 2, 1), ]
## agegp alcgp tobgp ncases ncontrols
## NA <NA> <NA> <NA> NA NA
## 2 25-34 0-39g/day 10-19 0 10
## 1 25-34 0-39g/day 0-9g/day 0 40
So, when R finds an NA in a vector used for indexing, it returns an empty element (in the data.frame case above, an empty row). How do we work around that? Incorporate an is.na test:
# says select ncases that are equal to 2 and not NA
esona[esona$ncases == 2 & !is.na(esona$ncases),]
## agegp alcgp tobgp ncases ncontrols
## 28 35-44 120+ 0-9g/day 2 3
## 30 35-44 120+ 20-29 2 4
## 42 45-54 80-119 30+ 2 4
## 45 45-54 120+ 20-29 2 3
## 47 55-64 0-39g/day 0-9g/day 2 NA
## 65 65-74 0-39g/day 20-29 2 7
## 72 65-74 80-119 20-29 2 3
## 79 75+ 0-39g/day 10-19 2 6
## 81 75+ 40-79 0-9g/day 2 5
## 87 75+ 120+ 0-9g/day 2 2
That works because any element whose ncases is NA evaluates to FALSE in the combined condition rather than NA, so it is not included in the selection. The subset function does not have this problem, but remember that you cannot assign directly to a subset, so you’ll sometimes need the syntax above (see the assignment sketch after the subset example below).
subset(esona, ncases == 2)
## agegp alcgp tobgp ncases ncontrols
## 28 35-44 120+ 0-9g/day 2 3
## 30 35-44 120+ 20-29 2 4
## 42 45-54 80-119 30+ 2 4
## 45 45-54 120+ 20-29 2 3
## 47 55-64 0-39g/day 0-9g/day 2 NA
## 65 65-74 0-39g/day 20-29 2 7
## 72 65-74 80-119 20-29 2 3
## 79 75+ 0-39g/day 10-19 2 6
## 81 75+ 40-79 0-9g/day 2 5
## 87 75+ 120+ 0-9g/day 2 2
Why? Because, if you read the help for subset, it says it uses similar logic internally to what we used above!
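One more wrinkle: when you assign into a selection, the !is.na() guard is required rather than just tidy, because R refuses NA values in a subscripted assignment. Here is a small sketch on a throwaway copy (the esotmp name and the recoding are made up purely for illustration):
# hypothetical recode: set ncontrols to 0 wherever ncases equals 2
esotmp <- esona
esotmp$ncontrols[esotmp$ncases == 2 & !is.na(esotmp$ncases)] <- 0
# dropping the !is.na() piece makes R stop with an error along the lines of
# "NAs are not allowed in subscripted assignments"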
You met several other special values in the very first module:
5/0
## [1] Inf
-5/0
## [1] -Inf
0/0
## [1] NaN
NaN means “not a number”, Inf means “positive infinity”, and -Inf means “negative infinity”. They almost always appear because you divided by zero, usually unintentionally. You can test for these values, which can help you find bugs in your programs.
is.nan(0/0)
## [1] TRUE
is.infinite(1/0)
## [1] TRUE
is.finite(1/0)
## [1] FALSE
Finally, the NULL value is a marker for “nothing.” Not nothing as in something that ought to be there but is missing, but literally something that does not or should not exist in the realm of R. It is a good way to remove a column from a data.frame:
esocc$ncases <- NULL
head(esocc)
## agegp alcgp tobgp ncontrols
## 1 25-34 0-39g/day 0-9g/day 40
## 2 25-34 0-39g/day 10-19 10
## 3 25-34 0-39g/day 20-29 6
## 4 25-34 0-39g/day 30+ 5
## 5 25-34 40-79 0-9g/day 27
## 6 25-34 40-79 10-19 7
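You can confirm the column really is gone (or test any object for NULL) with is.null:
is.null(esocc$ncases)
## [1] TRUE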
duplicated finds duplicated values. It also works to find duplicated rows in data.frames. unique will extract the values that are, well, unique!
x <- c(4, 8, 0, 1, 8, 9, 10, 4, 2)
duplicated(x)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
unique(x)
## [1] 4 8 0 1 9 10 2
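Here is a small sketch of both functions on a data.frame; the two-column df below is made up purely for illustration, with the third row repeating the first:
df <- data.frame(id = c(1, 2, 1), grp = c("a", "b", "a"))
duplicated(df)
## [1] FALSE FALSE  TRUE
unique(df)
##   id grp
## 1  1   a
## 2  2   b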
The set-operation functions allow you to make selections based on “set” logic: union joins two vectors together (without duplication), intersect finds the values common to both vectors, and setdiff finds the values in the first vector that are not in the second.
union(c("a", "b", "c"), c("b", "c", "d", "f"))
## [1] "a" "b" "c" "d" "f"
intersect(c("a", "b", "c"), c("b", "c", "d"))
## [1] "b" "c"
setdiff(c(1, 2, 3, 8, 9), c(2, 8))
## [1] 1 3 9
We have seen paste0, which pastes a series of strings together with no other character between them.
paste0(c("a", "b", "c", "d"), 1, c(1, 2))
## [1] "a11" "b12" "c11" "d12"
The more general function is paste. It takes two additional options: sep, the separator placed between the elements, and collapse, which, if set, pastes the entire resulting vector of strings together into one string using that separator.
paste(c("a", "b", "c"), 1, c(1, 2), sep = " ")
## [1] "a 1 1" "b 1 2" "c 1 1"
paste(c("a", "b", "c"), 1, c(1, 2), sep = " ", collapse = "-")
## [1] "a 1 1-b 1 2-c 1 1"
Sometimes you need to select out a piece of a string. The substr function is helpful for fixed-length strings. Note that the [1:3] below selects the first three elements of the names of esoph, while the other 1 and 3 tell substr to take the first through third characters of each string. They are NOT related.
names(esoph)[1:3]
## [1] "agegp" "alcgp" "tobgp"
substr(names(esoph)[1:3], 1, 3)
## [1] "age" "alc" "tob"
Regular expressions are a very powerful way of finding patterns in text and manipulating them. You could take an entire course on regular expressions so I will only be able to introduce you to some basic aspects.
Let’s start with a list of words:
words <- c("art", "bat", "bet", "bee", "bees", "beet", "believe", "beat",
"cat", "car", "cars", "can", "cart", "mississipi",
"mart", "part", "see", "sat", "set")
We will use the grep function, which takes a regular expression and a character vector to search. If you just use a sequence of letters or numbers, it will find strings that contain that exact sequence:
grep("s", words)
## [1] 5 11 14 17 18 19
# to see the matched words
grep("s", words, value = TRUE)
## [1] "bees" "cars" "mississipi" "see" "sat"
## [6] "set"
Note how each contains an “s”. Here is another example:
grep("ee", words, value = TRUE)
## [1] "bee" "bees" "beet" "see"
Nothing groundbreaking there… however, some characters have special meanings, such as “.”, which matches any single character. To find words that have an “a” followed by any character:
grep("a.", words, value = TRUE)
## [1] "art" "bat" "beat" "cat" "car" "cars" "can" "cart" "mart" "part"
## [11] "sat"
grep("s.t", words, value = TRUE)
## [1] "sat" "set"
You can put a group of characters in square brackets to match any one of them:
grep("[bs].t", words, value = TRUE)
## [1] "bat" "bet" "sat" "set"
The * character matches zero or more occurrences of the previous expression (character or group):
grep("beliea*ve", words, value = TRUE)
## [1] "believe"
The + character matches one or more occurrences of the previous expression:
grep("beliea+ve", words, value = TRUE) # nothing matches
## character(0)
grep("be+t", words, value = TRUE)
## [1] "bet" "beet"
Finally you can group things in parentheses:
grep("m(iss)+ipi", words, value = TRUE)
## [1] "mississipi"
I find regular expressions to be very helpful for selecting a series of variables in a data.frame:
grep("gp", names(esoph), value = TRUE)
## [1] "agegp" "alcgp" "tobgp"
head(esoph[, grep("gp", names(esoph), value = TRUE)])
## agegp alcgp tobgp
## 1 25-34 0-39g/day 0-9g/day
## 2 25-34 0-39g/day 10-19
## 3 25-34 0-39g/day 20-29
## 4 25-34 0-39g/day 30+
## 5 25-34 40-79 0-9g/day
## 6 25-34 40-79 10-19
The grepl function is also useful: it returns a logical vector indicating whether a match was found in each element, and it may help in settings where an index or the matched value will not work:
grepl("gp", names(esoph))
## [1] TRUE TRUE TRUE FALSE FALSE
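For example, a logical vector works just as well as numeric indices for picking off columns, so the selection above could also be written as:
# gives the same three columns as the grep(value = TRUE) version above
head(esoph[, grepl("gp", names(esoph))])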
There is so much more you could learn about this, but even this basic introduction can take you very far. For more information, I recommend you check out this online tutorial: http://www.zytrax.com/tech/web/regex.htm.
Please note that you can also use regular expressions to select out parts of text, rearrange them, or otherwise manipulate strings. See the sub and gsub functions for more information.
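As a small, made-up illustration (the replacement text is arbitrary), gsub can rewrite the “gp” suffix on the esoph column names:
gsub("gp", "_group", names(esoph))
## [1] "age_group" "alc_group" "tob_group" "ncases"    "ncontrols"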
Dates are often difficult to work with because there are many different formats and methods of converting to a system that the computer can work with.
In R, the default date format is YYYY-MM-DD, i.e., 4-digit year, hyphen, 1–2 digit month, hyphen, and 1–2 digit day. The as.Date() function allows you to take character strings and convert them to dates that the computer can understand.
as.Date("2011-01-12")
## [1] "2011-01-12"
If you receive data from someone who typed the dates in differently, you can handle other formats by providing a format argument:
as.Date("01/12/2011", format="%m/%d/%Y")
## [1] "2011-01-12"
as.Date("12jan2011", "%d%b%Y")
## [1] "2011-01-12"
For more information, read the help file for strptime. The difftime function allows you to calculate various intervals between dates:
difftime(as.Date("2011-1-14"),as.Date("2011-1-12"))
## Time difference of 2 days
difftime(as.Date("2011-1-14"),as.Date("2011-1-12"),units="hours")
## Time difference of 48 hours
From the esona data.frame created above, select the rows where ncases equals 3 and is not NA (without using subset).
For regular expressions, the ^ is used to represent the beginning of a string and $ is used to represent the end of a string. Use these to select the words starting with b from words above. Next, try to get only those words that end with s using one of them.
Presuming that the string "12Oct1994" represents the date October 12, 1994, read the appropriate help file discussed above and convert that string to a date.
How many days have you been alive?
Pick a data set of your own choosing (perhaps this is an opportunity to test drive the data you want to use for your final project?). Use one of the text or date functions to do something you feel is useful to your data. Explain why and demonstrate it in a knitted .Rmd file.