Food Prep
Exposition
Introduction
This module focuses on several common data manipulation tasks: selecting, subsetting, concatenating, merging, and sorting. Some you have already seen before, but now is a good time for review and reinforcement. For example, recall that when you want to select a part of a vector, you use square brackets, []
, placing between them a numeric vector that lists the elements you want to extract.
Let’s get started!
Subsetting and combining vectors
R has a built-in vector that contains the capital letters: LETTERS
. Let’s try a few things with that vector. First just type, LETTERS
to see what is inside.
LETTERS
LETTERS
So, how would you get the 10th letter from LETTERS
?
LETTERS[10]
10] LETTERS[
How about selecting the 15th through the 22nd letters? Don’t forget about the :
operator!
15:22 is a part of the answer
LETTERS[15:22]
15:22] LETTERS[
If I wanted the odd letters, I could do this:
LETTERS[c(1,3,5,7,9,11,13,15,17,19,21,23,25)]
But that’s a lot of work. Anytime you see something like this in your code you should try to figure out a way to do it more automatically. There is a function called seq
that creates sequences of numbers.
Type:
seq(1, 26, by = 2)
to see what it does.
seq(1, 26, by = 2)
seq(1, 26, by = 2)
Before proceeding, take a look at the help for seq
since it is a very useful function.
Now use seq
to get the odd letters from LETTERS
.
Does LETTERS[seq(…)] as a suggestion help?
Any of the following will work:
LETTERS[seq(1, length.out = 13, by = 2)]
LETTERS[seq(to = 25, by = 2)]
LETTERS[seq(to = 26, by = 2)]
LETTERS[seq(1, 25, by = 2)]
LETTERS[seq(1, 26, by = 2)]
seq(1, length.out = 13, by = 2)]
LETTERS[seq(to = 25, by = 2)]
LETTERS[seq(to = 26, by = 2)]
LETTERS[seq(1, 25, by = 2)]
LETTERS[seq(1, 26, by = 2)] LETTERS[
Now get the even letters from LETTERS
also using seq
.
With what number should you start?
LETTERS[seq(2, length.out = 13)]
LETTERS[seq(2, 26, by = 2)]
seq(2, length.out = 13)]
LETTERS[seq(2, 26, by = 2)] LETTERS[
Another good function named rep
, which is technically a mnemonic for “replicate” but I typically remember it as “repeat”. Try rep(5, 10)
.
rep(5, 10)
rep(5, 10)
Again, such a useful function that you should take a look at the help for it.
Now that you’ve looked through the help, let’s see more of the flexibility of rep
. Here are some examples. Based on what you read in the help, see if you can figure out what each one will do before you run it. If any do not make sense, make a note to ask about it in class.
rep(c(1, 2), 5)
rep(c(1, 2), 5)
rep(c(1, 2), 5)
Next try:
rep(c(1, 2), each = 5)
rep(c(1, 2), each = 5)
rep(c(1, 2), each = 5)
Next try:
rep(1:3, 3:1)
rep(1:3, 3:1)
rep(1:3, 3:1)
And there is nothing to stop you from getting the same element of a vector more than once - try:
LETTERS[rep(1:3, 3:1)]
LETTERS[rep(1:3, 3:1)]
rep(1:3, 3:1)] LETTERS[
Want to get rid of something? Use negative numbers:
LETTERS[-26:-21]
LETTERS[-26:-21]
-26:-21] LETTERS[
So this is another way to get the even numbers!
LETTERS[-seq(1, 26, by = 2)]
LETTERS[-seq(1, 26, by = 2)]
-seq(1, 26, by = 2)] LETTERS[
Remember that R returns the values where they are TRUE
when you use a logical vector to subset a vector.
If I’ve created a vector n5
that contains the numbers 1 through 5 like this:
Think about what you will get if you type:
n5[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
And then try it to see if you get the result you expected.
n5[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
c(TRUE, FALSE, TRUE, TRUE, FALSE)] n5[
Because R “recycles” a vector used for subsetting until the length is the same as the longer vector (remember that from our early lessons?) you can get the even letters this way!
LETTERS[c(FALSE, TRUE)]
LETTERS[c(FALSE, TRUE)]
c(FALSE, TRUE)] LETTERS[
Now use the same technique for the odd letters! How?
LETTERS[c(TRUE, FALSE)]
c(TRUE, FALSE)] LETTERS[
The vector letsamp
contains a vector of 100 random letters. Type letsamp
to take a look at it.
letsamp
letsamp
Use ==
to test which letters are P.
Did you remember the quotes around “P”?
letsamp == "P"
== "P" letsamp
Because R treats TRUE
as equal to 1
when a logical vector is converted to a numeric vector you can use the sum
function to count the P’s like this:
sum(letsamp == "P")
sum(letsamp == "P")
sum(letsamp == "P")
A very useful operator is the %in%
operator, which tests if something is “in” the vector, and which we saw in the last module.
Use it to “mind your Ps and Qs” like this:
letsamp[letsamp %in% c("P", "Q")]
letsamp[letsamp %in% c("P", "Q")]
%in% c("P", "Q")] letsamp[letsamp
Finally, don’t forget that you can use c
to concatenate multiple vectors together in very flexible ways. Try:
c(LETTERS[1:3], c("H", "I"), letsamp[letsamp %in% c("P", "Q")])
c(LETTERS[1:3], c("H", "I"), letsamp[letsamp %in% c("P", "Q")])
c(LETTERS[1:3], c("H", "I"), letsamp[letsamp %in% c("P", "Q")])
Subsetting and merging data frames
Ok, let’s turn to the doing the same sort of tasks with data.frame
as with did with various vector
types. I’ve pre-loaded a few data.frame
’s for you. The first is called cholera_merge
. Let’s look at its first 6 rows:
cholera_merge[1:6, ]
cholera_merge[1:6, ]
1:6, ] cholera_merge[
Use a logical operator to get certain rows:
cholera_merge[cholera_merge$area == 4, ]
cholera_merge[cholera_merge$area == 4, ]
$area == 4, ] cholera_merge[cholera_merge
You can use subset
to seemingly do the same thing:
subset(cholera_merge, area == 4)
subset(cholera_merge, area == 4)
subset(cholera_merge, area == 4)
However, you will run into a problem if you try to do something like:
subset(cholera_merge, area == 4)$area <- 6
It won’t work. So don’t type it. Instead, you have to do this:
cholera_merge[cholera_merge$area == 4, ]$area <- 6
cholera_merge[cholera_merge$area == 4, ]$area <- 6
$area == 4, ]$area <- 6 cholera_merge[cholera_merge
Verify that all the area 4 are now area 6…
subset(cholera_merge, area == 4)
subset(cholera_merge, area == 4)
subset(cholera_merge, area == 4)
… and …
subset(cholera_merge, area == 6)
subset(cholera_merge, area == 6)
subset(cholera_merge, area == 6)
Appears it worked. Now it is your turn to put cholera_merge
back the way it was using the pattern of test and assignment I showed you above.
cholera_merge[cholera_merge$area == 6, ]$area <- 4
$area == 6, ]$area <- 4 cholera_merge[cholera_merge
You can combine multiple conditions using &
and |
, for example, try:
cholera_merge[cholera_merge$sex == "F" & cholera_merge$area == 3, ]
cholera_merge[cholera_merge$sex == "F" & cholera_merge$area == 3, ]
$sex == "F" & cholera_merge$area == 3, ] cholera_merge[cholera_merge
Columns work the same as rows (I used both together to keep the output short). Try:
cholera_merge[1:3, 3:4]
cholera_merge[1:3, 3:4]
1:3, 3:4] cholera_merge[
Another example:
cholera_merge[1:3, c("age", "sex")]
cholera_merge[1:3, c("age", "sex")]
1:3, c("age", "sex")] cholera_merge[
And another:
cholera_merge[cholera_merge$area == 4, c("age", "sex")]
cholera_merge[cholera_merge$area == 4, c("age", "sex")]
$area == 4, c("age", "sex")] cholera_merge[cholera_merge
The function paste0
pastes together character vectors without any space between them. It’ll come in handy in a moment. Try it:
paste0("a", "b", "c")
It is 0
(zero) not O
(oh)
paste0("a", "b", "c")
paste0("a", "b", "c")
Works with vectors as one of the arguments also. Try:
paste0("cholera", 1:5)
paste0("cholera", 1:5)
paste0("cholera", 1:5)
Another useful function is assign
that has the same effect as <-
but works as a function with a character vector. It allows us to programmatically specify the name using a character vector which is very helpful. Try:
assign("x", 1:3)
assign("x", 1:3)
assign("x", 1:3)
Check that x
is what you think it should be.
x
x
Let’s divide up cholera_merge
into smaller datasets by area. With assign
and paste0
and the subset methods we’ve been studying. We can do it with a loop like the following. Study it carefully, make sure you understand it, and then run it.
for(i in 1:5) {
assign(paste0("cholera", i), cholera_merge[cholera_merge$area == i, ])
}
for(i in 1:5) {
assign(paste0("cholera", i), cholera_merge[cholera_merge$area == i, ])
}
Now there are five datasets named cholera1
, cholera2
, \(\dots\), cholera5
each containing one area’s data.
See if cholera1
is what you think it should be.
cholera1
cholera1
You can use the function rbind
to put these five datasets back together as c_m
like this:1
c_m <- rbind(cholera1, cholera2, cholera3, cholera4, cholera5)
c_m <- rbind(cholera1, cholera2, cholera3, cholera4, cholera5)
<- rbind(cholera1, cholera2, cholera3, cholera4, cholera5) c_m
Check it looks put back together with head(c_m)
.
head(c_m)
head(c_m)
NROW
tells you how many rows a dataset has. Does c_m
have the same number as cholera_merge
?
NROW(c_m) == NROW(cholera_merge)
NROW(c_m) == NROW(cholera_merge)
NROW(c_m) == NROW(cholera_merge)
Not a perfect check, but looks good enough to convince us that we’ve probably reconstructed the data correctly. Now, let’s say we wanted our data in a different format. Try this:
(threecho <- cbind(cholera1, cholera2, cholera3))
(threecho <- cbind(cholera1, cholera2, cholera3))
<- cbind(cholera1, cholera2, cholera3)) (threecho
See a potential problem? The names of the columns are not unique. That’s going to be an issue if you try to do some of the things we talked about because R won’t know which one you want. You can examine the names of a dataset like this:
names(threecho)
names(threecho)
names(threecho)
If you can’t seem to get it to work, start again at Exercise 42.
And you can change them with an assignment like this:
names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))
Try different parts from inside out before running it all if you can’t immediately see what this does.
names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))
names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))
If you can’t seem to get it to work, start again at Exercise 42.
Now take a look at threecho
again.
threecho
threecho
If you can’t seem to get it to work, start again at Exercise 42.
Pretty spiffy, eh? Now let’s turn to merging. This is usually what you want to do instead of cbind
because order is not what matters, but making sure you are matching on the common variable(s) between the two datasets. Look at the head of another dataset you have available in this module: cholera_disease
head(cholera_disease)
head(cholera_disease)
Since cholera_disease
shares the subject
variable with cholera_merge
you can put the two together like this:
cmer <- merge(cholera_merge, cholera_disease)
cmer <- merge(cholera_merge, cholera_disease)
<- merge(cholera_merge, cholera_disease) cmer
Now check out the first 10 rows of cmer
like this (the second argument to head specifies how many rows you want):
head(cmer, 10)
head(cmer, 10)
head(cmer, 10)
R automatically determines the common columns by finding the columns with the same name in each data.frame
, but what if no columns have the same name? Let’s change the names of cholera_disease
so none match cholera_merge
.
names(cholera_disease) <- c("id", "disease")
names(cholera_disease) <- c("id", "disease")
names(cholera_disease) <- c("id", "disease")
And merge:
cmer2 <- merge(cholera_merge, cholera_disease)
cmer2 <- merge(cholera_merge, cholera_disease)
<- merge(cholera_merge, cholera_disease) cmer2
How many rows does cmer
have? (Not cmer2
, but cmer{.r}.)
NROW
is the function you need.
NROW(cmer)
NROW(cmer)
That should make sense because cholera_merge
had 25 rows and each one matched up with one in cholera_disease
. But how many rows does cmer2
have?
NROW(cmer2)
NROW(cmer2)
So what happened? What is the square root of 625? (remember the sqrt
function?)
sqrt(625)
sqrt(625)
What happened? Since there were no common columns R created the merge by combining every observation in the first data.frame
with each observation in the second: \(25 \times 25 = 625\). Let’s inform R what columns we want to match on by using some optional arguments of merge
(definitely take a look at merge
’s help at some point). The by.x
argument refers to the first data.frame
’s variables and the by.y
argument refers to the second data.frame
’s variables in merge
’s arguments (when you look at the help you’ll see that the first argument is named x
and the second is y
).
Now merge again like this instead:
NROW(merge(cholera_merge, cholera_disease, by.x = "subject", by.y = "id"))
NROW(merge(cholera_merge, cholera_disease, by.x = "subject", by.y = "id"))
NROW(merge(cholera_merge, cholera_disease, by.x = "subject", by.y = "id"))
Looks like we have an expected, correct merge again; indeed you do, and I’ll let you convince yourself later if you need to. What if, instead, all the rows are not in both datasets? Let’s create a small sample (only 5 rows) from the original cholera_disease
dataset.
In the code below, the set.seed
function makes sure you get the same random sample each time you run the code. Let’s talk more in class about how random sampling works in computers. Then, we use sample
to randomly select 5 row numbers from one to the number of rows in cholera_disease
obtained by the NROW
function. Finally we use those row numbers to subset cholera_disease
to create cdsamp
.
set.seed(596)
cdsamp <- cholera_disease[sample(NROW(cholera_disease), 5), ]
set.seed(596)
<- cholera_disease[sample(NROW(cholera_disease), 5), ] cdsamp
Merge it with cholera_merge like this:
merge(cholera_merge, cdsamp)
merge(cholera_merge, cdsamp)
merge(cholera_merge, cdsamp)
Holy disappearing rows, Batman! Looks like merge
only keeps the matches. By default, that’s true, so be careful and be sure to specify another option if you want to keep all the rows of one of the two datasets (or both). Use the all.x = TRUE
argument to keep all the rows of the first data.frame
, all.y = TRUE
to keep all the rows of the second data.frame
, or all = TRUE
to keep all rows of both. As an example, try this:
(merall <- merge(cholera_merge, cdsamp, all.x = TRUE))
(merall <- merge(cholera_merge, cdsamp, all.x = TRUE))
<- merge(cholera_merge, cdsamp, all.x = TRUE)) (merall
All the rows from the first data.frame
, i.e., x
, are there with missing data from the second data.frame
, i.e., y
, represented by NA
.
Until you gain more familiarity you should probably merge with all = TRUE
because those NA
’s will save you a lot of heartache when you are merging datasets that you do not realize are incompatible somehow.
Sorting vectors and data frames
OK, next topic for this exercise, i.e., sorting. Sort a vector
like this:
sort(merall$age)
sort(merall$age)
sort(merall$age)
It is a little trickier if you want to reorder a data.frame
by one of its variables. For that you need the order function, try it:
order(merall$age)
order(merall$age)
order(merall$age)
What in the world is going on here? Look at merall$age
and compare to the output you just got.
merall$age
$age merall
OK, what is the 2nd value of merall$age
? The 11th? The 18th? The 6th? Maybe you should write them down. Do you see how order tells you which values to take to sort merall$age
?
If fact, you can sort merall$age
like this:
merall$age[order(merall$age)]
merall$age[order(merall$age)]
$age[order(merall$age)] merall
The order
function is more useful for a data.frame
where it can be used to sort the data.frame by one of its variables. Sort merall
by age:
merall[order(merall$age), ]
merall[order(merall$age), ]
order(merall$age), ] merall[
Can you make heads or tails of it?
OK, last new topic - getting the first and last items of a vector or data.frame. You’ve seen head
which gets the first values of a vector or the first rows of a data.frame
. It is 6 values or rows by default:
head(merall)
head(merall)
head(merall)
Want more or less? Use the second argument:
head(merall, 10)
head(merall, 10)
head(merall, 10)
The tail
function works just like head
but gives you the end rather than the top.
tail(merall, 10)
tail(merall, 10)
tail(merall, 10)
You can use negative numbers to get “all but” from the relevant end. So with head
you get all the rows at the top minus that number of rows.
This will give you all but the last five rows:
head(merall, -5)
head(merall, -5)
head(merall, -5)
And tail
again works like head
- so this is all but the first 15 rows
tail(merall, -15)
tail(merall, -15)
tail(merall, -15)
Evaluation
Submit Your Assignment
Footnotes
A better way to do this exists if you have lots of datasets, but it is tricky to understand at first and we should wait until later to try it.↩︎