Food Prep

Author
Affiliation

Beau B. Bruce, MD, PhD

Emory University

Exposition

Introduction

This module focuses on several common data manipulation tasks: selecting, subsetting, concatenating, merging, and sorting. Some you have already seen before, but now is a good time for review and reinforcement. For example, recall that when you want to select a part of a vector, you use square brackets, [], placing between them a numeric vector that lists the elements you want to extract.

Let’s get started!

Subsetting and combining vectors

R has a built-in vector that contains the capital letters: LETTERS. Let’s try a few things with that vector. First just type, LETTERS to see what is inside.

Note
LETTERS
LETTERS

So, how would you get the 10th letter from LETTERS?

Note
LETTERS[10]
LETTERS[10]

How about selecting the 15th through the 22nd letters? Don’t forget about the : operator!

Note

15:22 is a part of the answer

Note
LETTERS[15:22]
LETTERS[15:22]

If I wanted the odd letters, I could do this:

LETTERS[c(1,3,5,7,9,11,13,15,17,19,21,23,25)]

But that’s a lot of work. Anytime you see something like this in your code you should try to figure out a way to do it more automatically. There is a function called seq that creates sequences of numbers.

Type:

seq(1, 26, by = 2)

to see what it does.

Note
seq(1, 26, by = 2)
seq(1, 26, by = 2)

Before proceeding, take a look at the help for seq since it is a very useful function.

Now use seq to get the odd letters from LETTERS.

Note

Does LETTERS[seq(…)] as a suggestion help?

Note

Any of the following will work:

LETTERS[seq(1, length.out = 13, by = 2)] LETTERS[seq(to = 25, by = 2)] LETTERS[seq(to = 26, by = 2)] LETTERS[seq(1, 25, by = 2)] LETTERS[seq(1, 26, by = 2)]
LETTERS[seq(1, length.out = 13, by = 2)]
LETTERS[seq(to = 25, by = 2)]
LETTERS[seq(to = 26, by = 2)]
LETTERS[seq(1, 25, by = 2)]
LETTERS[seq(1, 26, by = 2)]

Now get the even letters from LETTERS also using seq.

Note

With what number should you start?

Note
LETTERS[seq(2, length.out = 13)] LETTERS[seq(2, 26, by = 2)]
LETTERS[seq(2, length.out = 13)]
LETTERS[seq(2, 26, by = 2)]

Another good function named rep, which is technically a mnemonic for “replicate” but I typically remember it as “repeat”. Try rep(5, 10).

Note
rep(5, 10)
rep(5, 10)

Again, such a useful function that you should take a look at the help for it.

Now that you’ve looked through the help, let’s see more of the flexibility of rep. Here are some examples. Based on what you read in the help, see if you can figure out what each one will do before you run it. If any do not make sense, make a note to ask about it in class.

rep(c(1, 2), 5)

Note
rep(c(1, 2), 5)
rep(c(1, 2), 5)

Next try:

rep(c(1, 2), each = 5)

Note
rep(c(1, 2), each = 5)
rep(c(1, 2), each = 5)

Next try:

rep(1:3, 3:1)

Note
rep(1:3, 3:1)
rep(1:3, 3:1)

And there is nothing to stop you from getting the same element of a vector more than once - try:

LETTERS[rep(1:3, 3:1)]

Note
LETTERS[rep(1:3, 3:1)]
LETTERS[rep(1:3, 3:1)]

Want to get rid of something? Use negative numbers:

LETTERS[-26:-21]

Note
LETTERS[-26:-21]
LETTERS[-26:-21]

So this is another way to get the even numbers!

LETTERS[-seq(1, 26, by = 2)]

Note
LETTERS[-seq(1, 26, by = 2)]
LETTERS[-seq(1, 26, by = 2)]

Remember that R returns the values where they are TRUE when you use a logical vector to subset a vector.

If I’ve created a vector n5 that contains the numbers 1 through 5 like this:

Think about what you will get if you type:

n5[c(TRUE, FALSE, TRUE, TRUE, FALSE)]

And then try it to see if you get the result you expected.

Note
n5[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
n5[c(TRUE, FALSE, TRUE, TRUE, FALSE)]

Because R “recycles” a vector used for subsetting until the length is the same as the longer vector (remember that from our early lessons?) you can get the even letters this way!

LETTERS[c(FALSE, TRUE)]

Note
LETTERS[c(FALSE, TRUE)]
LETTERS[c(FALSE, TRUE)]

Now use the same technique for the odd letters! How?

Note
LETTERS[c(TRUE, FALSE)]
LETTERS[c(TRUE, FALSE)]

The vector letsamp contains a vector of 100 random letters. Type letsamp to take a look at it.

Note
letsamp
letsamp

Use == to test which letters are P.

Note

Did you remember the quotes around “P”?

Note
letsamp == "P"
letsamp == "P"

Because R treats TRUE as equal to 1 when a logical vector is converted to a numeric vector you can use the sum function to count the P’s like this:

sum(letsamp == "P")

Note
sum(letsamp == "P")
sum(letsamp == "P")

A very useful operator is the %in% operator, which tests if something is “in” the vector, and which we saw in the last module.

Use it to “mind your Ps and Qs” like this:

letsamp[letsamp %in% c("P", "Q")]

Note
letsamp[letsamp %in% c("P", "Q")]
letsamp[letsamp %in% c("P", "Q")]

Finally, don’t forget that you can use c to concatenate multiple vectors together in very flexible ways. Try:

c(LETTERS[1:3], c("H", "I"), letsamp[letsamp %in% c("P", "Q")])

Note
c(LETTERS[1:3], c("H", "I"), letsamp[letsamp %in% c("P", "Q")])
c(LETTERS[1:3], c("H", "I"), letsamp[letsamp %in% c("P", "Q")])

Subsetting and merging data frames

Ok, let’s turn to the doing the same sort of tasks with data.frame as with did with various vector types. I’ve pre-loaded a few data.frame’s for you. The first is called cholera_merge. Let’s look at its first 6 rows:

cholera_merge[1:6, ]

Note
cholera_merge[1:6, ]
cholera_merge[1:6, ]

Use a logical operator to get certain rows:

cholera_merge[cholera_merge$area == 4, ]

Note
cholera_merge[cholera_merge$area == 4, ]
cholera_merge[cholera_merge$area == 4, ]

You can use subset to seemingly do the same thing:

subset(cholera_merge, area == 4)

Note
subset(cholera_merge, area == 4)
subset(cholera_merge, area == 4)

However, you will run into a problem if you try to do something like:

subset(cholera_merge, area == 4)$area <- 6

It won’t work. So don’t type it. Instead, you have to do this:

cholera_merge[cholera_merge$area == 4, ]$area <- 6

Note
cholera_merge[cholera_merge$area == 4, ]$area <- 6
cholera_merge[cholera_merge$area == 4, ]$area <- 6

Verify that all the area 4 are now area 6…

subset(cholera_merge, area == 4)

Note
subset(cholera_merge, area == 4)
subset(cholera_merge, area == 4)

… and …

subset(cholera_merge, area == 6)

Note
subset(cholera_merge, area == 6)
subset(cholera_merge, area == 6)

Appears it worked. Now it is your turn to put cholera_merge back the way it was using the pattern of test and assignment I showed you above.

Note
cholera_merge[cholera_merge$area == 6, ]$area <- 4
cholera_merge[cholera_merge$area == 6, ]$area <- 4

You can combine multiple conditions using & and |, for example, try:

cholera_merge[cholera_merge$sex == "F" & cholera_merge$area == 3, ]

Note
cholera_merge[cholera_merge$sex == "F" & cholera_merge$area == 3, ]
cholera_merge[cholera_merge$sex == "F" & cholera_merge$area == 3, ]

Columns work the same as rows (I used both together to keep the output short). Try:

cholera_merge[1:3, 3:4]

Note
cholera_merge[1:3, 3:4]
cholera_merge[1:3, 3:4]

Another example:

cholera_merge[1:3, c("age", "sex")]

Note
cholera_merge[1:3, c("age", "sex")]
cholera_merge[1:3, c("age", "sex")]

And another:

cholera_merge[cholera_merge$area == 4, c("age", "sex")]

Note
cholera_merge[cholera_merge$area == 4, c("age", "sex")]
cholera_merge[cholera_merge$area == 4, c("age", "sex")]

The function paste0 pastes together character vectors without any space between them. It’ll come in handy in a moment. Try it:

paste0("a", "b", "c")

Note

It is 0 (zero) not O (oh)

Note
paste0("a", "b", "c")
paste0("a", "b", "c")

Works with vectors as one of the arguments also. Try:

paste0("cholera", 1:5)

Note
paste0("cholera", 1:5)
paste0("cholera", 1:5)

Another useful function is assign that has the same effect as <- but works as a function with a character vector. It allows us to programmatically specify the name using a character vector which is very helpful. Try:

assign("x", 1:3)

Note
assign("x", 1:3)
assign("x", 1:3)

Check that x is what you think it should be.

Note
x
x

Let’s divide up cholera_merge into smaller datasets by area. With assign and paste0 and the subset methods we’ve been studying. We can do it with a loop like the following. Study it carefully, make sure you understand it, and then run it.

Note
for(i in 1:5) { assign(paste0("cholera", i), cholera_merge[cholera_merge$area == i, ]) }
for(i in 1:5) {
  assign(paste0("cholera", i), cholera_merge[cholera_merge$area == i, ])
}

Now there are five datasets named cholera1, cholera2, \(\dots\), cholera5 each containing one area’s data.

See if cholera1 is what you think it should be.

Note
cholera1
cholera1

You can use the function rbind to put these five datasets back together as c_m like this:1

c_m <- rbind(cholera1, cholera2, cholera3, cholera4, cholera5)

Note
c_m <- rbind(cholera1, cholera2, cholera3, cholera4, cholera5)
c_m <- rbind(cholera1, cholera2, cholera3, cholera4, cholera5)

Check it looks put back together with head(c_m).

Note
head(c_m)
head(c_m)

NROW tells you how many rows a dataset has. Does c_m have the same number as cholera_merge?

NROW(c_m) == NROW(cholera_merge)

Note
NROW(c_m) == NROW(cholera_merge)
NROW(c_m) == NROW(cholera_merge)

Not a perfect check, but looks good enough to convince us that we’ve probably reconstructed the data correctly. Now, let’s say we wanted our data in a different format. Try this:

(threecho <- cbind(cholera1, cholera2, cholera3))

Note
(threecho <- cbind(cholera1, cholera2, cholera3))
(threecho <- cbind(cholera1, cholera2, cholera3))

See a potential problem? The names of the columns are not unique. That’s going to be an issue if you try to do some of the things we talked about because R won’t know which one you want. You can examine the names of a dataset like this:

names(threecho)

Note
names(threecho)
names(threecho)

If you can’t seem to get it to work, start again at Exercise 42.

And you can change them with an assignment like this:

names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))

Try different parts from inside out before running it all if you can’t immediately see what this does.

Note
names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))
names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))

If you can’t seem to get it to work, start again at Exercise 42.

Now take a look at threecho again.

Note
threecho
threecho

If you can’t seem to get it to work, start again at Exercise 42.

Pretty spiffy, eh? Now let’s turn to merging. This is usually what you want to do instead of cbind because order is not what matters, but making sure you are matching on the common variable(s) between the two datasets. Look at the head of another dataset you have available in this module: cholera_disease

Note
head(cholera_disease)
head(cholera_disease)

Since cholera_disease shares the subject variable with cholera_merge you can put the two together like this:

cmer <- merge(cholera_merge, cholera_disease)

Note
cmer <- merge(cholera_merge, cholera_disease)
cmer <- merge(cholera_merge, cholera_disease)

Now check out the first 10 rows of cmer like this (the second argument to head specifies how many rows you want):

head(cmer, 10)

Note
head(cmer, 10)
head(cmer, 10)

R automatically determines the common columns by finding the columns with the same name in each data.frame, but what if no columns have the same name? Let’s change the names of cholera_disease so none match cholera_merge.

names(cholera_disease) <- c("id", "disease")

Note
names(cholera_disease) <- c("id", "disease")
names(cholera_disease) <- c("id", "disease")

And merge:

cmer2 <- merge(cholera_merge, cholera_disease)

Note
cmer2 <- merge(cholera_merge, cholera_disease)
cmer2 <- merge(cholera_merge, cholera_disease)

How many rows does cmer have? (Not cmer2, but cmer{.r}.)

Note

NROW is the function you need.

Note
NROW(cmer)
NROW(cmer)

That should make sense because cholera_merge had 25 rows and each one matched up with one in cholera_disease. But how many rows does cmer2 have?

Note
NROW(cmer2)
NROW(cmer2)

So what happened? What is the square root of 625? (remember the sqrt function?)

Note
sqrt(625)
sqrt(625)

What happened? Since there were no common columns R created the merge by combining every observation in the first data.frame with each observation in the second: \(25 \times 25 = 625\). Let’s inform R what columns we want to match on by using some optional arguments of merge (definitely take a look at merge’s help at some point). The by.x argument refers to the first data.frame’s variables and the by.y argument refers to the second data.frame’s variables in merge’s arguments (when you look at the help you’ll see that the first argument is named x and the second is y).

Now merge again like this instead:

NROW(merge(cholera_merge, cholera_disease, by.x = "subject", by.y = "id"))

Note
NROW(merge(cholera_merge, cholera_disease, by.x = "subject", by.y = "id"))
NROW(merge(cholera_merge, cholera_disease, by.x = "subject", by.y = "id"))

Looks like we have an expected, correct merge again; indeed you do, and I’ll let you convince yourself later if you need to. What if, instead, all the rows are not in both datasets? Let’s create a small sample (only 5 rows) from the original cholera_disease dataset.

In the code below, the set.seed function makes sure you get the same random sample each time you run the code. Let’s talk more in class about how random sampling works in computers. Then, we use sample to randomly select 5 row numbers from one to the number of rows in cholera_disease obtained by the NROW function. Finally we use those row numbers to subset cholera_disease to create cdsamp.

Note
set.seed(596) cdsamp <- cholera_disease[sample(NROW(cholera_disease), 5), ]
set.seed(596) 
cdsamp <- cholera_disease[sample(NROW(cholera_disease), 5), ]

Merge it with cholera_merge like this:

merge(cholera_merge, cdsamp)

Note
merge(cholera_merge, cdsamp)
merge(cholera_merge, cdsamp)

Holy disappearing rows, Batman! Looks like merge only keeps the matches. By default, that’s true, so be careful and be sure to specify another option if you want to keep all the rows of one of the two datasets (or both). Use the all.x = TRUE argument to keep all the rows of the first data.frame, all.y = TRUE to keep all the rows of the second data.frame, or all = TRUE to keep all rows of both. As an example, try this:

(merall <- merge(cholera_merge, cdsamp, all.x = TRUE))

Note
(merall <- merge(cholera_merge, cdsamp, all.x = TRUE))
(merall <- merge(cholera_merge, cdsamp, all.x = TRUE))

All the rows from the first data.frame, i.e., x, are there with missing data from the second data.frame, i.e., y, represented by NA.

Until you gain more familiarity you should probably merge with all = TRUE because those NA’s will save you a lot of heartache when you are merging datasets that you do not realize are incompatible somehow.

Sorting vectors and data frames

OK, next topic for this exercise, i.e., sorting. Sort a vector like this:

sort(merall$age)

Note
sort(merall$age)
sort(merall$age)

It is a little trickier if you want to reorder a data.frame by one of its variables. For that you need the order function, try it:

order(merall$age)

Note
order(merall$age)
order(merall$age)

What in the world is going on here? Look at merall$age and compare to the output you just got.

Note
merall$age
merall$age

OK, what is the 2nd value of merall$age? The 11th? The 18th? The 6th? Maybe you should write them down. Do you see how order tells you which values to take to sort merall$age?

If fact, you can sort merall$age like this:

merall$age[order(merall$age)]

Note
merall$age[order(merall$age)]
merall$age[order(merall$age)]

The order function is more useful for a data.frame where it can be used to sort the data.frame by one of its variables. Sort merall by age:

merall[order(merall$age), ]

Note
merall[order(merall$age), ]
merall[order(merall$age), ]

Can you make heads or tails of it?

OK, last new topic - getting the first and last items of a vector or data.frame. You’ve seen head which gets the first values of a vector or the first rows of a data.frame. It is 6 values or rows by default:

head(merall)

Note
head(merall)
head(merall)

Want more or less? Use the second argument:

head(merall, 10)

Note
head(merall, 10)
head(merall, 10)

The tail function works just like head but gives you the end rather than the top.

tail(merall, 10)

Note
tail(merall, 10)
tail(merall, 10)

You can use negative numbers to get “all but” from the relevant end. So with head you get all the rows at the top minus that number of rows.

This will give you all but the last five rows:

head(merall, -5)

Note
head(merall, -5)
head(merall, -5)

And tail again works like head - so this is all but the first 15 rows

tail(merall, -15)

Note
tail(merall, -15)
tail(merall, -15)

Evaluation

Submit Your Assignment

Footnotes

  1. A better way to do this exists if you have lots of datasets, but it is tricky to understand at first and we should wait until later to try it.↩︎