Food Prep

Author

Affiliation

Beau B. Bruce, MD, PhD

Emory University

Exposition

Introduction

This module focuses on several common data manipulation tasks: selecting, subsetting, concatenating, merging, and sorting. Some you have already seen before, but now is a good time for review and reinforcement. For example, recall that when you want to select a part of a vector, you use square brackets, [], placing between them a numeric vector that lists the elements you want to extract.

Let’s get started!

Subsetting and combining vectors

R has a built-in vector that contains the capital letters: LETTERS. Let’s try a few things with that vector. First just type, LETTERS to see what is inside.

So, how would you get the 10th letter from LETTERS?

How about selecting the 15th through the 22nd letters? Don’t forget about the : operator!

If I wanted the odd letters, I could do this:

LETTERS[c(1,3,5,7,9,11,13,15,17,19,21,23,25)]

But that’s a lot of work. Anytime you see something like this in your code you should try to figure out a way to do it more automatically. There is a function called seq that creates sequences of numbers.

Type:

seq(1, 26, by = 2)

to see what it does.

Before proceeding, take a look at the help for seq since it is a very useful function.

Now use seq to get the odd letters from LETTERS.

Note

Any of the following will work:

LETTERS[seq(1, length.out = 13, by = 2)]
LETTERS[seq(to = 25, by = 2)]
LETTERS[seq(to = 26, by = 2)]
LETTERS[seq(1, 25, by = 2)]
LETTERS[seq(1, 26, by = 2)]

Now get the even letters from LETTERS also using seq.

Another good function named rep, which is technically a mnemonic for “replicate” but I typically remember it as “repeat”. Try rep(5, 10).

Again, such a useful function that you should take a look at the help for it.

Now that you’ve looked through the help, let’s see more of the flexibility of rep. Here are some examples. Based on what you read in the help, see if you can figure out what each one will do before you run it. If any do not make sense, make a note to ask about it in class.

rep(c(1, 2), 5)

Next try:

rep(c(1, 2), each = 5)

Next try:

rep(1:3, 3:1)

And there is nothing to stop you from getting the same element of a vector more than once - try:

LETTERS[rep(1:3, 3:1)]

Want to get rid of something? Use negative numbers:

LETTERS[-26:-21]

So this is another way to get the even numbers!

LETTERS[-seq(1, 26, by = 2)]

Remember that R returns the values where they are TRUE when you use a logical vector to subset a vector.

If I’ve created a vector n5 that contains the numbers 1 through 5 like this:

Think about what you will get if you type:

n5[c(TRUE, FALSE, TRUE, TRUE, FALSE)]

And then try it to see if you get the result you expected.

Because R “recycles” a vector used for subsetting until the length is the same as the longer vector (remember that from our early lessons?) you can get the even letters this way!

LETTERS[c(FALSE, TRUE)]

Now use the same technique for the odd letters! How?

The vector letsamp contains a vector of 100 random letters. Type letsamp to take a look at it.

Use == to test which letters are P.

Because R treats TRUE as equal to 1 when a logical vector is converted to a numeric vector you can use the sum function to count the P’s like this:

sum(letsamp == "P")

A very useful operator is the %in% operator, which tests if something is “in” the vector, and which we saw in the last module.

Use it to “mind your Ps and Qs” like this:

letsamp[letsamp %in% c("P", "Q")]

Finally, don’t forget that you can use c to concatenate multiple vectors together in very flexible ways. Try:

c(LETTERS[1:3], c("H", "I"), letsamp[letsamp %in% c("P", "Q")])

Subsetting and merging data frames

Ok, let’s turn to the doing the same sort of tasks with data.frame as with did with various vector types. I’ve pre-loaded a few data.frame’s for you. The first is called cholera_merge. Let’s look at its first 6 rows:

cholera_merge[1:6, ]

Use a logical operator to get certain rows:

cholera_merge[cholera_merge$area == 4, ]

You can use subset to seemingly do the same thing:

subset(cholera_merge, area == 4)

However, you will run into a problem if you try to do something like:

subset(cholera_merge, area == 4)$area <- 6

It won’t work. So don’t type it. Instead, you have to do this:

cholera_merge[cholera_merge$area == 4, ]$area <- 6

Verify that all the area 4 are now area 6…

subset(cholera_merge, area == 4)

… and …

subset(cholera_merge, area == 6)

Appears it worked. Now it is your turn to put cholera_merge back the way it was using the pattern of test and assignment I showed you above.

You can combine multiple conditions using & and |, for example, try:

cholera_merge[cholera_merge$sex == "F" & cholera_merge$area == 3, ]

Columns work the same as rows (I used both together to keep the output short). Try:

cholera_merge[1:3, 3:4]

Another example:

cholera_merge[1:3, c("age", "sex")]

And another:

cholera_merge[cholera_merge$area == 4, c("age", "sex")]

The function paste0 pastes together character vectors without any space between them. It’ll come in handy in a moment. Try it:

paste0("a", "b", "c")

Works with vectors as one of the arguments also. Try:

paste0("cholera", 1:5)

Another useful function is assign that has the same effect as <- but works as a function with a character vector. It allows us to programmatically specify the name using a character vector which is very helpful. Try:

assign("x", 1:3)

Check that x is what you think it should be.

Let’s divide up cholera_merge into smaller datasets by area. With assign and paste0 and the subset methods we’ve been studying. We can do it with a loop like the following. Study it carefully, make sure you understand it, and then run it.

Note

for(i in 1:5) {
  assign(paste0("cholera", i), cholera_merge[cholera_merge$area == i, ])
}

Now there are five datasets named cholera1, cholera2, $\dots$, cholera5 each containing one area’s data.

See if cholera1 is what you think it should be.

You can use the function rbind to put these five datasets back together as c_m like this:¹

c_m <- rbind(cholera1, cholera2, cholera3, cholera4, cholera5)

Check it looks put back together with head(c_m).

NROW tells you how many rows a dataset has. Does c_m have the same number as cholera_merge?

NROW(c_m) == NROW(cholera_merge)

Not a perfect check, but looks good enough to convince us that we’ve probably reconstructed the data correctly. Now, let’s say we wanted our data in a different format. Try this:

(threecho <- cbind(cholera1, cholera2, cholera3))

See a potential problem? The names of the columns are not unique. That’s going to be an issue if you try to do some of the things we talked about because R won’t know which one you want. You can examine the names of a dataset like this:

names(threecho)

And you can change them with an assignment like this:

names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))

Try different parts from inside out before running it all if you can’t immediately see what this does.

Note

names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))

If you can’t seem to get it to work, start again at Exercise 42.

Now take a look at threecho again.

Pretty spiffy, eh? Now let’s turn to merging. This is usually what you want to do instead of cbind because order is not what matters, but making sure you are matching on the common variable(s) between the two datasets. Look at the head of another dataset you have available in this module: cholera_disease

Since cholera_disease shares the subject variable with cholera_merge you can put the two together like this:

cmer <- merge(cholera_merge, cholera_disease)

Now check out the first 10 rows of cmer like this (the second argument to head specifies how many rows you want):

head(cmer, 10)

R automatically determines the common columns by finding the columns with the same name in each data.frame, but what if no columns have the same name? Let’s change the names of cholera_disease so none match cholera_merge.

names(cholera_disease) <- c("id", "disease")

And merge:

cmer2 <- merge(cholera_merge, cholera_disease)

How many rows does cmer have? (Not cmer2, but cmer{.r}.)

That should make sense because cholera_merge had 25 rows and each one matched up with one in cholera_disease. But how many rows does cmer2 have?

So what happened? What is the square root of 625? (remember the sqrt function?)

What happened? Since there were no common columns R created the merge by combining every observation in the first data.frame with each observation in the second: $25 \times 25 = 625$. Let’s inform R what columns we want to match on by using some optional arguments of merge (definitely take a look at merge’s help at some point). The by.x argument refers to the first data.frame’s variables and the by.y argument refers to the second data.frame’s variables in merge’s arguments (when you look at the help you’ll see that the first argument is named x and the second is y).

Now merge again like this instead:

NROW(merge(cholera_merge, cholera_disease, by.x = "subject", by.y = "id"))

Looks like we have an expected, correct merge again; indeed you do, and I’ll let you convince yourself later if you need to. What if, instead, all the rows are not in both datasets? Let’s create a small sample (only 5 rows) from the original cholera_disease dataset.

In the code below, the set.seed function makes sure you get the same random sample each time you run the code. Let’s talk more in class about how random sampling works in computers. Then, we use sample to randomly select 5 row numbers from one to the number of rows in cholera_disease obtained by the NROW function. Finally we use those row numbers to subset cholera_disease to create cdsamp.

Merge it with cholera_merge like this:

merge(cholera_merge, cdsamp)

Holy disappearing rows, Batman! Looks like merge only keeps the matches. By default, that’s true, so be careful and be sure to specify another option if you want to keep all the rows of one of the two datasets (or both). Use the all.x = TRUE argument to keep all the rows of the first data.frame, all.y = TRUE to keep all the rows of the second data.frame, or all = TRUE to keep all rows of both. As an example, try this:

(merall <- merge(cholera_merge, cdsamp, all.x = TRUE))

All the rows from the first data.frame, i.e., x, are there with missing data from the second data.frame, i.e., y, represented by NA.

Until you gain more familiarity you should probably merge with all = TRUE because those NA’s will save you a lot of heartache when you are merging datasets that you do not realize are incompatible somehow.

Sorting vectors and data frames

OK, next topic for this exercise, i.e., sorting. Sort a vector like this:

sort(merall$age)

It is a little trickier if you want to reorder a data.frame by one of its variables. For that you need the order function, try it:

order(merall$age)

What in the world is going on here? Look at merall$age and compare to the output you just got.

OK, what is the 2nd value of merall$age? The 11th? The 18th? The 6th? Maybe you should write them down. Do you see how order tells you which values to take to sort merall$age?

If fact, you can sort merall$age like this:

merall$age[order(merall$age)]

The order function is more useful for a data.frame where it can be used to sort the data.frame by one of its variables. Sort merall by age:

merall[order(merall$age), ]

Can you make heads or tails of it?

OK, last new topic - getting the first and last items of a vector or data.frame. You’ve seen head which gets the first values of a vector or the first rows of a data.frame. It is 6 values or rows by default:

head(merall)

Want more or less? Use the second argument:

head(merall, 10)

The tail function works just like head but gives you the end rather than the top.

tail(merall, 10)

You can use negative numbers to get “all but” from the relevant end. So with head you get all the rows at the top minus that number of rows.

This will give you all but the last five rows:

head(merall, -5)

And tail again works like head - so this is all but the first 15 rows

tail(merall, -15)

Evaluation

Submit Your Assignment

Footnotes

A better way to do this exists if you have lots of datasets, but it is tricky to understand at first and we should wait until later to try it.↩︎