Food Prep

Exercise

Slicing, Dicing, and Concatenating Vectors

This module focuses on several common data manipulation tasks: selecting, subsetting, concatinating, merging, and sorting. Some you have already seen before, but this is a good time for review and reinforcement. For example, when you want to select a part of a vector use square brackets, [], placing between them a numeric vector that lists the elements you want to extract. R has a builtin vector that contains the capital letters: LETTERS. Let’s try a few of things with that vector. Remember we are just using LETTERS as a stand in for a vector you may want to subset in some way:

LETTERS

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

# what is the 10th letter?
LETTERS[10]

## [1] "J"

# what are the 15th - 22nd letters?
LETTERS[15:22]

## [1] "O" "P" "Q" "R" "S" "T" "U" "V"

# what are the odd indexed letters?
LETTERS[c(1,3,5,7,9,11,13,15,17,19,21,23,25)]

##  [1] "A" "C" "E" "G" "I" "K" "M" "O" "Q" "S" "U" "W" "Y"

The latter is a little annoying. That seems just like a task that a computer should be able to do for you. Indeed, it can, and there is a function in R called seq that can easily construct some complex sequences:

# start at 1, end at 26, go by twos
seq(1, 26, by = 2)

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25

LETTERS[seq(1, 26, by = 2)]

##  [1] "A" "C" "E" "G" "I" "K" "M" "O" "Q" "S" "U" "W" "Y"

# how about the even ones?
LETTERS[seq(2, 26, by = 2)]

##  [1] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"

Another good function: rep. It is technically a mnemonic for “replicate”" but I typically remember “repeat”:

# repeat the number 5 10 times
rep(5, 10)

##  [1] 5 5 5 5 5 5 5 5 5 5

# repeat c(1, 2) 5 times
rep(c(1, 2), 5)

##  [1] 1 2 1 2 1 2 1 2 1 2

# repeat 1, 5 times; and 2, 5 times
rep(c(1, 2), each = 5)

##  [1] 1 1 1 1 1 2 2 2 2 2

# repeat 1, 3 times; 2, 2 times; 3, 1 time
rep(1:3, 3:1)

## [1] 1 1 1 2 2 3

And who said you can’t ask for something more than once when you subset a vector? Nobody!

LETTERS[rep(1:3, 3:1)]

## [1] "A" "A" "A" "B" "B" "C"

Want to get rid of something? Use negative numbers:

LETTERS[-26:-21]

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T"

# another way to get the even numbers:
LETTERS[-seq(1, 26, by = 2)]

##  [1] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"

Logical vectors return the true ones:

x <- 1:5
x[c(TRUE, FALSE, TRUE, TRUE, FALSE)]

## [1] 1 3 4

# yet another way to get the even letters
# (remember R "recycles" shorter vectors)
LETTERS[c(FALSE, TRUE)]

##  [1] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"

Remember you can create logical vectors using the relational and logical operators. Let’s create a random sample of 100 letters (remember the set.seed is just so you can exactly replicate my example).

set.seed(596)
letsamp <- sample(LETTERS, 100, replace = TRUE)
letsamp

##   [1] "J" "K" "S" "P" "N" "C" "Z" "Z" "H" "J" "F" "H" "M" "K" "E" "E" "Z"
##  [18] "Z" "S" "J" "D" "K" "R" "X" "E" "B" "Q" "S" "A" "K" "V" "V" "V" "F"
##  [35] "Y" "T" "E" "H" "E" "P" "I" "O" "Z" "T" "A" "T" "V" "H" "J" "F" "A"
##  [52] "E" "B" "J" "G" "C" "X" "Z" "F" "N" "W" "C" "M" "P" "E" "L" "N" "A"
##  [69] "S" "D" "O" "M" "R" "A" "M" "M" "M" "R" "M" "Y" "G" "T" "L" "F" "A"
##  [86] "T" "F" "G" "V" "B" "K" "R" "K" "R" "C" "A" "G" "E" "C" "W"

letsamp == "P"

##   [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE

# how many Ps? add it up - remember TRUE becomes 1 when using as a numeric vector
sum(letsamp == "P")

## [1] 3

# mind your Ps and Qs
letsamp[letsamp %in% c("P", "Q")]

## [1] "P" "Q" "P" "P"

Finally, don’t forget you can use c to concatenate multiple vectors together, not just single numbers:

c(LETTERS[1:3], c("H", "I"), LETTERS[24:26])

## [1] "A" "B" "C" "H" "I" "X" "Y" "Z"

Slicing, Dicing, and Concatenating Dataframes

Load in the datasets with:

library(devtools) # install_github("advdatamgmt/adm") if you need to

library(adm)

A lot of what you need is analogous to vectors. Just remember that data.frame has rows and columns, rows come first.

# first 6 rows of cholera_merge
cholera_merge[1:6, ]

##   area subject age sex
## 1    1    1001  34   M
## 2    1    1002  20   F
## 3    1    1003  32   M
## 4    1    1004  34   F
## 5    1    1005  57   M
## 6    2    2001  23   M

Using a logical to select rows:

cholera_merge[cholera_merge$area == 4, ]

##    area subject age sex
## 16    4    4001  49   F
## 17    4    4002  29   F
## 18    4    4003  20   M
## 19    4    4004  58   F
## 20    4    4005  35   M

# or the subset command
subset(cholera_merge, area == 4)

##    area subject age sex
## 16    4    4001  49   F
## 17    4    4002  29   F
## 18    4    4003  20   M
## 19    4    4004  58   F
## 20    4    4005  35   M

The second seems easier, so why use the first? You’ll need it if you want to assign back to your selection. For example:

cholera_merge[cholera_merge$area == 4, ]$area <- 6
# none
subset(cholera_merge, area == 4)

## [1] area    subject age     sex    
## <0 rows> (or 0-length row.names)

# because they are here
subset(cholera_merge, area == 6)

##    area subject age sex
## 16    6    4001  49   F
## 17    6    4002  29   F
## 18    6    4003  20   M
## 19    6    4004  58   F
## 20    6    4005  35   M

# doesn't work
subset(cholera_merge, area == 6)$area <- 6

## Error in subset(cholera_merge, area == 6)$area <- 6: could not find function "subset<-"

# put it back
cholera_merge[cholera_merge$area == 6, ]$area <- 4

# women from area 3
cholera_merge[cholera_merge$sex == "F" & cholera_merge$area == 3, ]

##    area subject age sex
## 12    3    3002  55   F
## 13    3    3003  58   F
## 14    3    3004  60   F

Columns work the same and together with rows if you want (here I used both to keep the output short):

cholera_merge[1:3, 3:4]

##   age sex
## 1  34   M
## 2  20   F
## 3  32   M

cholera_merge[1:3, c("age", "sex")]

##   age sex
## 1  34   M
## 2  20   F
## 3  32   M

cholera_merge[cholera_merge$area == 4, c("age", "sex")]

##    age sex
## 16  49   F
## 17  29   F
## 18  20   M
## 19  58   F
## 20  35   M

Use cbind to “bind” columns together and rbind to “bind” rows together (concatination of columns and rows). Let’s divide up cholera_merge into smaller datasets. We can use a loop and the assign function which has the same effect as <- but works as a function (allowing us to programmatically specify the name). The function paste0 pastes together character vectors without any space between them.

paste0("a", "b", "c")

## [1] "abc"

paste0("cholera", 1:5)

## [1] "cholera1" "cholera2" "cholera3" "cholera4" "cholera5"

assign("x", 1:3)
x

## [1] 1 2 3

for(i in 1:5) {
  assign(paste0("cholera", i), cholera_merge[cholera_merge$area == i, ])
}

ls() # lists your assigned variables

##  [1] "c_m"             "cdsamp"          "cholera_disease"
##  [4] "cholera_merge"   "cholera1"        "cholera2"       
##  [7] "cholera3"        "cholera4"        "cholera5"       
## [10] "cmer"            "cmer2"           "df"             
## [13] "i"               "letsamp"         "merall"         
## [16] "params"          "t"               "threecho"       
## [19] "x"

cholera1

##   area subject age sex
## 1    1    1001  34   M
## 2    1    1002  20   F
## 3    1    1003  32   M
## 4    1    1004  34   F
## 5    1    1005  57   M

cholera2

##    area subject age sex
## 6     2    2001  23   M
## 7     2    2002  50   F
## 8     2    2003  74   F
## 9     2    2004  47   M
## 10    2    2005  40   F

Now put it back together:

c_m <- rbind(cholera1, cholera2, cholera3, cholera4, cholera5)
c_m[1:6, ]

##   area subject age sex
## 1    1    1001  34   M
## 2    1    1002  20   F
## 3    1    1003  32   M
## 4    1    1004  34   F
## 5    1    1005  57   M
## 6    2    2001  23   M

NROW(c_m) == NROW(cholera_merge)

## [1] TRUE

Do something strange with the columns:

cbind(cholera1[, 1:2], cholera2[, 3:4])

##   area subject age sex
## 1    1    1001  23   M
## 2    1    1002  50   F
## 3    1    1003  74   F
## 4    1    1004  47   M
## 5    1    1005  40   F

What if you cbind them a few of the little datasets together?

# extra set of parentheses prints the assignment
(threecho <- cbind(cholera1, cholera2, cholera3))

##   area subject age sex area subject age sex area subject age sex
## 1    1    1001  34   M    2    2001  23   M    3    3001  20   M
## 2    1    1002  20   F    2    2002  50   F    3    3002  55   F
## 3    1    1003  32   M    2    2003  74   F    3    3003  58   F
## 4    1    1004  34   F    2    2004  47   M    3    3004  60   F
## 5    1    1005  57   M    2    2005  40   F    3    3005  40   M

# hmmm... variables with same name
names(threecho)

##  [1] "area"    "subject" "age"     "sex"     "area"    "subject" "age"    
##  [8] "sex"     "area"    "subject" "age"     "sex"

# guess what you can assign directly to names
names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))
threecho

##   area1 subject1 age1 sex1 area2 subject2 age2 sex2 area3 subject3 age3
## 1     1     1001   34    M     2     2001   23    M     3     3001   20
## 2     1     1002   20    F     2     2002   50    F     3     3002   55
## 3     1     1003   32    M     2     2003   74    F     3     3003   58
## 4     1     1004   34    F     2     2004   47    M     3     3004   60
## 5     1     1005   57    M     2     2005   40    F     3     3005   40
##   sex3
## 1    M
## 2    F
## 3    F
## 4    F
## 5    M

Merging dataframes

Merging is the way you combine datasets with common variable names. Take cholera_disease as an example:

cholera_disease[1:5, ]

##     id disease
## 1 1001       0
## 2 1002       0
## 3 1003       0
## 4 1004       0
## 5 1005       1

It shares the subject with cholera_merge so we can merge them together:

cmer <- merge(cholera_merge, cholera_disease)
cmer[1:10, ]

##    area subject age sex   id disease
## 1     1    1001  34   M 1001       0
## 2     1    1002  20   F 1001       0
## 3     1    1003  32   M 1001       0
## 4     1    1004  34   F 1001       0
## 5     1    1005  57   M 1001       0
## 6     2    2001  23   M 1001       0
## 7     2    2002  50   F 1001       0
## 8     2    2003  74   F 1001       0
## 9     2    2004  47   M 1001       0
## 10    2    2005  40   F 1001       0

R automatically determines the common columns if their names are the same, but what if they are not? You’ll have to use some of the optional arguments to the function.

names(cholera_disease) <- c("id", "disease")
cmer2 <- merge(cholera_merge, cholera_disease)
# what happened?
NROW(cmer)

## [1] 625

NROW(cmer2)

## [1] 625

What happened? Since there were no common columns R created the merge by combining every observation in the first data.frame with each observation in the second: $25 x 25 = 625$. So, tell R what to merge on (the first dataset is x and the second is y in the naming convention):

NROW(merge(cholera_merge, cholera_disease, by.x = "subject", by.y = "id"))

## [1] 25

What if observations are missing in one of the data.frames?

set.seed(596) # because sample is random
(cdsamp <- cholera_disease[sample(1:NROW(cholera_disease), 5), ])

##      id disease
## 10 2005       1
## 25 5005       0
## 17 4002       1
## 13 3003       1
## 12 3002       1

merge(cholera_merge, cdsamp, by.x = "subject", by.y = "id")

##   subject area age sex disease
## 1    2005    2  40   F       1
## 2    3002    3  55   F       1
## 3    3003    3  58   F       1
## 4    4002    4  29   F       1
## 5    5005    5  77   F       0

Holy disappearing rows, Batman! Merge only keeps the matches. Be careful with that fact and be sure to specify another option if you want to keep all the rows of one:

(merall <- merge(cholera_merge, cdsamp, by.x = "subject", by.y = "id", all.x = TRUE))

##    subject area age sex disease
## 1     1001    1  34   M      NA
## 2     1002    1  20   F      NA
## 3     1003    1  32   M      NA
## 4     1004    1  34   F      NA
## 5     1005    1  57   M      NA
## 6     2001    2  23   M      NA
## 7     2002    2  50   F      NA
## 8     2003    2  74   F      NA
## 9     2004    2  47   M      NA
## 10    2005    2  40   F       1
## 11    3001    3  20   M      NA
## 12    3002    3  55   F       1
## 13    3003    3  58   F       1
## 14    3004    3  60   F      NA
## 15    3005    3  40   M      NA
## 16    4001    4  49   F      NA
## 17    4002    4  29   F       1
## 18    4003    4  20   M      NA
## 19    4004    4  58   F      NA
## 20    4005    4  35   M      NA
## 21    5001    5  39   M      NA
## 22    5002    5  27   M      NA
## 23    5003    5  49   M      NA
## 24    5004    5  63   M      NA
## 25    5005    5  77   F       0

merall[20:25, ]

##    subject area age sex disease
## 20    4005    4  35   M      NA
## 21    5001    5  39   M      NA
## 22    5002    5  27   M      NA
## 23    5003    5  49   M      NA
## 24    5004    5  63   M      NA
## 25    5005    5  77   F       0

All the rows from the x data.frame are there with missing data from the y one represented by NA. Most frequently you want to use one or both of the all options which will save you a lot of heartache when you are merging datasets that you do not realize are incompatible.

Sorting vectors and dataframes

Sorting is a common operation.

sort(merall$age)

##  [1] 20 20 20 23 27 29 32 34 34 35 39 40 40 47 49 49 50 55 57 58 58 60 63
## [24] 74 77

But what if you want to sort by something in a data.frame. What you would want to know is the “subset” that was ordered in a way that it returned the elements in sorted order. That’s where the order function comes in:

merall$age

##  [1] 34 20 32 34 57 23 50 74 47 40 20 55 58 60 40 49 29 20 58 35 39 27 49
## [24] 63 77

order(merall$age)

##  [1]  2 11 18  6 22 17  3  1  4 20 21 10 15  9 16 23  7 12  5 13 19 14 24
## [24]  8 25

See how if you take each element of merall$age given by order you will get the sorted list? What is the 2nd element of merall$age? 20 The 11th? 20. So:

merall$age[order(merall$age)]

##  [1] 20 20 20 23 27 29 32 34 34 35 39 40 40 47 49 49 50 55 57 58 58 60 63
## [24] 74 77

identical(merall$age[order(merall$age)], sort(merall$age))

## [1] TRUE

order is more useful for data.frames where it can be used to sort the data.frame. Here we sort merall by age:

merall[order(merall$age), ]

##    subject area age sex disease
## 2     1002    1  20   F      NA
## 11    3001    3  20   M      NA
## 18    4003    4  20   M      NA
## 6     2001    2  23   M      NA
## 22    5002    5  27   M      NA
## 17    4002    4  29   F       1
## 3     1003    1  32   M      NA
## 1     1001    1  34   M      NA
## 4     1004    1  34   F      NA
## 20    4005    4  35   M      NA
## 21    5001    5  39   M      NA
## 10    2005    2  40   F       1
## 15    3005    3  40   M      NA
## 9     2004    2  47   M      NA
## 16    4001    4  49   F      NA
## 23    5003    5  49   M      NA
## 7     2002    2  50   F      NA
## 12    3002    3  55   F       1
## 5     1005    1  57   M      NA
## 13    3003    3  58   F       1
## 19    4004    4  58   F      NA
## 14    3004    3  60   F      NA
## 24    5004    5  63   M      NA
## 8     2003    2  74   F      NA
## 25    5005    5  77   F       0

Extracting the first and last rows of dataframes

head and tail extract the first and last elements of a vector or the first and last rows of a data.frame (6 by default):

head(merall)

##   subject area age sex disease
## 1    1001    1  34   M      NA
## 2    1002    1  20   F      NA
## 3    1003    1  32   M      NA
## 4    1004    1  34   F      NA
## 5    1005    1  57   M      NA
## 6    2001    2  23   M      NA

tail(merall)

##    subject area age sex disease
## 20    4005    4  35   M      NA
## 21    5001    5  39   M      NA
## 22    5002    5  27   M      NA
## 23    5003    5  49   M      NA
## 24    5004    5  63   M      NA
## 25    5005    5  77   F       0

You can add an optional argument to tell it how many to extract:

head(merall, 1)

##   subject area age sex disease
## 1    1001    1  34   M      NA

tail(merall, 2)

##    subject area age sex disease
## 24    5004    5  63   M      NA
## 25    5005    5  77   F       0

You can use negative numbers to extract “all but” that many:

head(merall, -10)

##    subject area age sex disease
## 1     1001    1  34   M      NA
## 2     1002    1  20   F      NA
## 3     1003    1  32   M      NA
## 4     1004    1  34   F      NA
## 5     1005    1  57   M      NA
## 6     2001    2  23   M      NA
## 7     2002    2  50   F      NA
## 8     2003    2  74   F      NA
## 9     2004    2  47   M      NA
## 10    2005    2  40   F       1
## 11    3001    3  20   M      NA
## 12    3002    3  55   F       1
## 13    3003    3  58   F       1
## 14    3004    3  60   F      NA
## 15    3005    3  40   M      NA

tail(merall, -15)

##    subject area age sex disease
## 16    4001    4  49   F      NA
## 17    4002    4  29   F       1
## 18    4003    4  20   M      NA
## 19    4004    4  58   F      NA
## 20    4005    4  35   M      NA
## 21    5001    5  39   M      NA
## 22    5002    5  27   M      NA
## 23    5003    5  49   M      NA
## 24    5004    5  63   M      NA
## 25    5005    5  77   F       0

Explore and Extend

Write a function named extreme that takes a numeric vector as input and returns a list of two elements: the first named smallest containing the smallest three elements in the numeric vector and the second list element named biggest containing the largest three elements in the numeric vector.
Write a function sampdf that takes a data.frame as an argument and single number and returns a random sample of the rows of the dataframe with the number of rows specified in the argument.
Write a loop that takes the cholera_merge and extracts the rows containing the oldest person in each area. Hint: combine select/subset techniques, with order and tail. For now, just use the print function to print out each row.

Evaluate

Take the last “Explore and Extend” exercise one more step and use rbind to create a new data.frame from the rows. On the first round of the loop, you’ll need to save the first row as the new data.frame. On the subsequent iterations (rounds through the loop), you’ll need to rbind to the data.frame and save that to the data.frame variable name you choose. Hear some branch logic that description?

for(i in 1:5) {
  x <- cholera_merge[cholera_merge$area == i, ]
  x <- x[order(x$age), ]
  t <- tail(x, 1)
  if(i == 1) {
    df <- t
  } else {
    df <- rbind(df, t)
  }
}

Then, think about how you can use this for our esoph transformation from case-control to long format. Try out your ideas as time permits.