This module focuses on several common data manipulation tasks: selecting, subsetting, concatinating, merging, and sorting. Some you have already seen before, but this is a good time for review and reinforcement. For example, when you want to select a part of a vector
use square brackets, []
, placing between them a numeric vector that lists the elements you want to extract. R has a builtin vector
that contains the capital letters: LETTERS
. Let’s try a few of things with that vector
. Remember we are just using LETTERS
as a stand in for a vector
you may want to subset in some way:
LETTERS
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
# what is the 10th letter?
LETTERS[10]
## [1] "J"
# what are the 15th - 22nd letters?
LETTERS[15:22]
## [1] "O" "P" "Q" "R" "S" "T" "U" "V"
# what are the odd indexed letters?
LETTERS[c(1,3,5,7,9,11,13,15,17,19,21,23,25)]
## [1] "A" "C" "E" "G" "I" "K" "M" "O" "Q" "S" "U" "W" "Y"
The latter is a little annoying. That seems just like a task that a computer should be able to do for you. Indeed, it can, and there is a function in R called seq
that can easily construct some complex sequences:
# start at 1, end at 26, go by twos
seq(1, 26, by = 2)
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25
LETTERS[seq(1, 26, by = 2)]
## [1] "A" "C" "E" "G" "I" "K" "M" "O" "Q" "S" "U" "W" "Y"
# how about the even ones?
LETTERS[seq(2, 26, by = 2)]
## [1] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"
Another good function: rep
. It is technically a mnemonic for “replicate”" but I typically remember “repeat”:
# repeat the number 5 10 times
rep(5, 10)
## [1] 5 5 5 5 5 5 5 5 5 5
# repeat c(1, 2) 5 times
rep(c(1, 2), 5)
## [1] 1 2 1 2 1 2 1 2 1 2
# repeat 1, 5 times; and 2, 5 times
rep(c(1, 2), each = 5)
## [1] 1 1 1 1 1 2 2 2 2 2
# repeat 1, 3 times; 2, 2 times; 3, 1 time
rep(1:3, 3:1)
## [1] 1 1 1 2 2 3
And who said you can’t ask for something more than once when you subset a vector? Nobody!
LETTERS[rep(1:3, 3:1)]
## [1] "A" "A" "A" "B" "B" "C"
Want to get rid of something? Use negative numbers:
LETTERS[-26:-21]
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T"
# another way to get the even numbers:
LETTERS[-seq(1, 26, by = 2)]
## [1] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"
Logical vectors return the true ones:
x <- 1:5
x[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
## [1] 1 3 4
# yet another way to get the even letters
# (remember R "recycles" shorter vectors)
LETTERS[c(FALSE, TRUE)]
## [1] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"
Remember you can create logical vectors using the relational and logical operators. Let’s create a random sample of 100 letters (remember the set.seed
is just so you can exactly replicate my example).
set.seed(596)
letsamp <- sample(LETTERS, 100, replace = TRUE)
letsamp
## [1] "J" "K" "S" "P" "N" "C" "Z" "Z" "H" "J" "F" "H" "M" "K" "E" "E" "Z"
## [18] "Z" "S" "J" "D" "K" "R" "X" "E" "B" "Q" "S" "A" "K" "V" "V" "V" "F"
## [35] "Y" "T" "E" "H" "E" "P" "I" "O" "Z" "T" "A" "T" "V" "H" "J" "F" "A"
## [52] "E" "B" "J" "G" "C" "X" "Z" "F" "N" "W" "C" "M" "P" "E" "L" "N" "A"
## [69] "S" "D" "O" "M" "R" "A" "M" "M" "M" "R" "M" "Y" "G" "T" "L" "F" "A"
## [86] "T" "F" "G" "V" "B" "K" "R" "K" "R" "C" "A" "G" "E" "C" "W"
letsamp == "P"
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
# how many Ps? add it up - remember TRUE becomes 1 when using as a numeric vector
sum(letsamp == "P")
## [1] 3
# mind your Ps and Qs
letsamp[letsamp %in% c("P", "Q")]
## [1] "P" "Q" "P" "P"
Finally, don’t forget you can use c
to concatenate multiple vectors together, not just single numbers:
c(LETTERS[1:3], c("H", "I"), LETTERS[24:26])
## [1] "A" "B" "C" "H" "I" "X" "Y" "Z"
Load in the datasets with:
library(devtools) # install_github("advdatamgmt/adm") if you need to
library(adm)
A lot of what you need is analogous to vectors. Just remember that data.frame
has rows and columns, rows come first.
# first 6 rows of cholera_merge
cholera_merge[1:6, ]
## area subject age sex
## 1 1 1001 34 M
## 2 1 1002 20 F
## 3 1 1003 32 M
## 4 1 1004 34 F
## 5 1 1005 57 M
## 6 2 2001 23 M
Using a logical to select rows:
cholera_merge[cholera_merge$area == 4, ]
## area subject age sex
## 16 4 4001 49 F
## 17 4 4002 29 F
## 18 4 4003 20 M
## 19 4 4004 58 F
## 20 4 4005 35 M
# or the subset command
subset(cholera_merge, area == 4)
## area subject age sex
## 16 4 4001 49 F
## 17 4 4002 29 F
## 18 4 4003 20 M
## 19 4 4004 58 F
## 20 4 4005 35 M
The second seems easier, so why use the first? You’ll need it if you want to assign back to your selection. For example:
cholera_merge[cholera_merge$area == 4, ]$area <- 6
# none
subset(cholera_merge, area == 4)
## [1] area subject age sex
## <0 rows> (or 0-length row.names)
# because they are here
subset(cholera_merge, area == 6)
## area subject age sex
## 16 6 4001 49 F
## 17 6 4002 29 F
## 18 6 4003 20 M
## 19 6 4004 58 F
## 20 6 4005 35 M
# doesn't work
subset(cholera_merge, area == 6)$area <- 6
## Error in subset(cholera_merge, area == 6)$area <- 6: could not find function "subset<-"
# put it back
cholera_merge[cholera_merge$area == 6, ]$area <- 4
# women from area 3
cholera_merge[cholera_merge$sex == "F" & cholera_merge$area == 3, ]
## area subject age sex
## 12 3 3002 55 F
## 13 3 3003 58 F
## 14 3 3004 60 F
Columns work the same and together with rows if you want (here I used both to keep the output short):
cholera_merge[1:3, 3:4]
## age sex
## 1 34 M
## 2 20 F
## 3 32 M
cholera_merge[1:3, c("age", "sex")]
## age sex
## 1 34 M
## 2 20 F
## 3 32 M
cholera_merge[cholera_merge$area == 4, c("age", "sex")]
## age sex
## 16 49 F
## 17 29 F
## 18 20 M
## 19 58 F
## 20 35 M
Use cbind
to “bind” columns together and rbind
to “bind” rows together (concatination of columns and rows). Let’s divide up cholera_merge
into smaller datasets. We can use a loop and the assign
function which has the same effect as <-
but works as a function (allowing us to programmatically specify the name). The function paste0
pastes together character
vectors without any space between them.
paste0("a", "b", "c")
## [1] "abc"
paste0("cholera", 1:5)
## [1] "cholera1" "cholera2" "cholera3" "cholera4" "cholera5"
assign("x", 1:3)
x
## [1] 1 2 3
for(i in 1:5) {
assign(paste0("cholera", i), cholera_merge[cholera_merge$area == i, ])
}
ls() # lists your assigned variables
## [1] "c_m" "cdsamp" "cholera_disease"
## [4] "cholera_merge" "cholera1" "cholera2"
## [7] "cholera3" "cholera4" "cholera5"
## [10] "cmer" "cmer2" "df"
## [13] "i" "letsamp" "merall"
## [16] "params" "t" "threecho"
## [19] "x"
cholera1
## area subject age sex
## 1 1 1001 34 M
## 2 1 1002 20 F
## 3 1 1003 32 M
## 4 1 1004 34 F
## 5 1 1005 57 M
cholera2
## area subject age sex
## 6 2 2001 23 M
## 7 2 2002 50 F
## 8 2 2003 74 F
## 9 2 2004 47 M
## 10 2 2005 40 F
Now put it back together:
c_m <- rbind(cholera1, cholera2, cholera3, cholera4, cholera5)
c_m[1:6, ]
## area subject age sex
## 1 1 1001 34 M
## 2 1 1002 20 F
## 3 1 1003 32 M
## 4 1 1004 34 F
## 5 1 1005 57 M
## 6 2 2001 23 M
NROW(c_m) == NROW(cholera_merge)
## [1] TRUE
Do something strange with the columns:
cbind(cholera1[, 1:2], cholera2[, 3:4])
## area subject age sex
## 1 1 1001 23 M
## 2 1 1002 50 F
## 3 1 1003 74 F
## 4 1 1004 47 M
## 5 1 1005 40 F
What if you cbind
them a few of the little datasets together?
# extra set of parentheses prints the assignment
(threecho <- cbind(cholera1, cholera2, cholera3))
## area subject age sex area subject age sex area subject age sex
## 1 1 1001 34 M 2 2001 23 M 3 3001 20 M
## 2 1 1002 20 F 2 2002 50 F 3 3002 55 F
## 3 1 1003 32 M 2 2003 74 F 3 3003 58 F
## 4 1 1004 34 F 2 2004 47 M 3 3004 60 F
## 5 1 1005 57 M 2 2005 40 F 3 3005 40 M
# hmmm... variables with same name
names(threecho)
## [1] "area" "subject" "age" "sex" "area" "subject" "age"
## [8] "sex" "area" "subject" "age" "sex"
# guess what you can assign directly to names
names(threecho) <- paste0(c("area", "subject", "age", "sex"), rep(1:3, each = 4))
threecho
## area1 subject1 age1 sex1 area2 subject2 age2 sex2 area3 subject3 age3
## 1 1 1001 34 M 2 2001 23 M 3 3001 20
## 2 1 1002 20 F 2 2002 50 F 3 3002 55
## 3 1 1003 32 M 2 2003 74 F 3 3003 58
## 4 1 1004 34 F 2 2004 47 M 3 3004 60
## 5 1 1005 57 M 2 2005 40 F 3 3005 40
## sex3
## 1 M
## 2 F
## 3 F
## 4 F
## 5 M
Merging is the way you combine datasets with common variable names. Take cholera_disease
as an example:
cholera_disease[1:5, ]
## id disease
## 1 1001 0
## 2 1002 0
## 3 1003 0
## 4 1004 0
## 5 1005 1
It shares the subject
with cholera_merge
so we can merge them together:
cmer <- merge(cholera_merge, cholera_disease)
cmer[1:10, ]
## area subject age sex id disease
## 1 1 1001 34 M 1001 0
## 2 1 1002 20 F 1001 0
## 3 1 1003 32 M 1001 0
## 4 1 1004 34 F 1001 0
## 5 1 1005 57 M 1001 0
## 6 2 2001 23 M 1001 0
## 7 2 2002 50 F 1001 0
## 8 2 2003 74 F 1001 0
## 9 2 2004 47 M 1001 0
## 10 2 2005 40 F 1001 0
R automatically determines the common columns if their names are the same, but what if they are not? You’ll have to use some of the optional arguments to the function.
names(cholera_disease) <- c("id", "disease")
cmer2 <- merge(cholera_merge, cholera_disease)
# what happened?
NROW(cmer)
## [1] 625
NROW(cmer2)
## [1] 625
What happened? Since there were no common columns R created the merge by combining every observation in the first data.frame
with each observation in the second: \(25 x 25 = 625\). So, tell R what to merge on (the first dataset is x
and the second is y
in the naming convention):
NROW(merge(cholera_merge, cholera_disease, by.x = "subject", by.y = "id"))
## [1] 25
What if observations are missing in one of the data.frames
?
set.seed(596) # because sample is random
(cdsamp <- cholera_disease[sample(1:NROW(cholera_disease), 5), ])
## id disease
## 10 2005 1
## 25 5005 0
## 17 4002 1
## 13 3003 1
## 12 3002 1
merge(cholera_merge, cdsamp, by.x = "subject", by.y = "id")
## subject area age sex disease
## 1 2005 2 40 F 1
## 2 3002 3 55 F 1
## 3 3003 3 58 F 1
## 4 4002 4 29 F 1
## 5 5005 5 77 F 0
Holy disappearing rows, Batman! Merge only keeps the matches. Be careful with that fact and be sure to specify another option if you want to keep all the rows of one:
(merall <- merge(cholera_merge, cdsamp, by.x = "subject", by.y = "id", all.x = TRUE))
## subject area age sex disease
## 1 1001 1 34 M NA
## 2 1002 1 20 F NA
## 3 1003 1 32 M NA
## 4 1004 1 34 F NA
## 5 1005 1 57 M NA
## 6 2001 2 23 M NA
## 7 2002 2 50 F NA
## 8 2003 2 74 F NA
## 9 2004 2 47 M NA
## 10 2005 2 40 F 1
## 11 3001 3 20 M NA
## 12 3002 3 55 F 1
## 13 3003 3 58 F 1
## 14 3004 3 60 F NA
## 15 3005 3 40 M NA
## 16 4001 4 49 F NA
## 17 4002 4 29 F 1
## 18 4003 4 20 M NA
## 19 4004 4 58 F NA
## 20 4005 4 35 M NA
## 21 5001 5 39 M NA
## 22 5002 5 27 M NA
## 23 5003 5 49 M NA
## 24 5004 5 63 M NA
## 25 5005 5 77 F 0
merall[20:25, ]
## subject area age sex disease
## 20 4005 4 35 M NA
## 21 5001 5 39 M NA
## 22 5002 5 27 M NA
## 23 5003 5 49 M NA
## 24 5004 5 63 M NA
## 25 5005 5 77 F 0
All the rows from the x
data.frame
are there with missing data from the y
one represented by NA
. Most frequently you want to use one or both of the all options which will save you a lot of heartache when you are merging datasets that you do not realize are incompatible.
Sorting is a common operation.
sort(merall$age)
## [1] 20 20 20 23 27 29 32 34 34 35 39 40 40 47 49 49 50 55 57 58 58 60 63
## [24] 74 77
But what if you want to sort by something in a data.frame
. What you would want to know is the “subset” that was ordered in a way that it returned the elements in sorted order. That’s where the order
function comes in:
merall$age
## [1] 34 20 32 34 57 23 50 74 47 40 20 55 58 60 40 49 29 20 58 35 39 27 49
## [24] 63 77
order(merall$age)
## [1] 2 11 18 6 22 17 3 1 4 20 21 10 15 9 16 23 7 12 5 13 19 14 24
## [24] 8 25
See how if you take each element of merall$age
given by order
you will get the sorted list? What is the 2nd element of merall$age
? 20
The 11th? 20
. So:
merall$age[order(merall$age)]
## [1] 20 20 20 23 27 29 32 34 34 35 39 40 40 47 49 49 50 55 57 58 58 60 63
## [24] 74 77
identical(merall$age[order(merall$age)], sort(merall$age))
## [1] TRUE
order
is more useful for data.frames
where it can be used to sort the data.frame
. Here we sort merall
by age
:
merall[order(merall$age), ]
## subject area age sex disease
## 2 1002 1 20 F NA
## 11 3001 3 20 M NA
## 18 4003 4 20 M NA
## 6 2001 2 23 M NA
## 22 5002 5 27 M NA
## 17 4002 4 29 F 1
## 3 1003 1 32 M NA
## 1 1001 1 34 M NA
## 4 1004 1 34 F NA
## 20 4005 4 35 M NA
## 21 5001 5 39 M NA
## 10 2005 2 40 F 1
## 15 3005 3 40 M NA
## 9 2004 2 47 M NA
## 16 4001 4 49 F NA
## 23 5003 5 49 M NA
## 7 2002 2 50 F NA
## 12 3002 3 55 F 1
## 5 1005 1 57 M NA
## 13 3003 3 58 F 1
## 19 4004 4 58 F NA
## 14 3004 3 60 F NA
## 24 5004 5 63 M NA
## 8 2003 2 74 F NA
## 25 5005 5 77 F 0
head
and tail
extract the first and last elements of a vector
or the first and last rows of a data.frame
(6 by default):
head(merall)
## subject area age sex disease
## 1 1001 1 34 M NA
## 2 1002 1 20 F NA
## 3 1003 1 32 M NA
## 4 1004 1 34 F NA
## 5 1005 1 57 M NA
## 6 2001 2 23 M NA
tail(merall)
## subject area age sex disease
## 20 4005 4 35 M NA
## 21 5001 5 39 M NA
## 22 5002 5 27 M NA
## 23 5003 5 49 M NA
## 24 5004 5 63 M NA
## 25 5005 5 77 F 0
You can add an optional argument to tell it how many to extract:
head(merall, 1)
## subject area age sex disease
## 1 1001 1 34 M NA
tail(merall, 2)
## subject area age sex disease
## 24 5004 5 63 M NA
## 25 5005 5 77 F 0
You can use negative numbers to extract “all but” that many:
head(merall, -10)
## subject area age sex disease
## 1 1001 1 34 M NA
## 2 1002 1 20 F NA
## 3 1003 1 32 M NA
## 4 1004 1 34 F NA
## 5 1005 1 57 M NA
## 6 2001 2 23 M NA
## 7 2002 2 50 F NA
## 8 2003 2 74 F NA
## 9 2004 2 47 M NA
## 10 2005 2 40 F 1
## 11 3001 3 20 M NA
## 12 3002 3 55 F 1
## 13 3003 3 58 F 1
## 14 3004 3 60 F NA
## 15 3005 3 40 M NA
tail(merall, -15)
## subject area age sex disease
## 16 4001 4 49 F NA
## 17 4002 4 29 F 1
## 18 4003 4 20 M NA
## 19 4004 4 58 F NA
## 20 4005 4 35 M NA
## 21 5001 5 39 M NA
## 22 5002 5 27 M NA
## 23 5003 5 49 M NA
## 24 5004 5 63 M NA
## 25 5005 5 77 F 0
Write a function named extreme
that takes a numeric vector
as input and returns a list
of two elements: the first named smallest
containing the smallest three elements in the numeric vector
and the second list
element named biggest
containing the largest three elements in the numeric vector
.
Write a function sampdf
that takes a data.frame
as an argument and single number and returns a random sample of the rows of the dataframe with the number of rows specified in the argument.
Write a loop that takes the cholera_merge
and extracts the rows containing the oldest person in each area. Hint: combine select/subset techniques, with order
and tail
. For now, just use the print
function to print out each row.
Take the last “Explore and Extend” exercise one more step and use rbind
to create a new data.frame
from the rows. On the first round of the loop, you’ll need to save the first row as the new data.frame
. On the subsequent iterations (rounds through the loop), you’ll need to rbind
to the data.frame
and save that to the data.frame
variable name you choose. Hear some branch logic that description?
for(i in 1:5) {
x <- cholera_merge[cholera_merge$area == i, ]
x <- x[order(x$age), ]
t <- tail(x, 1)
if(i == 1) {
df <- t
} else {
df <- rbind(df, t)
}
}
Then, think about how you can use this for our esoph
transformation from case-control to long format. Try out your ideas as time permits.