Getting Data In & Out of R

Prepare

This is not a typical module. It contains only a brief tutorial on getting datasets in and out of R, along with a few file system commands.

Excel files

First, install the readxl package, which allows you to read Excel files:

install.packages('readxl')  # only need to do this once per installed version of R

To read an Excel file into R, load the readxl library once per session and use the read_excel function.

library(readxl)  # only once per session

Because I’ve saved the files we are going to use in the adm package, you need to ask R where the package directory is (your answer will differ depending on your operating system, your version of R, etc.). These steps are specific to accessing a raw data file contained in a package. You will NOT generally do this, but it is a convenient way to distribute the file to you.
Once you discover where the file is, however, the rest of the steps are the same.

system.file(package = "adm")

## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/adm"

The Excel file we are going to use is a version of the esoph dataset stored in the extdata directory of the package. If the command below returns an empty string (""), you have typed something wrong or your adm package is not up to date.

system.file("extdata/esoph.xlsx", package = "adm")

## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/adm/extdata/esoph.xlsx"
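If you would rather have R check the result for you, here is a minimal sketch. It relies only on the fact that system.file returns an empty string when nothing matches; the variable name path and the wording of the messages are my own choices for illustration.

```r
# Look up the file; system.file() returns "" when nothing matches.
path <- system.file("extdata/esoph.xlsx", package = "adm")

if (nzchar(path)) {
  message("Found the file at: ", path)
} else {
  message("File not found: check the spelling and update the adm package.")
}
```

If you prefer an error to a message, system.file also accepts mustWork = TRUE, which stops as soon as the file cannot be found.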

With our filename in hand, we can now call the read_excel function. The 1 tells it to read the first sheet of the file:

read_excel(system.file("extdata/esoph.xlsx", package = "adm"), 1)

##    agegp     alcgp    tobgp ncases ncontrols
## 1  25-34 0-39g/day 0-9g/day      0        40
## 2  25-34 0-39g/day    10-19      0        10
## 3  25-34 0-39g/day    20-29      0         6
## 4  25-34 0-39g/day      30+      0         5
## 5  25-34     40-79 0-9g/day      0        27
## 6  25-34     40-79    10-19      0         7
## 7  25-34     40-79    20-29      0         4
## 8  25-34     40-79      30+      0         7
## 9  25-34    80-119 0-9g/day      0         2
## 10 25-34    80-119    10-19      0         1
## 11 25-34    80-119      30+      0         2
## 12 25-34      120+ 0-9g/day      0         1
## 13 25-34      120+    10-19      1         1
## 14 25-34      120+    20-29      0         1
## 15 25-34      120+      30+      0         2
## 16 35-44 0-39g/day 0-9g/day      0        60
## 17 35-44 0-39g/day    10-19      1        14
## 18 35-44 0-39g/day    20-29      0         7
## 19 35-44 0-39g/day      30+      0         8
## 20 35-44     40-79 0-9g/day      0        35
## 21 35-44     40-79    10-19      3        23
## 22 35-44     40-79    20-29      1        14
## 23 35-44     40-79      30+      0         8
## 24 35-44    80-119 0-9g/day      0        11
## 25 35-44    80-119    10-19      0         6
## 26 35-44    80-119    20-29      0         2
## 27 35-44    80-119      30+      0         1
## 28 35-44      120+ 0-9g/day      2         3
## 29 35-44      120+    10-19      0         3
## 30 35-44      120+    20-29      2         4
## 31 45-54 0-39g/day 0-9g/day      1        46
## 32 45-54 0-39g/day    10-19      0        18
## 33 45-54 0-39g/day    20-29      0        10
## 34 45-54 0-39g/day      30+      0         4
## 35 45-54     40-79 0-9g/day      6        38
## 36 45-54     40-79    10-19      4        21
## 37 45-54     40-79    20-29      5        15
## 38 45-54     40-79      30+      5         7
## 39 45-54    80-119 0-9g/day      3        16
## 40 45-54    80-119    10-19      6        14
## 41 45-54    80-119    20-29      1         5
## 42 45-54    80-119      30+      2         4
## 43 45-54      120+ 0-9g/day      4         4
## 44 45-54      120+    10-19      3         4
## 45 45-54      120+    20-29      2         3
## 46 45-54      120+      30+      4         4
## 47 55-64 0-39g/day 0-9g/day      2        49
## 48 55-64 0-39g/day    10-19      3        22
## 49 55-64 0-39g/day    20-29      3        12
## 50 55-64 0-39g/day      30+      4         6
## 51 55-64     40-79 0-9g/day      9        40
## 52 55-64     40-79    10-19      6        21
## 53 55-64     40-79    20-29      4        17
## 54 55-64     40-79      30+      3         6
## 55 55-64    80-119 0-9g/day      9        18
## 56 55-64    80-119    10-19      8        15
## 57 55-64    80-119    20-29      3         6
## 58 55-64    80-119      30+      4         4
## 59 55-64      120+ 0-9g/day      5        10
## 60 55-64      120+    10-19      6         7
## 61 55-64      120+    20-29      2         3
## 62 55-64      120+      30+      5         6
## 63 65-74 0-39g/day 0-9g/day      5        48
## 64 65-74 0-39g/day    10-19      4        14
## 65 65-74 0-39g/day    20-29      2         7
## 66 65-74 0-39g/day      30+      0         2
## 67 65-74     40-79 0-9g/day     17        34
## 68 65-74     40-79    10-19      3        10
## 69 65-74     40-79    20-29      5         9
## 70 65-74    80-119 0-9g/day      6        13
## 71 65-74    80-119    10-19      4        12
## 72 65-74    80-119    20-29      2         3
## 73 65-74    80-119      30+      1         1
## 74 65-74      120+ 0-9g/day      3         4
## 75 65-74      120+    10-19      1         2
## 76 65-74      120+    20-29      1         1
## 77 65-74      120+      30+      1         1
## 78   75+ 0-39g/day 0-9g/day      1        18
## 79   75+ 0-39g/day    10-19      2         6
## 80   75+ 0-39g/day      30+      1         3
## 81   75+     40-79 0-9g/day      2         5
## 82   75+     40-79    10-19      1         3
## 83   75+     40-79    20-29      0         3
## 84   75+     40-79      30+      1         1
## 85   75+    80-119 0-9g/day      1         1
## 86   75+    80-119    10-19      1         1
## 87   75+      120+ 0-9g/day      2         2
## 88   75+      120+    10-19      1         1

Hopefully, it looks familiar. Of course you can save it for later:

xesoph <- read_excel(system.file("extdata/esoph.xlsx", package = "adm"), 1)

read_excel keeps text columns as character vectors, as though you had called data.frame with stringsAsFactors = FALSE. In general this is the best way to import data, given the problems you have seen when you start with factors and want to do additional data manipulation.
We will soon learn ways to convert variables to factors nearly automatically anyway.

Comma Delimited Text Files

Comma-delimited text, also called comma-separated values (CSV), is a common text file format for representing data. You can use the built-in read.csv function to read CSV files:

iih <- read.csv(system.file("extdata/iih.csv", package = "adm"), stringsAsFactors = FALSE)
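If you want to try read.csv without the adm package, here is a small self-contained sketch. It assumes nothing beyond base R: it writes the built-in esoph dataset to a temporary file (the names tmp and esoph_csv are made up for this example) and reads it back so you can see what stringsAsFactors = FALSE does to the text columns.

```r
# Write a built-in dataset to a temporary CSV file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
write.csv(esoph, tmp, row.names = FALSE)

# Read it back, keeping text columns as character vectors rather than factors.
esoph_csv <- read.csv(tmp, stringsAsFactors = FALSE)

str(esoph_csv)          # column types at a glance
class(esoph_csv$agegp)  # "character"

unlink(tmp)  # remove the temporary file
```

Since R 4.0.0, stringsAsFactors = FALSE is the default for read.csv, but spelling it out keeps the code behaving the same way on older versions.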

Excel can open and write CSV too.

File system commands

R always works in some directory of your file system. You can find out which one using getwd, which means “get working directory”:

getwd()

You can change this directory using setwd:

setwd("/Users/bbbruce/Documents")  # you need to modify this for your system and preferences

Since you may not be familiar with your file system, I recommend using RStudio’s menu. Go to Session -> Set Working Directory -> Choose Directory… so that you can browse to where you want to work. When a file is in your working directory, you do not need any of the fancy commands we were using earlier to figure out where it is.

For example, if you had a file called “iih.xlsx” in your working directory you would read it in like this:

read_excel("iih.xlsx", 1)  # read_excel has no stringsAsFactors argument; text columns are already kept as character

Likewise, if you write anything out, it will go into the working directory unless you tell R otherwise. So if you cannot find something you need, make sure to check what directory you were working in!
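When a file seems to have gone missing, a quick check like the following can help. This is only a sketch using base R functions; the file name "iih.xlsx" is the example from above, and the variable names wd and target are my own.

```r
# Where is R working right now?
wd <- getwd()

# Build the full path that a bare file name such as "iih.xlsx" would resolve to.
target <- file.path(wd, "iih.xlsx")

# Check before reading, and list the Excel files actually present if it is missing.
if (file.exists(target)) {
  message("Found: ", target)
} else {
  message("No iih.xlsx in ", wd, "; Excel files present here:")
  print(list.files(wd, pattern = "\\.xlsx$"))
}
```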

Writing data files

R also has write functions, such as write.csv for CSV files (which you can open easily in Excel):

write.csv(iih, "iih2.csv")

The file will be written into your working directory. Be careful: if a file with that name already exists, it will be overwritten.
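If accidental overwriting worries you, one option is to check with file.exists first. The sketch below is one way to do it, not something built into R: write_csv_safely is a hypothetical helper I made up for illustration, and passing row.names = FALSE (to leave out the row-number column) is my own choice. It uses the built-in esoph dataset and a temporary directory so that it runs anywhere.

```r
# Hypothetical helper (not part of base R): write a CSV only if the file does not exist yet.
write_csv_safely <- function(x, path) {
  if (file.exists(path)) {
    stop("Refusing to overwrite existing file: ", path)
  }
  write.csv(x, path, row.names = FALSE)  # row.names = FALSE omits the row-number column
  invisible(path)
}

# Demonstrated in a temporary directory so nothing in your working directory is touched.
out <- file.path(tempdir(), "esoph_copy.csv")
unlink(out)                    # start from a clean slate
write_csv_safely(esoph, out)   # first call writes the file
second <- try(write_csv_safely(esoph, out), silent = TRUE)  # second call is refused
inherits(second, "try-error")  # TRUE
```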

You can also directly save R objects:

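saveRDS(xesoph, file = "xesoph.rds")  # save each object to its own .rds file
saveRDS(iih, file = "iih.rds")

rm(xesoph)  # remove the objects from the current session
rm(iih)

xesoph <- readRDS("xesoph.rds")  # read the saved objects back in
iih <- readRDS("iih.rds")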

I often use these functions in conjunction with file.exists to cache long data operations so that I don’t have to wait for them to finish every time I run my code.
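Here is a rough sketch of that caching pattern. Everything named here is a stand-in: cache_file is a hypothetical location (a temporary directory, so the example runs anywhere), and slow_operation just sleeps for a second before summarizing the built-in esoph dataset to imitate a long computation.

```r
# Hypothetical cache location; in practice this would be a file in your working directory.
cache_file <- file.path(tempdir(), "slow_result.rds")

# Stand-in for a long-running data operation.
slow_operation <- function() {
  Sys.sleep(1)
  aggregate(cbind(ncases, ncontrols) ~ agegp, data = esoph, FUN = sum)
}

if (file.exists(cache_file)) {
  result <- readRDS(cache_file)       # fast path: reuse the saved result
} else {
  result <- slow_operation()          # slow path: compute once...
  saveRDS(result, file = cache_file)  # ...and save it for next time
}
```

The first run takes the slow path and later runs read the .rds file instead. Remember to delete the cache file whenever the underlying data or the computation changes, or you will keep getting the old result.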