Reproducible Research

Beau B. Bruce, MD, PhD

Emory University

Objectives

Learning Objectives

  1. Talk about our motivation, i.e, “reproducible research”
  2. Learn the basics of Markdown and Quarto
  3. Learn how to make a basic reproducible report
  4. Learn how to make the “meat” of a table that can be used for publication
  5. Learn how to make R inline and table output look better

Part 1.
Reproducible Research

What is “reproducible” research?

  • I put it in quotes because it would be better called repeatable or replicable research
  • However, the term that has been generally used reproducible so we are “stuck with it”
  • Why make the distinction?

The 4 R’s

The 4 R’s: repeat, replicate, reproduce, reuse

Is repeatability and replicability worth having?

  • The authors distinctions are correct, but…
  • I disagree with the premise of the paper, i.e., “replicability is not worth having”
  • Instead, I suggest replicability, and especially repeatability of your own results is the foundation of reproducibility
  • Without being able to at least repeat exactly the results of your own analyses or replicate that of another group, why should you trust the results in order so that you can begin to try to reproduce them?

Why is reproducible research needed?

  • Data and their analysis are increasingly complex
  • Analysis is part of the methods!
  • Generally, methods should be reported at the level that another group can replicate the results

Why is reproducible research needed?

  • Many experiments can not be reproduced
    • Most large clinical trials
    • Recent worries in the field of psychology
  • But replication is at least a step in the right direction

First page of Ioannidis’ “Why Most Published Research Findings Are False”

Because the IOM says so in Evolution of Translational Omics

  • Data & metadata used should be made publicly available
  • The fully specified computational procedures and exact computer code used for the development of the candidate omics-based test should be made sustainably available

IOM Report Cover

Because the IOM says so in Evolution of Translational Omics

“Ideally, the computer code that is released will encompass all the steps of computational analysis, including all data preprocessing steps\(....\)

All aspects of the analysis need to transparently reported.’’

IOM Report Cover

Sounds good, but how?

  • One technique is “literate programming”

    • Concept introduced by Donald Knuth (he’s kinda a big deal)

    • Write your reports/articles as a stream of code and text

    • Needs the human/documentation language + a machine/programming language

Donald Knuth

Keep the program easy to read

Programs are meant to be read by humans, and only incidentally for computers to execute.

— Harold Abelson and Gerald Jay Sussman

SICP Book Cover

Literate programming in R

  • In R, probably the best literate programming environment is Markdown (the human language) + R (the computer language) knit
    together with the help of knitr with Quarto.
  • These slides have been made with Quarto!

Quarto Logo

Literate programming in R

Each time you build a Quarto document it starts from scratch and executes in order preventing you from making mistakes such as:

  • executing something you didn’t intend to make permanent
  • changing an object in the middle while you are working in the console that changes your results

A (tiny) challenge

Need to know a little about 3-ish computer languages, i.e.,

  • R,
  • markdown, and
  • YAML.

If you know some HTML/CSS or \(\LaTeX\), you can get even fancier.

Good news! You don’t need to know much of each, and all are fairly simple for the basics. So you will be able to build a great foundation today!

Part 2.
Markdown and Quarto

Markdown

Philosophy

“A Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.”

— John Gruber (inventor of Markdown)

John Gruber B/W Photo

Inline formatting

You can create emphasis (italics), strong emphasis (bold), super-/subscripts.

*Emphasized* and **strongly emphasized**.

Water's molecular formula is H~2~O and Avogadro's 
constant is 6.022 x 10^23^ mol^-1^.

Emphasized and strongly emphasized.

Water’s molecular formula is H2O and Avogadro’s constant is 6.022 x 1023 mol-1.

Math

Beautifully typesetting math is relatively easy too if you know a little \(\LaTeX\)

The series $a + ar + ar^2 + \dots + ar^{n - 1}$ equals
$$ \sum_{k = 0}^{n - 1} ar^k = a \frac{1 - r^n}{1 - r}$$

The series \(a + ar + ar^2 + \dots + ar^{n - 1}\) equals \[ \sum_{k = 0}^{n - 1} ar^k = a \frac{1 - r^n}{1 - r}\]

Code

And you can include blocks of code/algorithms by placing 4 spaces or one tab in front of each line:

    # Step 1
    printf("Hello world!")
    # Step 2
    exit()
# Step 1
printf("Hello world!")
# Step 2
exit()

Code

But most often you can put code in fenced blocks using triple backticks (```)

```
# Step 1
printf("Hello world!")
# Step 2
exit()
```
# Step 1
printf("Hello world!")
# Step 2
exit()

Headers

Two styles - style 1

```
Level1
======

Hello

Level2
------

Hi!
```

Level 1

Hello!

Level 2

Hi!

Headers

Two styles - style 2

```
# Level1
A
## Level2
### Level3
B
#### Level4
```

Level 1

A

Level 2

Level 3

B

Level 4

Paragraphs

Separate by one or more blank lines. Newlines are otherwise treated as spaces allowing you to wrap your text as you like.

Paragraph 1 text. Ipsum dolor sit amet, consectetur adipiscing elit.

Paragraph 
2
text. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Paragraph 1 text. Ipsum dolor sit amet, consectetur adipiscing elit.

Paragraph 2 text. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Block quotes and smart punctuation

> "Snakes. Why did it
> have to be snakes?" 

---Indiana Jones

An en-dash is used for ranges of numbers: 2--5 pages.

“Snakes. Why did it have to be snakes?”

—Indiana Jones

An en-dash is used for ranges of numbers: 2–5 pages.

Line breaks

Preserves line breaks like this.

| The limerick packs laughs anatomical
| In space that is quite economical.
|     But the good ones I've seen
|     So seldom are clean
| And the clean ones so seldom are comical
The limerick packs laughs anatomical
In space that is quite economical.
    But the good ones I’ve seen
    So seldom are clean
And the clean ones so seldom are comical

Unordered Lists

Four space or one tab rule to nest.

* first item
* second item
    * nested item
    + doesn't matter what you start with
- third item
  • first item
  • second item
    • nested item
    • doesn’t matter what you start with
  • third item

Ordered list

The numbers don’t matter (except the first tells it where to start). Nest by indenting four spaces.

25.  Twenty-fifth
8.  Twenty-sixth
    1.  First nested
    85.  Second nested
3.  Twenty-seventh
  1. Twenty-fifth
  2. Twenty-sixth
    1. First nested
    2. Second nested
  3. Twenty-seventh

Ordered list

But try to be friendly to those reading it in text form or use this alternate form, i.e.,

#. First item
#. Second item
    #. Nested first
    #. Nested second
  1. First item
  2. Second item
    1. Nested first
    2. Nested second

Definition list

Attribution
 ~  *n.*, The act of attributing, especially the act of establishing a particular person as the creator of a work of art.
 ~  *n.*, Something, such as a quality or characteristic, that is related to a particular possessor; an attribute.

Definition
 ~ *n.*, A statement conveying fundamental character.
 ~ *n.*, A statement of the meaning of a word, phrase, or term, as in a dictionary entry.
 ~ *n.*, The act or process of stating a precise meaning or significance; formulation of a meaning.
Attribution
n., The act of attributing, especially the act of establishing a particular person as the creator of a work of art.
n., Something, such as a quality or characteristic, that is related to a particular possessor; an attribute.
Definition
n., A statement conveying fundamental character.
n., A statement of the meaning of a word, phrase, or term, as in a dictionary entry.
n., The act or process of stating a precise meaning or significance; formulation of a meaning.

Horizontal rules

A line containing a row of three or more *, -, or _ characters (optionally separated by spaces) produces a horizontal rule:

*  *  *  *

---------------


Cross-references

Various items can be cross-referenced including headers, figures, tables, equations, and code blocks.

### The main section {#sec-main}

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 

### A later section 

Varius congue suscipit placerat enim ante id sollicitudin. Enim congue vulputate nisi aptent rhoncus nec.

The main section

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

A later section

Varius congue suscipit placerat enim ante id sollicitudin. Enim congue vulputate nisi aptent rhoncus nec.

Cross-references

Various items can be cross-referenced including headers, figures, tables, equations, and code blocks.

### An even later section

Felis praesent sollicitudin vulputate tellus finibus praesent varius quis diam erat litora phasellus. 

### The final section?

Yes, good stuff is here too, but maybe you want to go back to
[the main section](#sec-main) and review.

An even later section

Felis praesent sollicitudin vulputate tellus finibus praesent varius quis diam erat litora phasellus.

The final section?

Yes, good stuff is here too, but maybe you want to go back to the main section and review.

Images

![The R Logo](https://www.r-project.org/logo/Rlogo.svg)

The R Logo

Images

Size it with a little help from CSS (no space around equals sign!)

![The R Logo](https://www.r-project.org/logo/Rlogo.svg){width=25%}

The R Logo

Footnotes

Here is a footnote reference,[^1] and another.[^longnote]

[^1]: Here is the footnote.

[^longnote]: Here's one with multiple blocks.

    Subsequent paragraphs are indented to show that they belong to the previous footnote.

    The whole paragraph can be indented, or just the first line.

This paragraph won't be part of the note, because it isn't indented.

Here is a footnote reference,1 and another.2

This paragraph won’t be part of the note, because it isn’t indented.

Inline notes

Here is an inline note.^[Inlines notes are easier to write, since
you don't have to pick an identifier and move down to type the
note.]

Here is an inline note.1

Tables

Several formats available. Painful to type yourself (so don’t! - see the knitr part later!)

+---------------+---------------+--------------------+
| Fruit         | Price         | Advantages         |
+===============+===============+====================+
| Bananas       | $1.34         | - bright color     |
+---------------+---------------+--------------------+
| Oranges       | $2.10         | - cures scurvy     |
+---------------+---------------+--------------------+
Fruit Price Advantages
Bananas $1.34
  • bright color
Oranges $2.10
  • cures scurvy

Tables

Several formats available. Painful to type yourself (so don’t! - see the knitr part later!)

| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|   12  |  12  |    12   |    12  |
|  123  |  123 |   123   |   123  |
|    1  |    1 |     1   |     1  |
Right Left Default Center
12 12 12 12
123 123 123 123
1 1 1 1

And a lot, lot more!

  • Bibliographic citations using BibTEX, EndNote, etc.
  • Special markup for slides
    • Incremental lists
    • Pauses
    • Speaker notes
  • Raw HTML, \(\TeX\), \(\LaTeX\) can be included depending on the target document
  • Additional features and subtlety to many things we covered

Part 3.
A basic reproducible report in Quarto

Quarto document (.qmd)

A Quarto document has

  • A YAML header,
  • markdown text for the narrative with code blocks embedded containing
  • R code to display and execute.

The YAML header

Contains metadata that becomes titles, headings, etc. and controls output formatting. (Hover over the circled numbers for more detail.)

---

title: "Quarto documents (finally!)"
author: "Beau B. Bruce"
format:
    html:
        embed-resources: true
engine: knitr
---
1
This nested structure in format: specifies that we want to produce an HTML document with embedded resources which means that you can send the file to someone else, e.g., by e-mail, and it will still work.
2
engine: knitr specifies the use of the knitr engine for code chunks.

The YAML header

Some formatting flexibility is available, but generally you have to live with what it produces unless you want to get into HTML/CSS customization writing special formatters in yet another programming language.

---

title: |
       | This is a long title with
       | just the line breaks
       | that I want
author: "[Beau B. Bruce, MD, PhD](mailto:lue7@cdc.gov)"
abstract: This report details a few key things.

date: May 28, 2017

format: 

    html: 

        embed-resources: true

engine: knitr

---
1
Do these line breaks look familiar?
2
You can include links.

The YAML header

---
title: "R markdown / knitr (finally!)"
author: "Beau B. Bruce"
format:
  html:
    theme: united
    toc: true
    embed-resources: true
engine: knitr
---

# Section 1

Some text

# Section 2

Some more text
1
Add a theme, united here.
2
Add a table of contents that by default floats alongside the document.

R code blocks

Now, start to combine text and R code. (No YAML header this time just to help us focus on the R code blocks.)

# Motor Trend Car Road Tests



The following is a list of the miles per gallon for

each of the vehicles included in the dataset.



```{r}
data(mtcars)
mtcars$mpg
```
1
The R code block starts with ```{r}. The r indicates the language.
2
Write R code as you normally would here.

Chunk attributes: echo

Chunk attributes control many features of the display. For example, you can control whether the code is shown or not.

# Motor Trend Car Road Tests



The following is a list of the miles per gallon for

each of the vehicles included in the dataset.



```{r}

#| echo: false
data(mtcars)

mtcars$mpg

```
1
echo: false makes the code invisible in the output.

Chunk attributes: comment

Or what is used to comment the results or if it is commented at all.

# Motor Trend Car Road Tests



The following is a list of the miles per gallon for

each of the vehicles included in the dataset.



```{r}

#| comment: "#"
data(mtcars)

mtcars$mpg

```
1
comment = "#" adds these characters before each line of output.

Chunk attributes: eval

Show the code, but do not run it.

# Motor Trend Car Road Tests



The following is a list of the miles per gallon for

each of the vehicles included in the dataset.



```{r}

#| eval: false
data(mtcars)

mtcars$mpg

```
1
eval: false prevents the code from being run.

Chunk attributes: include

You can also do neither, but still run the code which is very useful for setup/configuration.

# Motor Trend Car Road Tests



The following is a list of the miles per gallon for

each of the vehicles included in the dataset.



```{r}

#| include: false
data(mtcars)

mtcars$mpg

```
1
include: false runs the code, but shows neither the code nor its output.

Chunk attributes: name

Blocks can be named and this is most useful for locating where errors occur.

# Motor Trend Car Road Tests



The following is a list of the miles per gallon for

each of the vehicles included in the dataset.



```{r}

#| name: setup

#| include: false

data(mtcars)

mtcars$mpg

```

R code blocks

Plotting just works… most of the time, but there are chunk attributes for when it doesn’t (e.g., figure height/width, format, etc.). Note that if we had already used mtcars as we did in the blocks above, we would not have to call data(mtcars). The document works as a single code file, which is one way that it creates reproducible results.

# Motor Trend Car Road Tests



Miles per gallon *vs* number of cylinders.



```{r plot}

data(mtcars)

plot(mpg ~ cyl, mtcars)

```

Global options

You can set global options in the YAML header for the most common way that you want each R code block run. You can override those global options just by changing any at the relevant block as we were doing before.

---
execute:
  echo: false
  warning: false
  message: false
  fig-width: 12
  fig-height: 8
format:
  html:
    embed-resources: true
engine: knitr
---

Read more about chunk options at https://quarto.org/docs/output-formats/html-code.html.

Inline R code

Many times you do not want to have raw R output in your files especially if they are for public consumption. One way to deal with that is inline R code.

---

engine: knitr

---



The cars had an average miles 

per gallon of `{r} mean(mtcars$mpg)`.

However, this output is not really ideal either and in a bit we will see how to make it better. Before we do that, let’s see how to create tables too.

Part 4.
Automatically building tables for publication

Tables

There are several ways to make tables. One of the easiest is knitr::kable because you already have knitr available.

---

engine: knitr

format:                                  

    html:                                

        embed-resources: true

execute: 

  echo: false

---



```{r}

#| results: asis

knitr::kable(mtcars)

```

Tables

If knitr::kable does not meet your needs, you can explore other packages, such as gt, dt, or htmlTable.

Most often use their output options to produce HTML tables in output: asis blocks just like we did with knitr::kable.

The final product

  • Unlike most webpages, everything is included in your final file: images, special formatting, etc. if you use the embed-resources: true option.
  • You can e-mail them to others and they should work with nothing else! (old Internet Explorer sometimes excluded…)

Part 5.
Making output look better with sprintf

What is sprintf?

  • sprintf is an R function for text formatting
  • Name and format specifications used are based on the C programming language conventions
  • However, these formats are widely used in many programming languages, e.g., Python, etc.
  • sprintf is useful for making your R output more user friendly for inline text and tables.

Example

nums <- 1:10
srts <- sqrt(nums)
sprintf("%.2f", srts)
 [1] "1.00" "1.41" "1.73" "2.00" "2.24" "2.45" "2.65" "2.83" "3.00" "3.16"
sprintf("%.1f", srts)
 [1] "1.0" "1.4" "1.7" "2.0" "2.2" "2.4" "2.6" "2.8" "3.0" "3.2"

The structure of a format

Starts with a percent sign % and ends with a letter (case sensitive)

  • The most useful ending letters are
    • d: integer,
    • f: fixed point decimal notation,
    • e or E: scientific notation,
      • [-]m.ddde[+-]xx or
      • [-]m.dddE[+-]xx, and
    • s: character string
  • Note that you can still use s with integers (R will convert)

Example 2

sprintf("%d: (%.2f)", nums, srts)
 [1] "1: (1.00)"  "2: (1.41)"  "3: (1.73)"  "4: (2.00)"  "5: (2.24)" 
 [6] "6: (2.45)"  "7: (2.65)"  "8: (2.83)"  "9: (3.00)"  "10: (3.16)"
sprintf("%2s: (%.2f)", nums, srts)
 [1] " 1: (1.00)" " 2: (1.41)" " 3: (1.73)" " 4: (2.00)" " 5: (2.24)"
 [6] " 6: (2.45)" " 7: (2.65)" " 8: (2.83)" " 9: (3.00)" "10: (3.16)"

Inside the % and letter

No space between the % and letter means the default - strings as is - 6 decimal places for numbers

sprintf("%s", "supercalifragilisticexpialidocious")
[1] "supercalifragilisticexpialidocious"
sprintf("%f", 2.3)
[1] "2.300000"

Inside the % and letter

Place a number for the field width in characters, only pads with spaces; if things exceed width, they will will not be truncated.

sprintf("%5s", c("adamant", "list", "do"))
[1] "adamant" " list"   "   do"  
sprintf("%10f", c(2.3, 1231.459898001)) # default 6 decimal places
[1] "  2.300000"  "1231.459898"
sprintf("%2f", c(2.3, 1231.459898001))
[1] "2.300000"    "1231.459898"

Inside the % and letter

If f then use .<num> to specify the number of decimal places.

sprintf("%10.2f", c(2.3, 1231.459898001)) # default 6 decimal places
[1] "      2.30" "   1231.46"
sprintf("%2.2f", c(2.3, 1231.459898001))
[1] "2.30"    "1231.46"

Inside the % and letter

Use - to left justify within the width. Any order is ok.

sprintf("%-5s", c("adamant", "list", "do"))
[1] "adamant" "list "   "do   "  
sprintf("%10.2-f", c(2.3, 1231.459898001)) # default 6 decimal places
[1] "2.30      " "1231.46   "
sprintf("%-2.2f", c(2.3, 1231.459898001))
[1] "2.30"    "1231.46"

Inside the % and letter

Use + to always add signs to numbers

sprintf("%+10.2-f", c(-2.3, 1231.459898001)) # default 6 decimal places
[1] "-2.30     " "+1231.46  "
sprintf("%+.2f", c(2.3, -1231.459898001))
[1] "+2.30"    "-1231.46"
sprintf("%+.2e", c(2.3, -1231.459898001))
[1] "+2.30e+00" "-1.23e+03"

Inside the % and letter

Use 0 (zero) to zero pad. Note however this may not work on all platforms, but has on all that I have tried (Windows, Mac, Linux).

sprintf("%010.2f", c(-2.3, 1231.459898001)) # default 6 decimal places
[1] "-000002.30" "0001231.46"
sprintf("%05.2f", c(2.3, -1231.459898001))
[1] "02.30"    "-1231.46"
sprintf("%08d", 1:5)
[1] "00000001" "00000002" "00000003" "00000004" "00000005"

How to get a percent sign and getting fancier

Use %% to get a literal percent sign

sprintf("%d/%d (%.2f%%)", 2, 3, 2/3 * 100)
[1] "2/3 (66.67%)"
x <- 1:10
y <- x + 2
sprintf("%d/%d (%.0f%%)", x, y, x/y * 100)
 [1] "1/3 (33%)"   "2/4 (50%)"   "3/5 (60%)"   "4/6 (67%)"   "5/7 (71%)"  
 [6] "6/8 (75%)"   "7/9 (78%)"   "8/10 (80%)"  "9/11 (82%)"  "10/12 (83%)"

rprintf

But wait there’s more!

If you buy sprintf in the next 10 minutes, I’ll throw in rprintf for free!

rprintf - works just like sprintf BUT…

# install.packages("rprintf")
library(rprintf)
rprintf("%d/%d (%.0f%%)", x, y, x/y * 100)
 [1] "1/3 (33%)"   "2/4 (50%)"   "3/5 (60%)"   "4/6 (67%)"   "5/7 (71%)"  
 [6] "6/8 (75%)"   "7/9 (78%)"   "8/10 (80%)"  "9/11 (82%)"  "10/12 (83%)"

rprintf - gives you named formats

rprintf('$a, $b:.1f, $c:+.2f, $b, $a:.0f',a=1.56,b=2.34,c=3.78)
[1] "1.56, 2.3, +3.78, 2.34, 2"
data(esoph)
head(esoph)
  agegp     alcgp    tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day      0        40
2 25-34 0-39g/day    10-19      0        10
3 25-34 0-39g/day    20-29      0         6
4 25-34 0-39g/day      30+      0         5
5 25-34     40-79 0-9g/day      0        27
6 25-34     40-79    10-19      0         7
rprintf("$agegp, $ncontrols:02d", head(esoph))
[1] "25-34, 40" "25-34, 10" "25-34, 06" "25-34, 05" "25-34, 27" "25-34, 07"

rprintf - gives you a numbering mechanism

rprintf("{2:.1f}, {1:.2f}", 1.56, 2.34)
[1] "2.3, 1.56"

rprintf - mix and match

rprintf(c(a='%s:%d',b='$name:$age',c='{1}:{2}'),name='Ken',age=24)
       a        b        c 
"Ken:24" "Ken:24" "Ken:24" 

Putting it all together

Check out the Basic Table Example

Advanced example

Bam!!!

Check out the Advanced Table Example

For more information