Reproducible notebooks for data analysis

Overview

Teaching: 60 min
Exercises: 120-150 min

Questions

Should I use a graphical user interface to analyse data or a code-based system?

What is literate programming and what is R Markdown?

How do I use R Markdown

Objectives

Understand the advantages of code-based data analysis

Able to create and adapt an R Markdown script

Adapt a YAML header

Use code chunks and choose the right chunk options

Practice R Markdown by answering simple questions

Bonus: add a table to an R Markdown script

1. Reproducibility of analysis code

Graphical user interface (GUI) vs. code

Statistical analysis software packages with graphical user interfaces such as SPSS (often) provide an easy solution to data analysis because

one can get started with an analysis without much overhead in having to learn to code
it is a convenient way of exploring the data
figures and the statistical analysis are done at the end of the exploration

But it is often not clear anymore how exactly (and with which commands) those results were obtained and writing all steps down can be very tedious. Even if some kind of history of the executed commands is saved it might still be necessary to clean this up and keep only the relevant steps..

An analysis that is not reproducible can be an issue since

it is easy to unintentionally introduce errors
it is hard to get rid of errors
if you need to change some step in the middle of the analysis you have to restart from scratch
you can not provide a workflow that others (or yourself in the future) can follow easily to reproduce the results

On the other hand doing a statistical analysis purely with code is

a lot of effort since all steps have to be written down explicitly
needs some initial time investment in case of unfamiliarity

However, while it seems like a big hurdle modern programming languages designed for statistical computing, such as R, are usually pretty straightforward to learn and use, and they have a lot of advantages:

Flexible: more combinations, functions and extensions are available
Extensible (packages): in R, for example, thousands of packages are available.
Reproducible (easily rerun from scratch)
Mixing code and text increases readability
Much more complete account of analysis than a methods section
Start new projects from own or otherwise available code for similar analyses

R projects

Another very useful concept that enhances R’s usefulness via Rstudio are R Projects. They allow to

easily implement the folder structure that we heard about in the “First steps towards more reproducibility” episode (Organisation and software)
communicate between these directories because the correct working directory is set automatically
simplify collaborations since everybody can put their projects at desired location in their folder tree
quickly jump between different projects by remembering which files you had open and reopening them again
have more than one project open without mingling assigned variables etc.

Connected to the use of Project is the concept that you use relative file paths (e.g. for loading csv files). So instead of doing something like this: read.csv("/home/user/Documents/Uni/UnderstandingORS/Topic_5/data/example.csv") you write read.csv("data/example.csv"). This is easier to write, more flexible and less prone to errors because as long as you keep your files in the project together it will work. Imagine, for example, you want to move your script (and data) to "/home/user/backup/Uni/UnderstandingORS/Topic_5/data/example.csv".

Quiz on R projects

File path

Suppose your current working directory is ~/project and you want to specify the relative path to the file ~/project/data/data.csv . What are possible specifications?

data.csv

project/data.csv

project/data/data.csv

data/data.csv

Solution

F data.csv
F project/data.csv
F project/data/data.csv
T data/data.csv

2. Literate programming and R Markdown

R Markdown is a realisation of the literate programming concept mixing narrative text with analysis code which is then rendered into formatted text and analysis results (numbers, tables and graphics). The concept of literate programming goes back to Donald Knuth, see e.g. from the open-science-with-r carpentries course:

_ More generally, the mixture of code, documentation (conclusion, comments) and figures in a notebook is part of the so-called “literate programming” paradigm (Donald Knuth, 1984). Your code and logical steps should be understandable for human beings. In particular these four tips are related to this paradigm: _

Do not write your program only for R but think also of code readers (that includes you).

Focus on the logic of your workflow. Describe it in plain language (e.g. English) to explain the steps and why you are doing them.

Explain the “why” and not the “how”.

Create a report from your analysis using a R Markdown notebook to wrap together the data + code + text.

Parts of an R Markdown (.Rmd) file

Create a new Rmd

Execute the following steps on your computer while you read:

In Rstudio

go to File > New File > R Markdown
enter Title and Author
select html as output
confirm

A new .Rmd file should open with a short tutorial.

To render, or knit, the file to html press Knit. The first time you run the script you will have to specify the name under which to save it. Afterwards the script is always saved before rendering.

YAML header

The YAML header of an R Markdown files contains the meta data that influence the final document in different ways. See the short summary from the open-science-with-r carpentries course:

_ The header of your R Markdown document will allow you to personalize the related report from your R Markdown document. The header follows the YAML syntax (“YAML Ain’t Markup Language”) which usually follows a key:value syntax. _

For example, title: "titlename", where title is the key and "titlename" is the value.

The header itself starts with --- and stops with ---. For example:

---
title: "titlename"
output: html_output
---

More information about the YAML header can be found in the R Markdown cheat sheet.

Code chunks

The narrative text of a report is written in the simple Markdown syntax. Code chunks are specific to R Markdown. They contain R code that is to be executed when rendering the chunk or the entire file, i.e. including the data analysis.

To start a chunk write (backticks) ` {r} `, then place your R code and end the chunk with ` . The r in ```{r} ` indicates that the programming language used in this chunk is R. Other options include python or bash although we will not need these here.

Within RStudio a new code chunk can be included by either clicking on Insert a new code chunk in the toolbar or using a keyboard shortcut (Ctrl+Alt+I on Windows and Option+Command+I on Mac).

Each chunk can be run separately. To run the code in an individual chunk click on the green arrow (Run Current Chunk) on the right side of the chunk. Alternatively use the keyboard shortcut Ctrl+Alt+T (Windows) or Option+Command+T (Mac) to run the current chunk (i.e. where your cursor is located). This runs only the code in the specific chunk but does not render the entire file.

For more options see the cheat sheet in R studio: Help > Cheat Sheets > R Markdown Cheat Sheet or the link above.

The behavior of code chunks can be changed by setting chunk options. This is done in the opening of the chunk, e.g. ` ```{r, echo=FALSE}`, which hides the code of this chunk (while still evaluating it). For more options see the R Markdown cheat sheet or the R Markdown Cookbook.

Note: inline R code, i.e. code directly within the narrative text, can be run with `r ` , placing the code after r . This is for example useful when you mention a sample size in the text and want it to update directly from the data set you read in.

Quiz on literate programming and R Markdown

Literate programming

Which of the following statements about literate programming are true?

Literate programming combines code and text.

Literate programming makes your code run more efficient.

Literate programming makes your analysis easier to understand.

Code should only be shown if necessary.

Plots should not be included.

Solution

T Literate programming combines code and text.
F Literate programming makes your code run more efficient.
T Literate programming makes your analysis easier to understand.
F Code should only be shown if necessary.
F Plots should not be included.

YAML header: author

How do you specify the author in an R Markdown document?

author: “name”

Author: “name”

Author = “name”

Author: ‘name’

author: name

Solution

T author: “name”
F Author: “name”
F Author = “name”
F Author: ‘name’
T author: name

YAML header: date

How do you set the date?

Date: 01/01/2021

Date = 01/01/2021

datum: 01/01/2021

date: 01/01/2021

Solution

F Date: 01/01/2021
F Date = 01/01/2021
F datum: 01/01/2021
T date: 01/01/2021

Chunk options

How do you prevent code from evaluation in a chunk?

evaluate=FALSE

eval=FALSE

noeval=TRUE

hinder=TRUE

hind=TRUE

interpret=FALSE

inpr=FALSE

Solution

F evaluate=FALSE
T eval=FALSE
F noeval=TRUE
F hinder=TRUE
F hind=TRUE
F interpret=FALSE
F inpr=FALSE

Chunk options: figure height

How do you adjust the figure height?

figure_height=100

figureheight=100

heightfigure=100

fig_height=100

fig.height=100

height.fig=100

Solution

F figure_height=100
F figureheight=100
F heightfigure=100
F fig_height=100
T fig.height=100
F height.fig=100

R Markdown Practice

Modify the template found here by performing the following steps:

add author and date to YAML header

rename the first chunk to print_chunk_nr_1

set the chunk options of chunk 2 to not show the code

set the chunk options of chunk 3 to not evaluate the code but show it

set the chunk options of chunk 4 to not show the warning

complete the sentence at the end with appropriate information calculated in chunk 5

After these steps answer the questions:

The percentage of children who survived the Titanic accident was (Note: round to one decimal digit)

The percentage of female survivors was ___ times higher as the percentage of male survivors. (Note: round to two decimal digits)

Solution

RMarkdown solution

52.3

3.45

Episode challenge

The goal of this challenge is to create a fully reproducible analysis within an R Markdown script which is easy to understand and read. For that, describe what you do and do not forget nice formatting. For example try different Markdown syntax (Headers, …), figure captions (Hint: check the chunk options for this), a meaningful YAML header, etc.

Analysis of the palmer penguins data

Create a new R Markdown document and write the code for each of the below questions in a separate paragraph/chunk and describe the result in a complete sentence directly in the R Markdown document. We will use the penguins dataset from the package palmerpenguins, available as the data set penguins after installing the package. To get an overview of the data load the penguins dataset and explore the following questions:

Question 1

Find the the source of the penguins dataset together with the url to the data repository. Hint: run ?penguins
Solution 1
library(palmerpenguins)
Can I put ?penguins in the chunk?

The source is: Adélie penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. doi: 10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f

Gentoo penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Gentoo penguin (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. doi: 10.6073/pasta/7fca67fb28d56ee2ffa3d9370ebda689

Chinstrap penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguin (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 6. Environmental Data Initiative. doi: 10.6073/pasta/c14dfcfada8ea13a17536e73eb6fbe9e < Originally published in: Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081
Question 2

Create a Markdown table describing the penguins data: for each numeric column in the dataset create a row in the table which should consist of the following columns: Column_Name, Mean, Variance
Hint: Checkout the function knitr::kable and the chunk option results='asis'.
Solution 2
numericcols <- sapply(colnames(penguins), function(x) is.numeric(penguins[[x]]))
df <- data.frame(Column_Name = names(numericcols)[numericcols],
                 Mean = signif(apply((na.omit(penguins[numericcols])), 2, mean), 4),
                 Variance = signif(apply((na.omit(penguins[numericcols])), 2, var), 4),
                 row.names = NULL
                 )
knitr::kable(df)
Column_Name Mean Variance

bill_length_mm 43.92 2.981e+01

bill_depth_mm 17.15 3.900e+00

flipper_length_mm 200.90 1.977e+02

body_mass_g 4202.00 6.431e+05

year 2008.00 6.678e-01
Question 3

How many rows does the penguins dataset have?
Solution 3
result_question_3 <- dim(penguins)[1]
The data set has 344 rows.
Question 4

What is the first year of records in the data set?
Solution 4
result_question_4 <- min(penguins$year)
The first year of records is 2007.
Question 5

What is the total number of Adelie penguins?
Solution 5
result_question_5 <- sum(penguins$species == "Adelie")
The total number of Adelie penguins is 152
Question 6

What is the total number of missing values (NA)?
Solution 6
result_question_6 <- sum(is.na(penguins))
The total number of missing values (NA’s) is 19.
Question 7

What is the total number of rows with no missing values?
Solution 7
result_question_7 <- sum(apply(penguins, 1, function(x) !any(is.na(x))))
The number of complete rows (rows with no missing values, i.e. NA’s) is rresult_question_7.
Question 8

On which islands were the Gentoo penguins found?
Solution 8
result_question_8 <- unique(penguins$island[penguins$species == "Gentoo"]);
The name of islands where the Gentoo penguins were found is Biscoe.
Question 9

What is the proportion of Adelie penguins on Dream island (compared to all penguins on Dream island)?
Solution 9
result_question_9 <- sum(penguins$species == "Adelie" & 
                           penguins$island == "Dream") / 
                      sum(penguins$island == "Dream")
The proportion of Adelie penguins on Dream island is 0.4516129.
Question 10

What is the 93% quantile of the bill lengths in mm?
Solution 10
result_question_10 <- quantile(na.omit(penguins$bill_length_mm), 0.93)
The 93 % quantile of bill_length_mm is 51.3.
Question 11

What is the absolute mean difference of bill depth in mm between female and male penguins?
Solution 11
result_question_11 <- abs(coef(lm(bill_depth_mm ~ sex, penguins))[2])
The absolute mean difference of bill_depth_mm between female and male is 1.4656169.
Question 12

What is the 95% confidence interval of the slope of the linear regression with intercept of bill depth regressed on sex? Result will be a vector of two elements, e.g. c('lower_limit', 'upper_limit').
Solution 12
result_question_12 <- confint(lm(bill_depth_mm ~ sex, penguins), "sexmale" )
The 95% confidence interval of the slope of the linear regression between bill_depth_mm and sex is 1.0710254, 1.8602083
Question 13

What is the proportion of Chinstrap penguins with flipper length in mm smaller than 205 and bill length in mm larger than 45 compared to all penguins with flipper length in mm smaller than 205 and bill length in mm larger than 45?
Solution 13
chins <- na.omit(penguins$species[penguins$flipper_length_mm < 205 & 
                                    penguins$bill_length_mm > 45])
result_question_13 <- sum(chins == "Chinstrap") / length(chins)
The proportion of Chinstrap penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 compared to all penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 is 0.9310345.
Question 14

What is the proportion of Chinstrap penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 compared to all Chinstrap penguins?
Solution 14
result_question_14 <- sum(chins == "Chinstrap") / sum(penguins$species == "Chinstrap")
The proportion of Chinstrap penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 compared to all Chinstrap penguins is 0.7941176.

Column_Name	Mean	Variance
bill_length_mm	43.92	2.981e+01
bill_depth_mm	17.15	3.900e+00
flipper_length_mm	200.90	1.977e+02
body_mass_g	4202.00	6.431e+05
year	2008.00	6.678e-01

Bonus challenge

R Markdown tables

For the following challenge we will use the package kableExtra which extends the base capabilities of knitr::kable to create tables. From the package vignette:

_ The goal of kableExtra is to help you build common complex tables and manipulate table styles. It imports the pipe %>% symbol from magrittr and verbalize all the functions, so basically you can add “layers” to a kable output in a way that is similar with ggplot2 and plotly._

For users who are not very familiar with the pipe operator %>% in R: it is the R version of the fluent interface. The idea is to pass the result along the chain for a more literal coding experience. Basically, when we say A %>% B, technically it means sending the results of A to B as B’s first argument.

Simple tables can be generated as follows:
library(kableExtra)
head(penguins) %>% # the dataset, '%>%' parses the output of this command as the input in the next >command 
 kbl() %>% # the kableExtra equivalent of knitr::kable, base table
 kable_classic() # add theme to table
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year

Adelie Torgersen 39.1 18.7 181 3750 male 2007

Adelie Torgersen 39.5 17.4 186 3800 female 2007

Adelie Torgersen 40.3 18.0 195 3250 female 2007

Adelie Torgersen NA NA NA NA NA 2007

Adelie Torgersen 36.7 19.3 193 3450 female 2007

Adelie Torgersen 39.3 20.6 190 3650 male 2007

For all options check the documentation or the vignette.

Task 1 Create the following table:

bill length [mm]

bill depth [mm]

species 2007 2008 2009 2007 2008 2009

Adelie 38.8 38.6 39.0 18.8 18.2 18.1

Chinstrap 48.7 48.7 49.1 18.5 18.4 18.3

Gentoo 47.0 46.9 48.5 14.7 14.9 15.3

Hint: checkout the different styling functions, e.g. kable_classic.
Hint: For multiple column names use add_header_above
Hint: Use the following code to get started.
Solution
df_sum <- penguins %>% 
  dplyr::select(-sex, -island, -flipper_length_mm, -body_mass_g) %>% 
  dplyr::group_by(species, year) %>% 
  dplyr::summarise(dplyr::across(.fns = function(x) signif(mean(na.omit(x)), 3))) %>% 
  tidyr::pivot_wider(names_from = c(year), values_from = c(bill_length_mm, bill_depth_mm)) 

df_sum %>%
  kbl(col.names = c("species", rep(c("2007", "2008", "2009"), 2)))%>%
  kable_classic() %>%
  add_header_above(c(" " = 1, "bill length [mm]" = 3, "bill depth [mm]" = 3)) %>%
  kable_styling(bootstrap_options = c("hover")) %>% 
  column_spec (c(1, 4), border_right = T) 
Task 2 Create the following table wich includes small graphs:

bill length [mm]

bill depth [mm]

species mean boxplot histogram mean boxplot histogram

Adelie 38.8 18.3

Chinstrap 48.8 18.4

Gentoo 47.5 15.0

Hint: Use column_spec for altering specific columns.
Solution
df_sum <- penguins %>% 
  dplyr::select(-island, -sex, -year, -body_mass_g, -flipper_length_mm) %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarise(dplyr::across(.cols = !contains("species"),
                                 .fns = function(x) 
                                   signif(mean(na.omit(x)), 3))) %>% 
  dplyr::mutate(bill_length_boxplot = "", bill_length_hist = "",
                bill_depth_boxplot = "", bill_depth_hist = "")

dfsum_list <- split(penguins$bill_length_mm, penguins$species)
dfsum_list2 <- split(penguins$bill_depth_mm, penguins$species)
df_sum %>%
  dplyr::select(species, 
                dplyr::starts_with("bill_length"), 
                dplyr::starts_with("bill_depth")) %>% 
  kbl(col.names = c("species", rep(c("mean", "boxplot", "histogram"), 2))) %>%
  kable_paper() %>%
  column_spec(1, border_right = TRUE) %>%
  column_spec(3, image = spec_boxplot(dfsum_list)) %>%
  column_spec(4, image = spec_hist(dfsum_list), border_right = T) %>%
  column_spec(6, image = spec_boxplot(dfsum_list2)) %>% 
  column_spec(7, image = spec_hist(dfsum_list2)) %>% 
  add_header_above(c(" " = 1, "bill length [mm]" = 3, "bill depth [mm]" = 3),
                   border_right = TRUE, border_left = TRUE)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007

	bill length [mm]	bill depth [mm]
Adelie	38.8	38.6	39.0	18.8	18.2	18.1
Chinstrap	48.7	48.7	49.1	18.5	18.4	18.3
Gentoo	47.0	46.9	48.5	14.7	14.9	15.3

Key Points

Code-based analysis is better for reproducibility.

Combining narrative and code-based results is even more profitable.

Code chunks in R Markdown provide an easy solution

	bill length [mm]			bill depth [mm]
species	2007	2008	2009	2007	2008	2009
Adelie	38.8	38.6	39.0	18.8	18.2	18.1
Chinstrap	48.7	48.7	49.1	18.5	18.4	18.3
Gentoo	47.0	46.9	48.5	14.7	14.9	15.3

	bill length [mm]			bill depth [mm]
species	mean	boxplot	histogram	mean	boxplot	histogram
Adelie	38.8			18.3
Chinstrap	48.8			18.4
Gentoo	47.5			15.0

previous episode

Understanding Open and Reproducible Science

next episode

Reproducible notebooks for data analysis

Overview

1. Reproducibility of analysis code

Graphical user interface (GUI) vs. code

R projects

Quiz on R projects

File path

Solution

2. Literate programming and R Markdown

Parts of an R Markdown (.Rmd) file

Create a new Rmd

YAML header

Code chunks

Quiz on literate programming and R Markdown

Literate programming

Solution

YAML header: author

Solution

YAML header: date

Solution

Chunk options

Solution

Chunk options: figure height

Solution

R Markdown Practice

Solution

Episode challenge

Analysis of the palmer penguins data

Question 1

Solution 1

Question 2

Solution 2

Question 3

Solution 3

Question 4

Solution 4

Question 5

Solution 5

Question 6

Solution 6

Question 7

Solution 7

Question 8

Solution 8

Question 9

Solution 9

Question 10

Solution 10

Question 11

Solution 11

Question 12

Solution 12

Question 13

Solution 13

Question 14

Solution 14

Bonus challenge

R Markdown tables

Task 1 Create the following table:

Solution

Task 2 Create the following table wich includes small graphs:

Solution

Key Points

previous episode

next episode