This lesson is still being designed and assembled (Pre-Alpha version)

Reproducible notebooks for data analysis

Overview

Teaching: 60 min
Exercises: 120-150 min
Questions
  • Should I use a graphical user interface to analyse data or a code-based system?

  • What is literate programming and what is R Markdown?

  • How do I use R Markdown

Objectives
  • Understand the advantages of code-based data analysis

  • Able to create and adapt an R Markdown script

  • Adapt a YAML header

  • Use code chunks and choose the right chunk options

  • Practice R Markdown by answering simple questions

  • Bonus: add a table to an R Markdown script

1. Reproducibility of analysis code

Graphical user interface (GUI) vs. code

Statistical analysis software packages with graphical user interfaces such as SPSS (often) provide an easy solution to data analysis because

But it is often not clear anymore how exactly (and with which commands) those results were obtained and writing all steps down can be very tedious. Even if some kind of history of the executed commands is saved it might still be necessary to clean this up and keep only the relevant steps..

An analysis that is not reproducible can be an issue since

On the other hand doing a statistical analysis purely with code is

However, while it seems like a big hurdle modern programming languages designed for statistical computing, such as R, are usually pretty straightforward to learn and use, and they have a lot of advantages:

R projects

Another very useful concept that enhances R’s usefulness via Rstudio are R Projects. They allow to

Connected to the use of Project is the concept that you use relative file paths (e.g. for loading csv files). So instead of doing something like this: read.csv("/home/user/Documents/Uni/UnderstandingORS/Topic_5/data/example.csv") you write read.csv("data/example.csv"). This is easier to write, more flexible and less prone to errors because as long as you keep your files in the project together it will work. Imagine, for example, you want to move your script (and data) to "/home/user/backup/Uni/UnderstandingORS/Topic_5/data/example.csv".

Quiz on R projects

File path

Suppose your current working directory is ~/project and you want to specify the relative path to the file ~/project/data/data.csv . What are possible specifications?

  • data.csv
  • project/data.csv
  • project/data/data.csv
  • data/data.csv

Solution

F data.csv
F project/data.csv
F project/data/data.csv
T data/data.csv

2. Literate programming and R Markdown

R Markdown is a realisation of the literate programming concept mixing narrative text with analysis code which is then rendered into formatted text and analysis results (numbers, tables and graphics). The concept of literate programming goes back to Donald Knuth, see e.g. from the open-science-with-r carpentries course:

_ More generally, the mixture of code, documentation (conclusion, comments) and figures in a notebook is part of the so-called “literate programming” paradigm (Donald Knuth, 1984). Your code and logical steps should be understandable for human beings. In particular these four tips are related to this paradigm: _

  • Do not write your program only for R but think also of code readers (that includes you).
  • Focus on the logic of your workflow. Describe it in plain language (e.g. English) to explain the steps and why you are doing them.
  • Explain the “why” and not the “how”.
  • Create a report from your analysis using a R Markdown notebook to wrap together the data + code + text.

Parts of an R Markdown (.Rmd) file

Create a new Rmd

Execute the following steps on your computer while you read:

In Rstudio

A new .Rmd file should open with a short tutorial.

To render, or knit, the file to html press Knit. The first time you run the script you will have to specify the name under which to save it. Afterwards the script is always saved before rendering.

YAML header

The YAML header of an R Markdown files contains the meta data that influence the final document in different ways. See the short summary from the open-science-with-r carpentries course:

_ The header of your R Markdown document will allow you to personalize the related report from your R Markdown document. The header follows the YAML syntax (“YAML Ain’t Markup Language”) which usually follows a key:value syntax. _

For example, title: "titlename", where title is the key and "titlename" is the value.

The header itself starts with --- and stops with ---. For example:

---
title: "titlename"
output: html_output
---

More information about the YAML header can be found in the R Markdown cheat sheet.

Code chunks

The narrative text of a report is written in the simple Markdown syntax. Code chunks are specific to R Markdown. They contain R code that is to be executed when rendering the chunk or the entire file, i.e. including the data analysis.

To start a chunk write (backticks) ` {r} `, then place your R code and end the chunk with ` . The r in ```{r} ` indicates that the programming language used in this chunk is R. Other options include python or bash although we will not need these here.

Within RStudio a new code chunk can be included by either clicking on Insert a new code chunk in the toolbar or using a keyboard shortcut (Ctrl+Alt+I on Windows and Option+Command+I on Mac).

Each chunk can be run separately. To run the code in an individual chunk click on the green arrow (Run Current Chunk) on the right side of the chunk. Alternatively use the keyboard shortcut Ctrl+Alt+T (Windows) or Option+Command+T (Mac) to run the current chunk (i.e. where your cursor is located). This runs only the code in the specific chunk but does not render the entire file.

For more options see the cheat sheet in R studio: Help > Cheat Sheets > R Markdown Cheat Sheet or the link above.

The behavior of code chunks can be changed by setting chunk options. This is done in the opening of the chunk, e.g. ` ```{r, echo=FALSE}`, which hides the code of this chunk (while still evaluating it). For more options see the R Markdown cheat sheet or the R Markdown Cookbook.

  Note: inline R code, i.e. code directly within the narrative text, can be run with `r ` , placing the code after r . This is for example useful when you mention a sample size in the text and want it to update directly from the data set you read in.

Quiz on literate programming and R Markdown

Literate programming

Which of the following statements about literate programming are true?

  • Literate programming combines code and text.
  • Literate programming makes your code run more efficient.
  • Literate programming makes your analysis easier to understand.
  • Code should only be shown if necessary.
  • Plots should not be included.

Solution

T Literate programming combines code and text.
F Literate programming makes your code run more efficient.
T Literate programming makes your analysis easier to understand.
F Code should only be shown if necessary.
F Plots should not be included.

YAML header: author

How do you specify the author in an R Markdown document?

  • author: “name”
  • Author: “name”
  • Author = “name”
  • Author: ‘name’
  • author: name

Solution

T author: “name”
F Author: “name”
F Author = “name”
F Author: ‘name’
T author: name

YAML header: date

How do you set the date?

  • Date: 01/01/2021
  • Date = 01/01/2021
  • datum: 01/01/2021
  • date: 01/01/2021

Solution

F Date: 01/01/2021
F Date = 01/01/2021
F datum: 01/01/2021
T date: 01/01/2021

Chunk options

How do you prevent code from evaluation in a chunk?

  • evaluate=FALSE
  • eval=FALSE
  • noeval=TRUE
  • hinder=TRUE
  • hind=TRUE
  • interpret=FALSE
  • inpr=FALSE

Solution

F evaluate=FALSE
T eval=FALSE
F noeval=TRUE
F hinder=TRUE
F hind=TRUE
F interpret=FALSE
F inpr=FALSE

Chunk options: figure height

How do you adjust the figure height?

  • figure_height=100
  • figureheight=100
  • heightfigure=100
  • fig_height=100
  • fig.height=100
  • height.fig=100

Solution

F figure_height=100
F figureheight=100
F heightfigure=100
F fig_height=100
T fig.height=100
F height.fig=100

R Markdown Practice

Modify the template found here by performing the following steps:

  • add author and date to YAML header
  • rename the first chunk to print_chunk_nr_1
  • set the chunk options of chunk 2 to not show the code
  • set the chunk options of chunk 3 to not evaluate the code but show it
  • set the chunk options of chunk 4 to not show the warning
  • complete the sentence at the end with appropriate information calculated in chunk 5

After these steps answer the questions:

  1. The percentage of children who survived the Titanic accident was (Note: round to one decimal digit)
  2. The percentage of female survivors was ___ times higher as the percentage of male survivors. (Note: round to two decimal digits)

Solution

RMarkdown solution

  1. 52.3
  2. 3.45

 

Episode challenge

The goal of this challenge is to create a fully reproducible analysis within an R Markdown script which is easy to understand and read. For that, describe what you do and do not forget nice formatting. For example try different Markdown syntax (Headers, …), figure captions (Hint: check the chunk options for this), a meaningful YAML header, etc.

Analysis of the palmer penguins data

Create a new R Markdown document and write the code for each of the below questions in a separate paragraph/chunk and describe the result in a complete sentence directly in the R Markdown document. We will use the penguins dataset from the package palmerpenguins, available as the data set penguins after installing the package. To get an overview of the data load the penguins dataset and explore the following questions:

 

Question 1

Find the the source of the penguins dataset together with the url to the data repository. Hint: run ?penguins

Solution 1

library(palmerpenguins)

Can I put ?penguins in the chunk?

The source is: Adélie penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. doi: 10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f

Gentoo penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Gentoo penguin (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. doi: 10.6073/pasta/7fca67fb28d56ee2ffa3d9370ebda689

Chinstrap penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguin (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 6. Environmental Data Initiative. doi: 10.6073/pasta/c14dfcfada8ea13a17536e73eb6fbe9e < Originally published in: Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

 

Question 2

Create a Markdown table describing the penguins data: for each numeric column in the dataset create a row in the table which should consist of the following columns: Column_Name, Mean, Variance
Hint: Checkout the function knitr::kable and the chunk option results='asis'.

Solution 2

numericcols <- sapply(colnames(penguins), function(x) is.numeric(penguins[[x]]))
df <- data.frame(Column_Name = names(numericcols)[numericcols],
                 Mean = signif(apply((na.omit(penguins[numericcols])), 2, mean), 4),
                 Variance = signif(apply((na.omit(penguins[numericcols])), 2, var), 4),
                 row.names = NULL
                 )
knitr::kable(df)
Column_Name Mean Variance
bill_length_mm 43.92 2.981e+01
bill_depth_mm 17.15 3.900e+00
flipper_length_mm 200.90 1.977e+02
body_mass_g 4202.00 6.431e+05
year 2008.00 6.678e-01

 

Question 3

How many rows does the penguins dataset have?

Solution 3

result_question_3 <- dim(penguins)[1]

The data set has 344 rows.

 

Question 4

What is the first year of records in the data set?

Solution 4

result_question_4 <- min(penguins$year)

The first year of records is 2007.

 

Question 5

What is the total number of Adelie penguins?

Solution 5

result_question_5 <- sum(penguins$species == "Adelie")

The total number of Adelie penguins is 152

 

Question 6

What is the total number of missing values (NA)?

Solution 6

result_question_6 <- sum(is.na(penguins))

The total number of missing values (NA’s) is 19.

 

Question 7

What is the total number of rows with no missing values?

Solution 7

result_question_7 <- sum(apply(penguins, 1, function(x) !any(is.na(x))))

The number of complete rows (rows with no missing values, i.e. NA’s) is rresult_question_7.

 

Question 8

On which islands were the Gentoo penguins found?

Solution 8

result_question_8 <- unique(penguins$island[penguins$species == "Gentoo"]);

The name of islands where the Gentoo penguins were found is Biscoe.

 

Question 9

What is the proportion of Adelie penguins on Dream island (compared to all penguins on Dream island)?

Solution 9

result_question_9 <- sum(penguins$species == "Adelie" & 
                           penguins$island == "Dream") / 
                      sum(penguins$island == "Dream")

The proportion of Adelie penguins on Dream island is 0.4516129.

 

Question 10

What is the 93% quantile of the bill lengths in mm?

Solution 10

result_question_10 <- quantile(na.omit(penguins$bill_length_mm), 0.93)

The 93 % quantile of bill_length_mm is 51.3.

 

Question 11

What is the absolute mean difference of bill depth in mm between female and male penguins?

Solution 11

result_question_11 <- abs(coef(lm(bill_depth_mm ~ sex, penguins))[2])

The absolute mean difference of bill_depth_mm between female and male is 1.4656169.

 

Question 12

What is the 95% confidence interval of the slope of the linear regression with intercept of bill depth regressed on sex? Result will be a vector of two elements, e.g. c('lower_limit', 'upper_limit').

Solution 12

result_question_12 <- confint(lm(bill_depth_mm ~ sex, penguins), "sexmale" )

The 95% confidence interval of the slope of the linear regression between bill_depth_mm and sex is 1.0710254, 1.8602083

 

Question 13

What is the proportion of Chinstrap penguins with flipper length in mm smaller than 205 and bill length in mm larger than 45 compared to all penguins with flipper length in mm smaller than 205 and bill length in mm larger than 45?

Solution 13

chins <- na.omit(penguins$species[penguins$flipper_length_mm < 205 & 
                                    penguins$bill_length_mm > 45])
result_question_13 <- sum(chins == "Chinstrap") / length(chins)

The proportion of Chinstrap penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 compared to all penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 is 0.9310345.

 

Question 14

What is the proportion of Chinstrap penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 compared to all Chinstrap penguins?

Solution 14

result_question_14 <- sum(chins == "Chinstrap") / sum(penguins$species == "Chinstrap")

The proportion of Chinstrap penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 compared to all Chinstrap penguins is 0.7941176.

 

Bonus challenge

R Markdown tables

For the following challenge we will use the package kableExtra which extends the base capabilities of knitr::kable to create tables. From the package vignette:

_ The goal of kableExtra is to help you build common complex tables and manipulate table styles. It imports the pipe %>% symbol from magrittr and verbalize all the functions, so basically you can add “layers” to a kable output in a way that is similar with ggplot2 and plotly._

For users who are not very familiar with the pipe operator %>% in R: it is the R version of the fluent interface. The idea is to pass the result along the chain for a more literal coding experience. Basically, when we say A %>% B, technically it means sending the results of A to B as B’s first argument.

Simple tables can be generated as follows:

library(kableExtra)
head(penguins) %>% # the dataset, '%>%' parses the output of this command as the input in the next >command 
 kbl() %>% # the kableExtra equivalent of knitr::kable, base table
 kable_classic() # add theme to table
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

For all options check the documentation or the vignette.

 

Task 1 Create the following table:

bill length [mm]
bill depth [mm]
species 2007 2008 2009 2007 2008 2009
Adelie 38.8 38.6 39.0 18.8 18.2 18.1
Chinstrap 48.7 48.7 49.1 18.5 18.4 18.3
Gentoo 47.0 46.9 48.5 14.7 14.9 15.3

Hint: checkout the different styling functions, e.g. kable_classic.
Hint: For multiple column names use add_header_above
Hint: Use the following code to get started.

Solution

df_sum <- penguins %>% 
  dplyr::select(-sex, -island, -flipper_length_mm, -body_mass_g) %>% 
  dplyr::group_by(species, year) %>% 
  dplyr::summarise(dplyr::across(.fns = function(x) signif(mean(na.omit(x)), 3))) %>% 
  tidyr::pivot_wider(names_from = c(year), values_from = c(bill_length_mm, bill_depth_mm)) 

df_sum %>%
  kbl(col.names = c("species", rep(c("2007", "2008", "2009"), 2)))%>%
  kable_classic() %>%
  add_header_above(c(" " = 1, "bill length [mm]" = 3, "bill depth [mm]" = 3)) %>%
  kable_styling(bootstrap_options = c("hover")) %>% 
  column_spec (c(1, 4), border_right = T) 

 

Task 2 Create the following table wich includes small graphs:

bill length [mm]
bill depth [mm]
species mean boxplot histogram mean boxplot histogram
Adelie 38.8 18.3
Chinstrap 48.8 18.4
Gentoo 47.5 15.0

Hint: Use column_spec for altering specific columns.

Solution

df_sum <- penguins %>% 
  dplyr::select(-island, -sex, -year, -body_mass_g, -flipper_length_mm) %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarise(dplyr::across(.cols = !contains("species"),
                                 .fns = function(x) 
                                   signif(mean(na.omit(x)), 3))) %>% 
  dplyr::mutate(bill_length_boxplot = "", bill_length_hist = "",
                bill_depth_boxplot = "", bill_depth_hist = "")

dfsum_list <- split(penguins$bill_length_mm, penguins$species)
dfsum_list2 <- split(penguins$bill_depth_mm, penguins$species)
df_sum %>%
  dplyr::select(species, 
                dplyr::starts_with("bill_length"), 
                dplyr::starts_with("bill_depth")) %>% 
  kbl(col.names = c("species", rep(c("mean", "boxplot", "histogram"), 2))) %>%
  kable_paper() %>%
  column_spec(1, border_right = TRUE) %>%
  column_spec(3, image = spec_boxplot(dfsum_list)) %>%
  column_spec(4, image = spec_hist(dfsum_list), border_right = T) %>%
  column_spec(6, image = spec_boxplot(dfsum_list2)) %>% 
  column_spec(7, image = spec_hist(dfsum_list2)) %>% 
  add_header_above(c(" " = 1, "bill length [mm]" = 3, "bill depth [mm]" = 3),
                   border_right = TRUE, border_left = TRUE)

Key Points

  • Code-based analysis is better for reproducibility.

  • Combining narrative and code-based results is even more profitable.

  • Code chunks in R Markdown provide an easy solution