Reproducible notebooks for data analysis
Overview
Teaching: 60 min
Exercises: 120-150 minQuestions
Should I use a graphical user interface to analyse data or a code-based system?
What is literate programming and what is R Markdown?
How do I use R Markdown
Objectives
Understand the advantages of code-based data analysis
Able to create and adapt an R Markdown script
Adapt a YAML header
Use code chunks and choose the right chunk options
Practice R Markdown by answering simple questions
Bonus: add a table to an R Markdown script
1. Reproducibility of analysis code
Graphical user interface (GUI) vs. code
Statistical analysis software packages with graphical user interfaces such as SPSS (often) provide an easy solution to data analysis because
- one can get started with an analysis without much overhead in having to learn to code
- it is a convenient way of exploring the data
- figures and the statistical analysis are done at the end of the exploration
But it is often not clear anymore how exactly (and with which commands) those results were obtained and writing all steps down can be very tedious. Even if some kind of history of the executed commands is saved it might still be necessary to clean this up and keep only the relevant steps..
An analysis that is not reproducible can be an issue since
- it is easy to unintentionally introduce errors
- it is hard to get rid of errors
- if you need to change some step in the middle of the analysis you have to restart from scratch
- you can not provide a workflow that others (or yourself in the future) can follow easily to reproduce the results
On the other hand doing a statistical analysis purely with code is
- a lot of effort since all steps have to be written down explicitly
- needs some initial time investment in case of unfamiliarity
However, while it seems like a big hurdle modern programming languages designed for statistical computing, such as R, are usually pretty straightforward to learn and use, and they have a lot of advantages:
- Flexible: more combinations, functions and extensions are available
- Extensible (packages): in R, for example, thousands of packages are available.
- Reproducible (easily rerun from scratch)
- Mixing code and text increases readability
- Much more complete account of analysis than a methods section
- Start new projects from own or otherwise available code for similar analyses
R projects
Another very useful concept that enhances R’s usefulness via Rstudio are R Projects. They allow to
-
easily implement the folder structure that we heard about in the “First steps towards more reproducibility” episode (Organisation and software)
-
communicate between these directories because the correct working directory is set automatically
-
simplify collaborations since everybody can put their projects at desired location in their folder tree
-
quickly jump between different projects by remembering which files you had open and reopening them again
-
have more than one project open without mingling assigned variables etc.
Connected to the use of Project is the concept that you use relative file paths (e.g. for loading csv files). So instead of doing something like this: read.csv("/home/user/Documents/Uni/UnderstandingORS/Topic_5/data/example.csv")
you write read.csv("data/example.csv")
. This is easier to write, more flexible and less prone to errors because as long as you keep your files in the project together it will work. Imagine, for example, you want to move your script (and data) to "/home/user/backup/Uni/UnderstandingORS/Topic_5/data/example.csv"
.
Quiz on R projects
File path
Suppose your current working directory is ~/project and you want to specify the relative path to the file ~/project/data/data.csv . What are possible specifications?
- data.csv
- project/data.csv
- project/data/data.csv
- data/data.csv
Solution
F data.csv
F project/data.csv
F project/data/data.csv
T data/data.csv
2. Literate programming and R Markdown
R Markdown is a realisation of the literate programming concept mixing narrative text with analysis code which is then rendered into formatted text and analysis results (numbers, tables and graphics). The concept of literate programming goes back to Donald Knuth, see e.g. from the open-science-with-r carpentries course:
_ More generally, the mixture of code, documentation (conclusion, comments) and figures in a notebook is part of the so-called “literate programming” paradigm (Donald Knuth, 1984). Your code and logical steps should be understandable for human beings. In particular these four tips are related to this paradigm: _
- Do not write your program only for R but think also of code readers (that includes you).
- Focus on the logic of your workflow. Describe it in plain language (e.g. English) to explain the steps and why you are doing them.
- Explain the “why” and not the “how”.
- Create a report from your analysis using a R Markdown notebook to wrap together the data + code + text.
Parts of an R Markdown (.Rmd) file
Create a new Rmd
Execute the following steps on your computer while you read:
In Rstudio
- go to
File
>New File
>R Markdown
- enter
Title
andAuthor
- select
html
as output - confirm
A new .Rmd file should open with a short tutorial.
To render, or knit
, the file to html
press Knit
. The first time you run the script you will have to specify the name under which to save it. Afterwards the script is always saved before rendering.
YAML header
The YAML header of an R Markdown files contains the meta data that influence the final document in different ways. See the short summary from the open-science-with-r carpentries course:
_ The header of your R Markdown document will allow you to personalize the related report from your R Markdown document. The header follows the YAML syntax (“YAML Ain’t Markup Language”) which usually follows a
key:value
syntax. _
For example, title: "titlename"
, where title
is the key and "titlename"
is the value.
The header itself starts with ---
and stops with ---
. For example:
---
title: "titlename"
output: html_output
---
More information about the YAML header can be found in the R Markdown cheat sheet.
Code chunks
The narrative text of a report is written in the simple Markdown syntax. Code chunks are specific to R Markdown. They contain R code that is to be executed when rendering the chunk or the entire file, i.e. including the data analysis.
To start a chunk write (backticks) ` {r} `, then place your R code and end the chunk with `
. The
r in
```{r} ` indicates that the programming language used in this chunk is R. Other options include python
or bash
although we will not need these here.
Within RStudio a new code chunk can be included by either clicking on Insert a new code chunk
in the toolbar or using a keyboard shortcut (Ctrl+Alt+I
on Windows and Option+Command+I
on Mac).
Each chunk can be run separately. To run the code in an individual chunk click on the green arrow (Run Current Chunk
) on the right side of the chunk.
Alternatively use the keyboard shortcut Ctrl+Alt+T
(Windows) or Option+Command+T
(Mac) to run the current chunk (i.e. where your cursor is located). This runs only the code in the specific chunk but does not render the entire file.
For more options see the cheat sheet in R studio: Help
> Cheat Sheets
> R Markdown Cheat Sheet
or the link above.
The behavior of code chunks can be changed by setting chunk options. This is done in the opening of the chunk, e.g. ` ```{r, echo=FALSE}`, which hides the code of this chunk (while still evaluating it). For more options see the R Markdown cheat sheet or the R Markdown Cookbook.
Note: inline R code, i.e. code directly within the narrative text, can be run with `r `
, placing the code after r
. This is for example useful when you mention a sample size in the text and want it to update directly from the data set you read in.
Quiz on literate programming and R Markdown
Literate programming
Which of the following statements about literate programming are true?
- Literate programming combines code and text.
- Literate programming makes your code run more efficient.
- Literate programming makes your analysis easier to understand.
- Code should only be shown if necessary.
- Plots should not be included.
Solution
T Literate programming combines code and text.
F Literate programming makes your code run more efficient.
T Literate programming makes your analysis easier to understand.
F Code should only be shown if necessary.
F Plots should not be included.
YAML header: author
How do you specify the author in an R Markdown document?
- author: “name”
- Author: “name”
- Author = “name”
- Author: ‘name’
- author: name
Solution
T author: “name”
F Author: “name”
F Author = “name”
F Author: ‘name’
T author: name
YAML header: date
How do you set the date?
- Date: 01/01/2021
- Date = 01/01/2021
- datum: 01/01/2021
- date: 01/01/2021
Solution
F Date: 01/01/2021
F Date = 01/01/2021
F datum: 01/01/2021
T date: 01/01/2021
Chunk options
How do you prevent code from evaluation in a chunk?
- evaluate=FALSE
- eval=FALSE
- noeval=TRUE
- hinder=TRUE
- hind=TRUE
- interpret=FALSE
- inpr=FALSE
Solution
F evaluate=FALSE
T eval=FALSE
F noeval=TRUE
F hinder=TRUE
F hind=TRUE
F interpret=FALSE
F inpr=FALSE
Chunk options: figure height
How do you adjust the figure height?
- figure_height=100
- figureheight=100
- heightfigure=100
- fig_height=100
- fig.height=100
- height.fig=100
Solution
F figure_height=100
F figureheight=100
F heightfigure=100
F fig_height=100
T fig.height=100
F height.fig=100
R Markdown Practice
Modify the template found here by performing the following steps:
- add author and date to YAML header
- rename the first chunk to
print_chunk_nr_1
- set the chunk options of chunk 2 to not show the code
- set the chunk options of chunk 3 to not evaluate the code but show it
- set the chunk options of chunk 4 to not show the warning
- complete the sentence at the end with appropriate information calculated in chunk 5
After these steps answer the questions:
- The percentage of children who survived the Titanic accident was (Note: round to one decimal digit)
- The percentage of female survivors was ___ times higher as the percentage of male survivors. (Note: round to two decimal digits)
Solution
- 52.3
- 3.45
Episode challenge
The goal of this challenge is to create a fully reproducible analysis within an R Markdown script which is easy to understand and read. For that, describe what you do and do not forget nice formatting. For example try different Markdown syntax (Headers, …), figure captions (Hint: check the chunk options for this), a meaningful YAML header, etc.
Analysis of the palmer penguins data
Create a new R Markdown document and write the code for each of the below questions in a separate paragraph/chunk and describe the result in a complete sentence directly in the R Markdown document. We will use the
penguins
dataset from the packagepalmerpenguins
, available as the data setpenguins
after installing the package. To get an overview of the data load thepenguins
dataset and explore the following questions:
Question 1
Find the the source of the
penguins
dataset together with the url to the data repository. Hint: run?penguins
Solution 1
library(palmerpenguins)
Can I put ?penguins in the chunk?
The source is: Adélie penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. doi: 10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f
Gentoo penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Gentoo penguin (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. doi: 10.6073/pasta/7fca67fb28d56ee2ffa3d9370ebda689
Chinstrap penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguin (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 6. Environmental Data Initiative. doi: 10.6073/pasta/c14dfcfada8ea13a17536e73eb6fbe9e < Originally published in: Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081
Question 2
Create a Markdown table describing the
penguins
data: for each numeric column in the dataset create a row in the table which should consist of the following columns:Column_Name
,Mean
,Variance
Hint: Checkout the functionknitr::kable
and the chunk optionresults='asis'
.Solution 2
numericcols <- sapply(colnames(penguins), function(x) is.numeric(penguins[[x]])) df <- data.frame(Column_Name = names(numericcols)[numericcols], Mean = signif(apply((na.omit(penguins[numericcols])), 2, mean), 4), Variance = signif(apply((na.omit(penguins[numericcols])), 2, var), 4), row.names = NULL ) knitr::kable(df)
Column_Name Mean Variance bill_length_mm 43.92 2.981e+01 bill_depth_mm 17.15 3.900e+00 flipper_length_mm 200.90 1.977e+02 body_mass_g 4202.00 6.431e+05 year 2008.00 6.678e-01
Question 3
How many rows does the
penguins
dataset have?Solution 3
result_question_3 <- dim(penguins)[1]
The data set has 344 rows.
Question 4
What is the first year of records in the data set?
Solution 4
result_question_4 <- min(penguins$year)
The first year of records is 2007.
Question 5
What is the total number of Adelie penguins?
Solution 5
result_question_5 <- sum(penguins$species == "Adelie")
The total number of Adelie penguins is 152
Question 6
What is the total number of missing values (
NA
)?Solution 6
result_question_6 <- sum(is.na(penguins))
The total number of missing values (NA’s) is 19.
Question 7
What is the total number of rows with no missing values?
Solution 7
result_question_7 <- sum(apply(penguins, 1, function(x) !any(is.na(x))))
The number of complete rows (rows with no missing values, i.e. NA’s) is
rresult_question_7
.
Question 8
On which islands were the Gentoo penguins found?
Solution 8
result_question_8 <- unique(penguins$island[penguins$species == "Gentoo"]);
The name of islands where the Gentoo penguins were found is Biscoe.
Question 9
What is the proportion of Adelie penguins on Dream island (compared to all penguins on Dream island)?
Solution 9
result_question_9 <- sum(penguins$species == "Adelie" & penguins$island == "Dream") / sum(penguins$island == "Dream")
The proportion of Adelie penguins on Dream island is 0.4516129.
Question 10
What is the 93% quantile of the bill lengths in mm?
Solution 10
result_question_10 <- quantile(na.omit(penguins$bill_length_mm), 0.93)
The 93 % quantile of
bill_length_mm
is 51.3.
Question 11
What is the absolute mean difference of bill depth in mm between female and male penguins?
Solution 11
result_question_11 <- abs(coef(lm(bill_depth_mm ~ sex, penguins))[2])
The absolute mean difference of
bill_depth_mm
between female and male is 1.4656169.
Question 12
What is the 95% confidence interval of the slope of the linear regression with intercept of bill depth regressed on sex? Result will be a vector of two elements, e.g.
c('lower_limit', 'upper_limit')
.Solution 12
result_question_12 <- confint(lm(bill_depth_mm ~ sex, penguins), "sexmale" )
The 95% confidence interval of the slope of the linear regression between
bill_depth_mm
andsex
is 1.0710254, 1.8602083
Question 13
What is the proportion of Chinstrap penguins with flipper length in mm smaller than 205 and bill length in mm larger than 45 compared to all penguins with flipper length in mm smaller than 205 and bill length in mm larger than 45?
Solution 13
chins <- na.omit(penguins$species[penguins$flipper_length_mm < 205 & penguins$bill_length_mm > 45]) result_question_13 <- sum(chins == "Chinstrap") / length(chins)
The proportion of Chinstrap penguins with
flipper_length_mm
smaller than 205 andbill_length_mm
larger than 45 compared to all penguins withflipper_length_mm
smaller than 205 andbill_length_mm
larger than 45 is 0.9310345.
Question 14
What is the proportion of Chinstrap penguins with
flipper_length_mm
smaller than 205 andbill_length_mm
larger than 45 compared to all Chinstrap penguins?Solution 14
result_question_14 <- sum(chins == "Chinstrap") / sum(penguins$species == "Chinstrap")
The proportion of Chinstrap penguins with
flipper_length_mm
smaller than 205 andbill_length_mm
larger than 45 compared to all Chinstrap penguins is 0.7941176.
Bonus challenge
R Markdown tables
For the following challenge we will use the package
kableExtra
which extends the base capabilities ofknitr::kable
to create tables. From the package vignette:_ The goal of kableExtra is to help you build common complex tables and manipulate table styles. It imports the pipe
%>%
symbol from magrittr and verbalize all the functions, so basically you can add “layers” to a kable output in a way that is similar with ggplot2 and plotly._For users who are not very familiar with the pipe operator
%>%
in R: it is the R version of the fluent interface. The idea is to pass the result along the chain for a more literal coding experience. Basically, when we say A%>%
B, technically it means sending the results of A to B as B’s first argument.Simple tables can be generated as follows:
library(kableExtra) head(penguins) %>% # the dataset, '%>%' parses the output of this command as the input in the next >command kbl() %>% # the kableExtra equivalent of knitr::kable, base table kable_classic() # add theme to table
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year Adelie Torgersen 39.1 18.7 181 3750 male 2007 Adelie Torgersen 39.5 17.4 186 3800 female 2007 Adelie Torgersen 40.3 18.0 195 3250 female 2007 Adelie Torgersen NA NA NA NA NA 2007 Adelie Torgersen 36.7 19.3 193 3450 female 2007 Adelie Torgersen 39.3 20.6 190 3650 male 2007 For all options check the documentation or the vignette.
Task 1 Create the following table:
bill length [mm] bill depth [mm]species 2007 2008 2009 2007 2008 2009 Adelie 38.8 38.6 39.0 18.8 18.2 18.1 Chinstrap 48.7 48.7 49.1 18.5 18.4 18.3 Gentoo 47.0 46.9 48.5 14.7 14.9 15.3 Hint: checkout the different styling functions, e.g.
kable_classic
.
Hint: For multiple column names useadd_header_above
Hint: Use the following code to get started.Solution
df_sum <- penguins %>% dplyr::select(-sex, -island, -flipper_length_mm, -body_mass_g) %>% dplyr::group_by(species, year) %>% dplyr::summarise(dplyr::across(.fns = function(x) signif(mean(na.omit(x)), 3))) %>% tidyr::pivot_wider(names_from = c(year), values_from = c(bill_length_mm, bill_depth_mm)) df_sum %>% kbl(col.names = c("species", rep(c("2007", "2008", "2009"), 2)))%>% kable_classic() %>% add_header_above(c(" " = 1, "bill length [mm]" = 3, "bill depth [mm]" = 3)) %>% kable_styling(bootstrap_options = c("hover")) %>% column_spec (c(1, 4), border_right = T)
Task 2 Create the following table wich includes small graphs:
bill length [mm] bill depth [mm]species mean boxplot histogram mean boxplot histogram Adelie 38.8 18.3 Chinstrap 48.8 18.4 Gentoo 47.5 15.0 Hint: Use
column_spec
for altering specific columns.Solution
df_sum <- penguins %>% dplyr::select(-island, -sex, -year, -body_mass_g, -flipper_length_mm) %>% dplyr::group_by(species) %>% dplyr::summarise(dplyr::across(.cols = !contains("species"), .fns = function(x) signif(mean(na.omit(x)), 3))) %>% dplyr::mutate(bill_length_boxplot = "", bill_length_hist = "", bill_depth_boxplot = "", bill_depth_hist = "") dfsum_list <- split(penguins$bill_length_mm, penguins$species) dfsum_list2 <- split(penguins$bill_depth_mm, penguins$species) df_sum %>% dplyr::select(species, dplyr::starts_with("bill_length"), dplyr::starts_with("bill_depth")) %>% kbl(col.names = c("species", rep(c("mean", "boxplot", "histogram"), 2))) %>% kable_paper() %>% column_spec(1, border_right = TRUE) %>% column_spec(3, image = spec_boxplot(dfsum_list)) %>% column_spec(4, image = spec_hist(dfsum_list), border_right = T) %>% column_spec(6, image = spec_boxplot(dfsum_list2)) %>% column_spec(7, image = spec_hist(dfsum_list2)) %>% add_header_above(c(" " = 1, "bill length [mm]" = 3, "bill depth [mm]" = 3), border_right = TRUE, border_left = TRUE)
Key Points
Code-based analysis is better for reproducibility.
Combining narrative and code-based results is even more profitable.
Code chunks in R Markdown provide an easy solution