This lesson is still being designed and assembled (Pre-Alpha version)

Understanding Open and Reproducible Science

Scientific integrity, Open Science and reproducibility

Overview

Teaching: 60 min
Exercises: 60-90 min
Questions
  • What is scientific integrity and what is the link to Open Science and reproducibility?

  • What is Open Science and which aspects are important to me?

  • What is reproducibility and why should I care about it?

Objectives
  • Understand the connections between scientific integrity, Open Science and reproducibility

  • Name the requirements for designing, carrying out and reporting research projects such that scientific integrity is respected

  • Discriminate between so-called negative and positive results

  • List all/many of the dimensions of Open Science

  • Explain why and know where to preregister studies

  • Apply these concepts when reading about research

1. What is scientific integrity and what is the link to Open Science and reproducibility?

Scientific/research integrity at the University of Zurich

When the term “scientific integrity” comes up, one often thinks of topics such as

Note that conflicts of interest can also be the subject of studies: https://doi.org/10.1186/s13643-020-01318-5

For each of these topics we link to the University of Zurich regulations here; most other universities will have corresponding regulations in place. But these topics are not the main interest of this course; we will instead focus on the aspects of research integrity discussed below.

National and international guidance documents on research integrity

Several guidance documents exist, see three European examples here:

⇒ We will have a brief look at each of the documents and work on the Swiss document in more detail.

LERU: Towards a research integrity culture at Universities

In a summary chapter the guidance document states what Universities should do to empower sound research:

Improve the design and conduct of research:

Improve the soundness of reporting

⇒ The points in bold are topics of this course and directly related to reproducibility as we will see below and later.

European code of conduct for research integrity

The EU code states that Good research practices are based on fundamental principles of research integrity.

⇒ You will find these same main principles in the Swiss guidance document! Adhering to the principles of reliability, honesty and accountability requires, among other aspects, to work reproducibly and openly.

The Swiss code of conduct for scientific integrity

The same principles occur in the Swiss document, here with a direct pointer to reproducibility:

“Reliability, honesty, respect, and accountability are the basic principles of scientific integrity. They underpin the independence and credibility of science and its disciplines as well as the accountability and reproducibility of research findings and their acceptance by society. As a system operating according to specific rules, science has a responsibility to create the structures and an environment that foster scientific integrity.”

Quiz on the Swiss code of conduct for scientific integrity

For these questions, please read or search the Code up to page 26.

Audience

At which of the following groups of people is the code of conduct aimed?

  • researchers at research performing institutions
  • educators at higher education institutions
  • administrative staff at research performing institutions
  • students at higher education institutions

Solution

T researchers at research performing institutions
T educators at higher education institutions
F administrative staff at research performing institutions
T students at higher education institutions

Reliability

For reliability researchers need to use, e.g.,

  • appropriate study designs
  • the most current methods
  • simple analysis methods
  • transparent reporting
  • traceable materials and data

Solution

T appropriate study designs
F the most current methods
F simple analysis methods
T transparent reporting
T traceable materials and data

Computer code

The code does not mention reproducible code (in the sense of computer code) directly. Find an implicit location where the use of reproducible code is implied by the standards of Chapter 4. Copy the entire bullet point or just the relevant verb.

Solution

The Code states “Researchers should design, undertake, analyse, document, and publish their research with care and with an awareness of their responsibility to society, the environment, and nature.” Using a scripting language for data analysis and providing the corresponding code hence caters to the “documenting” step.

Negative results

The non-publication of so-called negative results can be seen as a violation of scientific integrity. Find the behavior in Chapter 5 of the Code which this can be related to.

Solution

The Code lists “omitting or withholding data and data sources” as a behavior which is an example of scientific misconduct.

Example

Publication of negative results

Therapeutic fashion and publication bias: the case of anti-arrhythmic drugs in heart attack

  • In the 1970s, it was found that the local anaesthetic drug lignocaine (lidocaine) suppressed arrhythmias after heart attacks
  • That this claim was wrong was difficult to recognise from small clinical trials that looked only at effects on arrhythmias, not at outcomes that really matter, such as deaths.
  • Large clinical trials in the late 1980s showed that the drugs actually increased mortality.
  • The results of Hampton and co-authors’ small but negative trial regarding the anti-arrhythmic agent lorcainide were not published because no journal was willing to do so at the time.
  • A cumulative meta-analysis of previous anti-arrhythmic trials would have helped avoid tens of thousands of unnecessarily early deaths, even more so if results like those of Hampton and co-authors had been available.
  • With the words ‘publication bias’ in the title, the trial results could finally be published in the early 1990s:
    Therapeutic fashion and publication bias: the case of anti-arrhythmic drugs in heart attack

J Hampton https://journals.sagepub.com/doi/10.1177/0141076815608562

Bottom line: This is a very impressive example of the consequences of non-publication of “negative” results. The authors themselves are not to blame, they have maintained their integrity as researchers. The example shows that the publication of all results is indeed a principle of research integrity in the sense of the integrity of the research record as a whole.

 

2. What is Open Science?

Let's play the game “Open up your research”

https://www.openscience.uzh.ch/de/moreopenscience/game.html

Dimensions of Open Science

Which decisions did Emma need to take in the game?

Solution

  1. Involve a librarian?
  2. Write a data management plan?
  3. Preregister her research plan?
  4. Make her data FAIR?
  5. Publish Open Access?
  6. Publish data and/or code?

UNESCO recommendation on Open Science

In 2021 UNESCO published their recommendations for Open Science. From their point of view Open Science is a tool helping to create a sustainable future. In the bold face part of the quote we see the link of Open Science to scientific integrity and also reproducibility:

“Building on the essential principles of academic freedom, research integrity and scientific excellence, open science sets a new paradigm that integrates into the scientific enterprise practices for reproducibility, transparency, sharing and collaboration resulting from the increased opening of scientific contents, tools and processes.”

Image credit: UNESCO Recommendation on Open Science, CC-BY-SA.

Optional: Read the full recommendation text at https://en.unesco.org/science-sustainable-future/open-science/recommendation.

Open Science made easy by the Open Science in Psychology/Social Science initiatives

The Open Science in Psychology/Social Science initiatives summarize and explain the practice of Open Science in seven steps: https://osf.io/hktmf/. Some of these steps were also part of Emma’s decision process. Here we show an abbreviated version of the seven steps:

Image credit: Eva Furrer, unlicensed, abbreviated version of https://osf.io/hktmf/.

We will revisit the following steps in this lesson:

  1. Create OSF account (use easy infrastructure for collaboration)
  2. Preregister your own studies
  3. Open Data
  4. Reproducible Code
  5. Open Access (preprints)

What is preregistration?

The Open up your research game and the seven steps above refer to preregistration. But what is preregistration? The Texas sharpshooter cartoon shows an unregistered experiment. The shooter first shoots and then draws the bull's eyes around his shots. He did not preregister where he wanted to shoot before shooting.

Image credit: Illustration by Dirk-Jan Hoek, CC-BY.

When a researcher preregisters a study, the design and precise goal of the study are declared openly in advance: the bull's eye is drawn.

Origins of preregistration: clinical trials

A clinical trial is an experiment involving human volunteers, for example in the development of a new drug. Registration of clinical trials, i.e. announcing before any data are collected that a trial will be conducted and what its goal is, has been a standard since the late 1990s. It is considered a scientific, ethical and moral responsibility for all trials because:

  • Informed decisions are difficult under publication bias and selective reporting, i.e. the non-publication of negative results and the focus on publishing positive results, which might not reflect the original goals. Publication bias and selective reporting result in a biased view of the situation.
  • Describing clinical trials in progress simplifies identification of research gaps
  • The early identification of potential problems contributes to improvements in the quality

Since the late 1990s, the Declaration of Helsinki has required: “Every clinical trial must be registered […]”

Registries (non-exhaustive list)

Here is a list of registries, where (pre)registration can be done:

  • Clinicaltrials.gov: US and international registry for clinical trials, first of its kind, established 1997: https://clinicaltrials.gov/

  • OSF: General purpose registry, also a research management tool (not just for preregistration), embargo possible for up to 4 years: https://osf.io/

  • Aspredicted: General purpose registry, protocols can be private forever, possibility to automatically delete an entry after 24 hours:
    https://aspredicted.org/

  • Preclinicaltrials.eu: Comprehensive listing of preclinical animal study protocols
    https://preclinicaltrials.eu/

  • PROSPERO International prospective register of systematic reviews
    https://www.crd.york.ac.uk/prospero/

Quiz on registration

Does registration show an effect?

All large National Heart, Lung, and Blood Institute (NHLBI) supported randomized controlled trials between 1970 and 2012 evaluating drugs or dietary supplements for the treatment or prevention of cardiovascular disease are shown with their reported outcome measure in the graphic. Trials were included if direct costs were greater than $500,000/year, participants were adult humans, and the primary outcome was cardiovascular risk, disease or death.


Image Credit: R Kaplan and V Irvin https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0132382, CC-BY.

What is the difference between what you observe before and after the year 2000 in this graphic?

Solution

Before 2000 one sees many positive effects, i.e. treatments that lower the relative risk of cardiovascular disease, but also null effects; in general the reported effects are larger. After registration of the primary outcome became mandatory, less outcome switching could occur and many more null effects were reported. The policy change helped to overcome this particular aspect of selective reporting.

 

3. What is reproducibility?

Reproducibility vs replicability

Reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. This requires, at minimum, the sharing of data sets, relevant metadata, analytical code, and related software.

Replicability refers to the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected.

See S Goodman et al. https://www.science.org/doi/10.1126/scitranslmed.aaf5027 for a finer grained discussion of the concepts.

What is reproducibility?

“This is exactly how it seems when you try to figure out how authors got from a large and complex data set to a dense paper with lots of busy figures. Without access to the data and the analysis code, a miracle occurred. And there should be no miracles in science.”

See artwork by Sidney Harris at http://www.sciencecartoonsplus.com/ for an illustration of the remark “I think you should be more explicit here in step two” when a miracle occurs.

The quote is from F Markowetz https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7. In this publication the author asks what working reproducibly means for his daily work and comes up with “Five selfish reasons to work reproducibly”, which is also the title of the paper.

“Working transparently and reproducibly has a lot to do with empathy: put yourself into the shoes of one of your collaboration partners and ask yourself, would that person be able to access my data and make sense of my analyses. Learning the tools of the trade will require commitment and a massive investment of your time and energy. A priori it is not clear why the benefits of working reproducibly outweigh its costs.”

In this course we will learn about some of the tools Markowetz lists in his paper.

(Anti-)Example from the Markowetz paper

How bright promise in cancer testing fell apart

Image Credit: adapted from the open access article by K Baggerly and K Coombes https://projecteuclid.org/journals/annals-of-applied-statistics/volume-3/issue-4/Deriving-chemosensitivity-from-cell-lines–Forensic-bioinformatics-and-reproducible/10.1214/09-AOAS291.full.

From G Kolata https://www.nytimes.com/2011/07/08/health/research/08genes.html.

“When Juliet Jacobs found out she had lung cancer, she was terrified, but realized that her hope lay in getting the best treatment medicine could offer. So she got a second opinion, then a third. In February of 2010, she ended up at Duke University, where she entered a research study whose promise seemed stunning.

Doctors would assess her tumor cells, looking for gene patterns that would determine which drugs would best attack her particular cancer. She would not waste precious time with ineffective drugs or trial-and-error treatment. The Duke program — considered a breakthrough at the time — was the first fruit of the new genomics, a way of letting a cancer cell’s own genes reveal the cancer’s weaknesses.

But the research at Duke turned out to be wrong. Its gene-based tests proved worthless, and the research behind them was discredited. Ms. Jacobs died a few months after treatment, and her husband and other patients’ relatives have retained lawyers.”

Markowetz wonders in his paper why no one noticed these issues before it was too late. He comes to the conclusion that the data and analysis were not transparent and required forensic bioinformatics to untangle.

Those forensic bioinformatics were provided by K Baggerly and K Coombes https://projecteuclid.org/journals/annals-of-applied-statistics/volume-3/issue-4/Deriving-chemosensitivity-from-cell-lines–Forensic-bioinformatics-and-reproducible/10.1214/09-AOAS291.full:

“Poor documentation hid an off-by-one indexing error affecting all genes reported, the inclusion of genes from other sources, including other arrays (the outliers), and a sensitive/resistant label reversal.”

Bottom line: Data analyses that are done using reproducible code and that are documented well are easier to check, for the analysts themselves and for others. Such practices decrease the chances that errors as in this example are made and this outweighs the effort and time they cost.

   

Episode challenge

A waste of 1000 research papers

Read the article “A Waste of 1000 Research Papers” by Ed Yong (The Atlantic, 27 May 2019).

Question 1

Find situations in the article where publication bias, preregistration and data sharing could have helped to avoid such waste. Copy the corresponding lines from the article and name one or two reasons why you think that those concepts could have helped.

Question 2

Use smart search terms to find the concepts such that you do not need to read the entire research article.

Question 3

Go to the research article of Border et al. that is mentioned in Yong’s article and find out which of the above concepts have been respected in this article. Justify with citations.

Question 4

What are your overall conclusions?

Solution

No solution provided here.

Key Points

  • Scientific integrity, Open Science and reproducibility are connected.

  • All three themes are important for the trustworthiness of research results.

  • The tools that will be taught in this course help to increase trustworthiness.


First steps towards more reproducibility

Overview

Teaching: 60 min
Exercises: 90-120 min
Questions
  • Is there a reproducibility/replicability crisis?

  • How do I organize projects and software code to favor reproducibility?

  • How do I handle data in spreadsheets to favor reproducibility?

Objectives
  • Practice good habits in file and folder organization which favour reproducibility

  • Practice good habits in data organization in spreadsheets which favour reproducibility

  • Get some practical tips for the use of RStudio (optional)

1. Is there a reproducibility/replicability crisis?

First, we will look at anecdotal and empirical evidence of issues with reproducibility/replicability in the scientific literature. Along the way we point to the pertinent ones among Markowetz's five selfish reasons for working reproducibly. This episode hence gives some insight into the background and a few first practical tools for reproducible research practice.

 

Recall: Reproducibility vs Replicability

Reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. This requires, at minimum, the sharing of data sets, relevant metadata, analytical code, and related software.

Replicability refers to the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected.

See S Goodman et al. https://www.science.org/doi/10.1126/scitranslmed.aaf5027 for a finer grained discussion of the concepts.

Retracted Nature publication

See this example of a publication, W Huang et al. https://www.nature.com/articles/s41586-018-0311-z, published in the prestigious journal Nature, that had issues with replication:


The publication contains an important finding on how a special type of RNA can affect T-cells, which could imply therapeutic effects and lead to new drugs for autoimmune diseases. But some of the co-authors were unable to replicate the published results and hence initiated a retraction. The retraction note states:
“In follow-up experiments to this article, we have been unable to replicate key aspects of the original results.”

For more information, see the commentary on Retraction Watch.

⇒ Replication may be an issue for other published research results, here the authors themselves discovered it. See below for some concerted efforts to study replicability of published scientific results.

Selfish reason number 5: reproducibility helps to build your reputation

“Generally, making your analyses available in this way will help you to build a reputation for being an honest and careful researcher. Should there ever be a problem with one of your papers, you will be in a very good position to defend yourself and to show that you reported everything in good faith.”

F Markowetz https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Issues with reproducibility/replication affect many fields

Is retraction due to error or failed replication rare? Attempts to replicate published results are becoming more frequent. For example, in some fields there have been concerted efforts to replicate larger sets of studies, allowing the reliability of results to be assessed on a wider scale.

Biomedicine

F Prinz et al. attempted “To substantiate our incidental observations that published reports are frequently not reproducible with quantitative data, we performed an analysis of our early (target identification and validation) in-house projects in our strategic research fields of oncology, women's health and cardiovascular diseases that were performed over the past 4 years.” Of 67 studies, only 22 were fully or partially confirmed, see https://www.nature.com/articles/nrd3439-c1.

Psychology

The Open Science Collaboration in Psychology “conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available” in a project titled “Estimating the reproducibility of psychological science”. Only 39 of 100 replication studies were significant and the mean effect size was about 50% of that in the original studies, see https://www.science.org/doi/10.1126/science.aac4716.

Economics

C Camerer et al. replicated “18 studies published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014. All of these replications followed predefined analysis plans that were made publicly available beforehand, and they all have a statistical power of at least 90% to detect the original effect size at the 5% significance level.” They found significant effects in 11 of 18 studies with a mean effect size of about 66% compared to the original studies, see https://www.science.org/doi/10.1126/science.aaf0918.

Social Sciences

Again C Camerer et al. “replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015.” They found significant effects in 13 of 21 studies with mean effect size about 50% compared to the original studies, see https://www.nature.com/articles/s41562-018-0399-z.

Selfish reason number 1: reproducibility helps to avoid disaster

“This experience showed me two things. First of all, a project is more than a beautiful result. You need to record in detail how you got there. And second, starting to work reproducibly early on will save you time later. We wasted years of our and our collaborators’ time by not being able to reproduce our own results. All of this could have been avoided by keeping better track of how the data and analyses evolved over time.”

F Markowetz https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

1,500 scientists lift the lid on reproducibility

In 2016 M Baker designed a survey meant to shed “light on the ‘crisis’ rocking research.” Here we discuss some of the results of the survey; for a complete report see https://www.nature.com/articles/533452a. The two graphs reproduced from the raw data of the publication show that a large proportion of researchers believe that there are issues with reproducibility but that, again in the opinion of researchers, the extent of the problem differs between disciplines. Specifically, researchers from the “hard” sciences such as chemistry and physics more frequently believe that the published work in their field is reproducible than, for example, researchers in the “softer” sciences of biology and medicine.


Image credit: Figures are reproduced from https://www.nature.com/articles/533452a with the data available on Figshare

Factors contributing to irreproducible research

Baker also tried to evaluate which factors could contribute to this perceived reproducibility issue. Most researchers (more than 95%) believe that selective reporting and pressure to publish always/often or sometimes contribute to irreproducibility. Still about 90% believe that low statistical power or poor analysis, not enough replication in the original lab and insufficient mentoring/oversight always/often or sometimes contribute. Around 80% agree with unavailability of methods/code, poor experimental design, unavailability of raw data and insufficient peer review as contributing factors at least sometimes. Fraud plays a more minor role in the opinion of researchers.


Image credit: Figures are reproduced from https://www.nature.com/articles/533452a with the data available on Figshare

 

Quiz on reproducibility/replicability

Effect size

Within the concerted replication efforts, the effect sizes of the replication attempts were on average (for one of the efforts we do not have this information)

  • smaller than the original effect
  • approximately the same as the original effect
  • bigger than the original effect

Solution

T smaller than the original effect
F approximately the same as the original effect
F bigger than the original effect

Factors contributing to irreproducibility

Peeking at the content below: which of the above factors contributing to irreproducible research does the current episode of this course address?

Solution

Methods, code unavailable

 

2. Organization and software

In this section we learn about simple tools to avoid the fear in Markowetz’ selfish reason number 4.

Selfish reason number 4: reproducibility enables continuity of your work

“I did this analysis 6 months ago. Of course I can’t remember all the details after such a long time.”

F Markowetz https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Project organization

The main principle of data analytic project organization is the separation of the different parts of a project, such as raw data, generated results, code and documents.

Project organization checklist

To achieve these principles make sure that you follow a procedure similar to:

  1. Put each project in its own directory named after the project.
  2. Put text associated documents in the doc directory.
  3. Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
  4. Put project source code in the src directory.
  5. Put external scripts or compiled programs in the bin directory.
  6. Name all files to reflect their content or function.

From Good enough practices in scientific computing by G Wilson et al. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510
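The checklist above can be scaffolded with a short script. This is an illustrative Python sketch, not part of the original recommendations; the project name `my_project` is made up:

```python
from pathlib import Path

def scaffold_project(root):
    """Create the directory layout recommended by Wilson et al."""
    root = Path(root)
    for sub in ["doc", "data", "results", "src", "bin"]:
        # parents/exist_ok: safe to rerun on an existing project
        (root / sub).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir())

print(scaffold_project("my_project"))
# → ['bin', 'data', 'doc', 'results', 'src']
```

Running the script once per project guarantees every project starts with the same, predictable structure.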

 

In Packaging Data Analytical Work Reproducibly Using R (and Friends) B Marwick et al. https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375986 suggest a slightly different but conceptually similar approach. They propose to organize projects as so-called “research compendia”, for example like:

Image credit: Illustration of research compendia as suggested in B. Marwick et al. by Eva Furrer, CC-BY, https://doi.org/10.5281/zenodo.7994355.

Software/code

Writing code for a data analysis, instead of using a GUI-based tool, makes the analysis reproducible to some degree (given the availability of the data and a comparable computing environment). Code can also serve as very detailed documentation of the methods employed, at least if it is written in an understandable way.
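As a sketch of this idea (in Python rather than R; the file and column names are invented for illustration), a scripted analysis records every step and can be rerun on the same data to reproduce the same numbers:

```python
import csv
import statistics

def mean_outcome(csv_path, column):
    """Read a raw data file and return the mean of one column.

    The file, the variable and the method are all recorded in code,
    so the analysis can be rerun and inspected by anyone.
    """
    with open(csv_path, newline="") as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    return statistics.mean(values)
```

Unlike a sequence of clicks in a GUI, the script itself documents exactly how a result was obtained.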

Selfish reason number 3: reproducibility helps reviewers see it your way

“One of the reviewers proposed a slight change to some analyses, and because he had access to the complete analysis, he could directly try out his ideas on our data and see how the results changed.”

F Markowetz https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Code understandability checklist

Use the following principles that make code easier to understand and use by others and your future self

  1. Place a brief explanatory comment at the start of every program.
  2. Decompose programs into functions.
  3. Be ruthless about eliminating duplication.
  4. Search for well-maintained libraries that do what you need.
  5. Test libraries before relying on them.
  6. Give functions and variables meaningful names.
  7. Make dependencies and requirements explicit.
  8. Do not comment and uncomment code sections to control behavior.
  9. Provide a simple example or test data set.

⇒ Your main goal with these principles is for your code to be readable, reusable and testable.

From G Wilson et al. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510
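A minimal sketch of several of these principles (header comment, small functions, meaningful names, no duplicated logic, built-in example data), written in Python; the names and data are invented for illustration:

```python
# Summarise outcomes per treatment group (principle 1: header comment).
from statistics import mean

def group_means(records, group_key, value_key):
    """Return {group: mean of value_key} for a list of dicts.

    One reusable function (principles 2, 3, 6) instead of
    copy-pasting the same calculation for every group.
    """
    groups = {}
    for rec in records:
        groups.setdefault(rec[group_key], []).append(rec[value_key])
    return {g: mean(vals) for g, vals in groups.items()}

# Simple built-in example data (principle 9: provide test data).
example = [
    {"arm": "treatment", "response": 1.0},
    {"arm": "treatment", "response": 3.0},
    {"arm": "control", "response": 2.0},
]
print(group_means(example, "arm", "response"))
# → {'treatment': 2.0, 'control': 2.0}
```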

On top of these high-level recommendations, writing and reading code is easier if one adheres to some styling rules. We have assembled our ten most important rules for code styling in R; these were influenced by https://style.tidyverse.org, https://google.github.io/styleguide/Rguide.html, https://cfss.uchicago.edu/notes/style-guide/ and by a lot of experience in reading code by others and our past selves.

10 Rules for code styling (in R)

  1. Code styling is about readability not about correctness. The most important factor for readability is consistency which also increases writing efficiency.
  2. Use white space for readability, spaces around operators (e.g. +), after commas and before %>%, line breaks before each command and after each %>%.
  3. Control the length of your code lines to be about 80 characters. Short statements, even loops, can fit on a single line.
  4. Indent your code consistently, the preferred way of indentation are two spaces.
  5. Use concise and informative variable names; do not use spaces, link words by underscores or use CamelCase. Avoid names that are already in use, e.g., mean, c.
  6. Comment your code such that its structure is visible and findable (use code folding in RStudio).
  7. Do not use the equal sign for assignment in R, <- is the appropriate operator for this. Avoid right-hand assignment, ->, since it deteriorates readability.
  8. Curly braces are a crucial programming tool in R. The opening { should be the last character on the line, the closing } the first (and last) on the line.
  9. File naming is part of good programming style. Do not use spaces or non-standard characters, use consistent and informative names.
  10. Finally, do use the assistance provided by RStudio: command/control + i and shift + command/ctrl + A.

Quiz on organization and software

Duplication

Which of the following situations are meant by the principle “be ruthless about eliminating duplication”?

  • Copy-pasting code for several cases of the same type of calculation
  • Several lines of code that are repeated at different locations in a larger script
  • The duplication of statistical results with two approaches
  • The same type of graph used for several cases

Solution

T Copy-pasting code for several cases of the same type of calculation
T Several lines of code that are repeated at different locations in a larger script
F The duplication of statistical results with two approaches
F The same type of graph used for several cases

Directories

Which directories would you use for cleaned data files of .csv format?

  • results
  • data
  • doc
  • results/cleaneddata

Solution

T results
F data
F doc
T results/cleaneddata

 

3. Data in spreadsheets

Selfish reason number 2: reproducibility makes it easier to write papers

“Transparency in your analysis makes writing papers much easier. For example, in a dynamic document (Box 1) all results automatically update when the data are changed. You can be confident your numbers, figures and tables are up-to-date. Additionally, transparent analyses are more engaging, more eyes can look over them and it is much easier to spot mistakes.”

F Markowetz https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

 


Image credit: Randall Munroe/xkcd at https://xkcd.com/2180/ licensed as CC BY-NC.

Humor aside, spreadsheets have advantages and disadvantages that can threaten reproducibility. But they are easy to use and so widespread that we had better learn how to use them properly. And indeed, data in spreadsheets can be organized in a way that favors reproducibility. We summarize the recommendations of the article by K Broman and K Woo https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989 in five checklists below. Broman and Woo promise that:

“By following this advice, researchers will create spreadsheets that are less error-prone, easier for computers to process, and easier to share with collaborators and the public. Spreadsheets that adhere to our recommendations will work well with the tidy tools and reproducible methods described elsewhere in this collection and will form the basis of a robust and reproducible analytic workflow.”

 

Spreadsheet consistency checklist

  1. Use consistent codes for categorical variables.
  2. Use a consistent fixed code for any missing values.
  3. Use consistent variable names.
  4. Use consistent subject identifiers.
  5. Use a consistent data layout in multiple files.
  6. Use consistent file names.
  7. Use a consistent format for all dates.
  8. Use consistent phrases in your notes.
  9. Be careful about extra spaces within cells.


Image credit: copyright 2023, William F. Hertha under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Choose good names for files and variables checklist

  • No spaces
  • Use underscores, hyphens, or periods (choose one and stick to it)
  • No special characters (&,*,%,ü,ä,ö,…)
  • Use a unique, short but meaningful name
  • Variable names have to start with a letter
  • File names: include zero-padded version number, e.g. V013
  • File names: include consistent date, e.g. YYYYMMDD

 

Be careful with dates checklist

  • Use the ISO 8601 global standard
  • Convention for dates in Excel is different on Windows and Mac computers
  • Dates have an internal numerical representation
  • Best to declare date columns as text, but only works prospectively
  • Consider separate year, month, day columns

Image credit: Randall Munroe/xkcd at https://xkcd.com/1179/ licensed as CC BY-NC.
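The points above can be tried out directly in R; a minimal sketch of handling ISO 8601 dates (the variable names are invented for the example):

```r
# ISO 8601 dates parse unambiguously in R
iso_date <- as.Date("2023-02-28")
format(iso_date, "%Y-%m-%d")        # "2023-02-28"

# Dates have an internal numerical representation:
# days since 1970-01-01
as.numeric(as.Date("1970-01-02"))   # 1

# Separate year, month, day values can be combined
ISOdate(2023, 2, 28)
```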

Make your data truly readable and rectangular checklist

  • Put a single piece of information per cell
  • Do not add remarks in cells that should contain numerical values, e.g. >10000
  • Include one variable per column, one row per subject: a rectangle of data
  • Use the first and only the first row for variable names
  • Do not calculate means, standard deviations etc. in the last row
  • Do not color, highlight or merge cells to codify information
  • Use data validation at data entry
  • Be careful with commas since they may be decimal separators
  • Consider write-protecting a file at the end of data collection
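A rectangular, consistently coded spreadsheet is exactly what reading tools expect. As a hedged sketch (the file and column names are invented for the example), a tidy CSV can be read with the missing-value code and date columns handled explicitly:

```r
# Illustrative sketch: a tidy file with one header row, one subject
# per row, one variable per column, and NA as the missing-value code.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,date,marker",
             "1,2023-02-28,12",
             "2,2023-03-01,NA"), csv_file)

dat <- read.csv(csv_file,
                na.strings = "NA",                  # consistent missing code
                colClasses = c(date = "character")) # keep dates as text

str(dat)  # a rectangle: 2 observations of 3 variables
```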

 

Code book/data dictionary checklist

  • Create a code book in a separate sheet or file
  • Code book contains

    • a short description
    • unit and max/min values for continuous variables
    • all levels with their code for categorical variables
    • ordering for ordinal variables
  • All variables have to be contained in the code book
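A code book can itself be kept as a small table in a separate sheet or file. A hypothetical example (all variable names, codes and ranges are invented for illustration):

```r
# A minimal code book kept as its own table, one row per variable
code_book <- data.frame(
  variable      = c("id", "sex", "height_cm"),
  description   = c("subject identifier",
                    "sex of subject",
                    "body height"),
  unit_or_codes = c("-",
                    "1 = female, 2 = male",
                    "cm; min 50, max 250"),
  stringsAsFactors = FALSE
)

code_book
```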

 

Quiz on data in spreadsheets

Variable names

What are good names for the variable containing average height per age class?

  • averageheightperageclass
  • av_height_agecls
  • height/class
  • av_height

Solution

F averageheightperageclass
T av_height_agecls
F height/class
F av_height

Ruthlessness

Choose how to best initialize the variables that contain the BMI (body mass index) of 17 subjects at three different time points.

  • bmi1 <- numeric(17); bmi2 <- numeric(17); bmi3 <- numeric(17)
  • bmi <- matrix(0, nrow=17, ncol=3)
  • bmi <- NULL; ind <- c(0,0,0); for (i in 1:17) bmi <- rbind(bmi, ind)

Solution

F bmi1 <- numeric(17); bmi2 <- numeric(17); bmi3 <- numeric(17)
T bmi <- matrix(0, nrow=17, ncol=3)
F bmi <- NULL; ind <- c(0,0,0); for (i in 1:17) bmi <- rbind(bmi, ind)

Special care for dates

This episode was created on February 28, 2023. Enter this date as an 8-digit integer:

Solution

20230228

Once more dates

This episode was created on February 28, 2023. Enter this date in ISO 8601 coding:

Solution

2023-02-28

Missing values

Choose all acceptable codes for missing values.

  • 99999
  • -99999
  • NA
  • ‘empty cell’
  • non detectable

Solution

F 99999
F -99999
T NA
T ‘empty cell’
F non detectable

Code styling

The preferred way of indenting code is

  • a tab
  • none
  • two spaces

Solution

F a tab
F none
T two spaces

 

Episode challenge

Improve a spreadsheet in Excel

Considering the input on data in spreadsheets try to improve the spreadsheet

trainingdata.xlsx

This spreadsheet contains data from 482 patients: two columns with dates and eight columns with counts of two different markers in the blood on a baseline date and on days 1, 2 and 3 of a certain therapy.

Specifically you should check

  • the plausibility of all observations (e.g. value in correct range)
  • the correct and consistent format of the entries, e.g. spelling or encoding errors
  • date formats
  • the format of missing values
  • variable names
  • the overall layout of the spreadsheet (header, merged cells, entries that are not observations etc.)

Solution

No solution provided here.

Improve a spreadsheet in R

We continue to work on the spreadsheet trainingdata.xlsx. This time we use R to correct the same errors in the spreadsheet. Why do you think it is better to use R for this process?

Solution

No solution provided here.

Key Points

  • Well organized projects are easier to reproduce

  • Consistency is the most important principle for coding analyses and for preparing data

  • Transparency increases reliability and trust and also helps my future self


Facilitating reproducibility in academic publications

Overview

Teaching: 90 min
Exercises: 60-90 min
Questions
  • How does academic publishing work?

  • What is the IMRAD format?

  • What are reporting guidelines and why are they useful for reproducibility?

  • How can we judge the quality and credibility of a preprint?

Objectives
  • Understand how the academic publishing process works

  • Know about IMRAD sections and detect content in articles efficiently

  • Find appropriate reporting guidelines and know their advantages

  • Review a preprint using a simple checklist

1. Primer on academic publishing

Why publish?

Results of research are published in the literature such that

but also such that

⇒ Publication advances science and the career of scientists

Where to publish?

Most scientific publications are in academic journals or books. Journals may be

There may be more than 30’000 journals publishing 2’000’000 articles, also known as papers, per year.

How does the process of publication in journals work?

Authors have to follow several steps, which are in general:

  1. Carry out a study or another type of research project, write an article, select a journal
  2. Submit the article to the journal
  3. The article is assigned to an editor and undergoes formal checks
  4. The editor decides if it will be peer-reviewed or rejected directly (desk rejection)
  5. The editor searches for peer reviewers, usually at least two independent and anonymous experts
  6. The article is peer-reviewed resulting in review reports
  7. The editor assesses the reports and makes a decision among:
    • Rejection: the article cannot be published at this journal
    • Revision: the article has to undergo changes, sometimes major, before publication
    • Acceptance: the article can be published as it is, most often conditional on small cosmetic changes

The below image illustrates this process:


Image Credit: (Part of) Illustration of the academic publication process by the Center for Reproducible Science (E. Furrer), CC-BY, https://doi.org/10.5281/zenodo.7994313.

What is a DOI and what is Crossref?

Since the goal of scientific articles is to contribute to the advancement of science they need to be findable and identifiable for future work. For that a necessary condition is that they have a unique identifier, which is nowadays, of course, a digital identifier.

From Wikipedia:

A digital object identifier (DOI) is a persistent identifier or handle used to identify objects uniquely, standardized by the International Organization for Standardization (ISO). An implementation of the Handle System, DOIs are in wide use mainly to identify academic, professional, and government information, such as journal articles, research reports, data sets, and official publications. DOIs have also been used, however, to identify other types of information resources, such as commercial videos.

Since a DOI is a unique identifier you can find any article by concatenating https://doi.org/ and the DOI of the article and pasting it in the URL field of your browser. Try it out with 10.1186/s13059-015-0850-7. DOIs are issued by, for example, Crossref:

“Crossref (formerly styled CrossRef) is an official digital object identifier (DOI) Registration Agency of the International DOI Foundation. It is run by the Publishers International Linking Association Inc. (PILA) and was launched in early 2000 as a cooperative effort among publishers to enable persistent cross-publisher citation linking in online academic journals.”

Hence Crossref is the organisation that registers most DOIs for academic publications. (Source: Wikipedia)

Indexing of journals/publications

Indexation of a journal, i.e. the inclusion of its articles in a bibliographic database, is considered a reflection of the quality of the journal. Indexed journals may be of higher scientific quality than non-indexed journals. Examples of indexes are:

Many universities also have in-house databases for the articles produced by their researchers: at the University of Zurich, for example, this is ZORA https://www.zora.uzh.ch/

Why is peer review part of the publication process?

Before you review for the first time see Open Reviewer Toolkit

Known issues of the process

Issues with peer review

  • Anonymity of peer reviewers but not of authors
  • Conflicts of interest of peer reviewers: plagiarism, delays, favouritism, biases
  • Peer reviewers may not be competent enough
  • Peer reviewers are volunteers and are hardly rewarded
  • The process is slow and unpredictable
  • Increasing numbers of publications make the process more and more unsustainable

Issues with the publication system

  • Sensational results are privileged over solid but less sensational research
  • Lacking equity, e.g. already published authors are given cumulatively more credit (Matthew effect)
  • Expensive, either in subscription fees to be able to read a journal or in processing charges to publish openly for everyone
  • Evaluation of researchers is publication based and this incentivises fast but not rigorous research (“publish or perish”)

Preprints

Preprints are a relatively new form of publication which helps to overcome some of the issues with peer review and with the publication system. See the extension of the above graphic including preprints in the publication process:

Image Credit: Illustration of the academic publication process by Eva Furrer, CC-BY, https://doi.org/10.5281/zenodo.7994313.

See also J Berg et al. for a comment on the introduction of preprints in Biology: https://www.science.org/doi/abs/10.1126/science.aaf9133.

Quiz on academic publishing

Peer review

Which statements are correct for the practice of peer review in academic publishing?

  • peer review contributes to keeping the quality of publications to a high standard
  • peer reviewers are financially rewarded for their contribution
  • peer review may take a long time and its outcome does not always depend on the quality of a publication
  • peer reviewers are always objective experts not pursuing their personal interests
  • one publication is always peer-reviewed by exactly one expert

Solution

T peer review contributes to keeping the quality of publications to a high standard
F peer reviewers are financially rewarded for their contribution
T peer review may take a long time and its outcome does not always depend on the quality of a publication
F peer reviewers are always objective experts not pursuing their personal interests
F one publication is always peer-reviewed by exactly one expert

Unfairness

The publication system is unfair: authors from prestigious institutions or authors who already have many publications are privileged; for them it is easier to publish since editors and reviewers decide in their favor more often. This type of effect is not unique to academic publishing but occurs in many areas of society.

A common name for this effect is:

Solution

Matthew effect

Preprints

Why do preprints help to overcome some of the issues with peer review and with the publication system?

Solution

Preprints avoid conflicts of interest of peer reviewers, allow certain and fast publication, and are completely free of charge.

 

2. What is the IMRAD format?

What is IMRAD?

The acronym IMRAD stands for “Introduction, Methods, Results and Discussion”. IMRAD is a widespread format in the biomedical, natural and social science research literature for reports on empirical studies. It is a convenience to readers because they can easily find the specific information they may be looking for in an article. See the article of J Wu https://link.springer.com/article/10.1007/s10980-011-9674-3 for a quick overview illustration.

Image credit: Illustration of the IMRAD concept, by Eva Furrer, CC-BY, https://doi.org/10.5281/zenodo.7994280.

R Day writes about the history of scientific publication in his article “The Origins of the Scientific Paper: The IMRAD Format”. He specifically mentions the scientific method and its cornerstone, the principle of reproducibility of results. The IMRAD format was introduced to represent the steps of the scientific method.

“Eventually, in 1972, the IMRAD format became “standard” with the publication of the American National Standard for the preparation of scientific papers for written or oral presentation.”

R Day American Medical Writers Association, 1989, Vol 4, No 2., 16–18. This article is not easily obtainable online, potentially your library can obtain it for you. If this is not possible, please contact the authors of this lesson.

What is the scientific method?

The Center for Reproducible Science at the University of Zurich uses a simplified graphical representation of the scientific method in its communications.

Image credit: Illustrations of meta research and the research cycle by Luca Eusebio and Eva Furrer, CC-BY, https://doi.org/10.5281/zenodo.7994222.

“What is the Scientific Method?” is a philosophical question that we cannot answer in full detail here, and it may be one of the questions to which there is no single correct answer. We will use the Stanford Encyclopedia of Philosophy definition as a first approximation:

“Often, ‘the scientific method’ is presented in textbooks and educational web pages as a fixed four or five step procedure starting from observations and description of a phenomenon and progressing over formulation of a hypothesis which explains the phenomenon, designing and conducting experiments to test the hypothesis, analyzing the results, and ending with drawing a conclusion.”

https://plato.stanford.edu/entries/scientific-method/

This view coincides with a common approach to empirical research. Even if it may be an oversimplification and a strong generalization, for this lesson we assume an underlying scientific process that is close to such an approach.

What should the IMRAD sections contain?

In 1997 the International Committee of Medical Journal Editors published “Uniform Requirements” on the structure of articles:

“The text of observational and experimental articles is usually (but not necessarily) divided into sections with the headings Introduction, Methods, Results, and Discussion. Long articles may need subheadings within some sections (especially the Results and Discussion sections) to clarify their content. Other types of articles, such as case reports, reviews, and editorials, are likely to need other formats. Authors should consult individual journals for further guidance.”

The Uniform Requirements have been updated in December 2021 and the most current version can be found here: http://www.icmje.org/about-icmje/faqs/icmje-recommendations/ The 1997 version of the requirements is available here: https://www.icmje.org/recommendations/archives/1997_urm.pdf.

The document contains much more than advice on structuring a manuscript, e.g. authorship roles, peer review roles etc. Please read the chapter/section “Manuscript Sections” in one of the two versions in order to get an overview of the expected content of the IMRAD sections.

There is a long list of journals that state that they follow these requirements http://www.icmje.org/journals-following-the-icmje-recommendations/

Quiz on IMRAD

Cornerstone of the scientific method

Hippocrates is credited as the discoverer of the scientific method. But he did not clearly state its cornerstone. The cornerstone of the scientific method is the:

Solution

reproducibility of results

Introduction section

What should the introduction section of an article following the IMRAD structure contain?

  • a short overview over the data and main conclusions of the article
  • the purpose/objective of the presented research
  • a complete and detailed background of the wider research area

Solution

F a short overview over the data and main conclusions of the article
T the purpose/objective of the presented research
F a complete and detailed background of the wider research area

Methods section

The methods section in an article following the IMRAD structure should contain

  • a descriptive analysis of the collected data such that appropriate methods can be chosen for the analysis
  • enough information such that a reader would in theory be able to reproduce the results
  • only information that was available before data collection

Solution

F a descriptive analysis of the collected data such that appropriate methods can be chosen for the analysis
T enough information such that a reader would in theory be able to reproduce the results
T only information that was available before data collection

Statistical methods

The statistical methods subsection of the methods section in an article following the IMRAD structure should contain

  • detailed information on software and packages
  • only p-values and no effect sizes or estimates of precision
  • distinguish pre-specified parts of the analysis from parts that have been done in an explorative way after looking at the collected data

Solution

T detailed information on software and packages
F only p-values and no effect sizes or estimates of precision
T distinguish pre-specified parts of the analysis from parts that have been done in an explorative way after looking at the collected data

Discussion section

The discussion section in an article following the IMRAD structure should contain

  • limitations of the study
  • those conclusions in view of the goals of the study that are supported by the results
  • a detailed summary of all results

Solution

T limitations of the study
T those conclusions in view of the goals of the study that are supported by the results
F a detailed summary of all results

 

3. Reporting guidelines

Reporting guidelines are checklists, based on wide agreement in a field, that provide more detailed guidance on the contents of the IMRAD sections.

Goals of reporting guidelines

The goals of Reporting Guidelines are summarized in I Simera and D Altman https://onlinelibrary.wiley.com/doi/full/10.1111/ijcp.12168. They summarize some key principles for responsible research reporting:

“Researchers should present their results clearly, honestly, and without fabrication, falsification or inappropriate data manipulation.”

“Researchers should strive to describe their methods clearly and unambiguously so that their findings can be confirmed by others.”

“Researchers should follow applicable reporting guidelines. Publications should provide sufficient detail to permit experiments to be repeated by other researchers.”

Good reporting is an ethical imperative

The WMA Declaration of Helsinki – Ethical Principles for Medical Research Involving Human Subjects states:

“Researchers, authors, sponsors, editors and publishers all have ethical obligations with regard to the publication and dissemination of the results of research. Researchers have a duty to make publicly available the results of their research on human subjects and are accountable for the completeness and accuracy of their reports. All parties should adhere to accepted guidelines for ethical reporting. Negative and inconclusive as well as positive results must be published or otherwise made publicly available. […]”

https://www.wma.net/policies-post/wma-declaration-of-helsinki-ethical-principles-for-medical-research-involving-human-subjects/

Good reporting is required by many journals

For example the Reporting requirements of the Nature Research journals aim to improve the transparency of reporting and reproducibility of published results across all areas of science. Before peer review, the corresponding author must complete an editorial policy checklist to ensure compliance with Nature Research editorial policies; where relevant, manuscripts sent for review must include completed reporting summary documents.

Nature portfolio Reporting Summary https://www.nature.com/documents/nr-reporting-summary-flat.pdf Nature Reporting requirements and reproducibility editorials https://www.nature.com/nature-portfolio/editorial-policies/reporting-standards#editorials

Database of reporting guidelines

“The EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network is an international initiative that seeks to improve the reliability and value of published health research literature by promoting transparent and accurate reporting and wider use of robust reporting guidelines.”

“It is the first coordinated attempt to tackle the problems of inadequate reporting systematically and on a global scale; it advances the work done by individual groups over the last 15 years.”

http://www.equator-network.org/reporting-guidelines/

The MDAR framework

“We were motivated to develop the MDAR Framework as part of our own and others’ attempts to improve reporting to drive research improvement and ultimately greater trust in science. Existing tools, such as the ARRIVE guidelines, guidance from FAIRSharing, and the EQUATOR Network, speak to important sub-elements of biomedical research. This new MDAR Framework aims to be more general and less deep, and therefore complements these important specialist guidelines.”

M McLeod et al. https://www.pnas.org/content/118/17/e2103238118

 

Other examples of reporting guidelines

M Michel et al. http://dmd.aspetjournals.org/content/dmd/48/1/64.full.pdf

 

T Hartung et al.https://www.altex.org/index.php/altex/article/view/1229

 

S Cruz Rivera et al. https://www.nature.com/articles/s41591-020-1037-7.pdf

 

M Appelbaum https://psycnet.apa.org/fulltext/2018-00750-002.html
⇒ also available for qualitative and mixed methods

 

R Poldrack et al. https://www.sciencedirect.com/science/article/pii/S1053811907011020?via%3Dihub

 

L Riek https://dl.acm.org/doi/pdf/10.5898/JHRI.1.1.Riek

 

Benefits of reporting guidelines

Benefits for researchers

  • Guidelines help at the protocol stage, e.g. with examples of how to reduce the risk of bias

  • Useful reminder of all necessary details at writing stage, especially for junior researchers

  • Appropriate reporting allows replication or inclusion in meta-research projects

  • Adherence increases chances of article acceptance at journals

 

Benefits for peer reviewers

  • Peer review is an important step but limited guidance is available

  • Key issues and methods that should be covered in an article can be found in reporting guideline

  • If the journal requests a completed checklist, the approach is even easier

  • Criticism can be justified by pointing to reporting guideline (or their explanation documents)

  • But adherence is not a guarantee of a high-quality study

 

Example: reporting of methods

“t-tests were used for comparisons of continuous variables and Fisher’s Exact test or Chi-squared test (where appropriate) were used for comparisons of binary variables”

versus

“The primary outcome, time to […], was analysed using a two-sample Wilcoxon rank-sum test. The secondary outcomes of […] were analysed using the Chi square and Fisher’s exact test, respectively, and the secondary outcome of time to […] was analysed using a two- sample Wilcoxon rank-sum test. All analyses were carried out on a per protocol basis using [software version]”

courtesy of M. Schussel of the Equator Network

Quiz on reporting guidelines

Reporting guidelines

Reporting guidelines are

  • only used in biomedicine
  • based on a wide consensus of experts
  • mainly useful for the reader

Solution

F only used in biomedicine
T based on a wide consensus of experts
F mainly useful for the reader

JARS quiz 1

Look at the Journal Article Reporting Standards for Quantitative Research in Psychology: The APA Publications and Communications Board Task Force Report (JARS)
M Appelbaum et al. https://doi.apa.org/fulltext/2018-00750-002.html

The guideline suggests grouping all hypotheses, analyses, and conclusions into

  • significant and non-significant
  • primary, secondary, and exploratory
  • novel, derived, and replication

Solution

F significant and non-significant
T primary, secondary, and exploratory
F novel, derived, and replication

JARS quiz 2

Look at the Journal Article Reporting Standards for Quantitative Research in Psychology: The APA Publications and Communications Board Task Force Report (JARS)
M Appelbaum et al. https://doi.apa.org/fulltext/2018-00750-002.html

For publications that report on new data collections regardless of research design the guideline includes information on:

  • where to report on registration of the underlying study
  • where to report on the availability of data
  • where to report a manual of procedures allowing replication

Solution

T where to report on registration of the underlying study: mainly for clinical trials
T where to report on the availability of data: specifically only for meta-analysis
T where to report a manual of procedures allowing replication: specifically for experimental studies

JARS quiz 3

Look at the Journal Article Reporting Standards for Quantitative Research in Psychology: The APA Publications and Communications Board Task Force Report (JARS)
M Appelbaum et al. https://doi.apa.org/fulltext/2018-00750-002.html

In the data diagnostics and analytic strategy sections the guideline suggests that information on the following be reported

  • in which case to exclude the data of participants from the study at the analysis stage
  • how to deal with missing data
  • which precise inferential statistics procedure to use

Solution

T in which case to exclude the data of participants from the study at the analysis stage
T how to deal with missing data
F which precise inferential statistics procedure to use: the guideline mentions a strategy, i.e. not a single procedure; it also suggests that this is to be specified for each type of hypothesis

 

4. Quality and credibility of a preprint: the Precheck checklist

What is Markdown and why do we learn about it?

Markdown is a lightweight markup language for creating formatted text using a plain-text editor. The goal is an easy-to-write and easy-to-read format, even as raw code. It is traditionally used for so-called readme files in software development and extensively as a tool to produce html code for websites. There are several flavors of the language that are used in different places but the basics are the same almost anywhere.

Since a file containing Markdown text only contains plain text and no binary information, it is a lightweight format. Moreover, changes in Markdown files are particularly easy to track.

For a reference sheet of the syntax, please see here: https://www.markdownguide.org/cheat-sheet/

We introduce Markdown here because it will be used in the following episodes of this course and we start to practice it while learning about and using the PRECHECK checklist.
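A minimal Markdown document with the elements used in this lesson might look like this (the text itself is only a placeholder):

```markdown
# My first Markdown document

1. A section with an itemized list

   - an item in **bold**
   - an item in *italics*

2. A section with a numbered list

   1. first point
   2. second point
```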

Markdown

Use the online Markdown editor Dillinger to create your first Markdown document including a title and two numbered sections, one containing an itemized list and the other a numbered list. Also include bold and italic text. You can use whatever text you like; if nothing else comes to mind you may simply use lorem ipsum.

Solution

No solution provided here.

Introduction to PRECHECK

As we have already seen preprints are manuscripts describing scientific studies that have not been peer-reviewed, that is, checked for quality by an unbiased group of scientists in the same field.

Preprints are typically posted online on preprint servers (e.g. BioRxiv, MedRxiv, PsyRxiv) instead of scientific journals. Anyone can access and read preprints freely, but because they are not verified by the scientific community, they can be of lower quality, risking the spread of misinformation. When the COVID-19 pandemic started, a lack of understanding of preprints led to low-quality research gaining popularity and even infiltrating public policy.

Inspired by such events, PRECHECK was created: a checklist to help assess the quality of preprints in psychology and medicine, and judge their credibility. This checklist was created with scientifically literate non-specialists in mind, such as students of medicine and psychology, and science journalists. The contents of PRECHECK are reproduced here with permission.

The checklist contains 4 items, see below or on the linked website. When using PRECHECK on a preprint, read each item and the Why is this important? section underneath it. Check if the preprint you are reading fulfills the item’s criteria; if yes, write down a yes for this item. In doing so, use your knowledge of the IMRAD structure and smart searching on the website or in the pdf.

Generally, the more “yes” answers your preprint gets on the checklist, the higher its quality, but this is only a superficial level of assessment. For a thorough, discriminative analysis of a preprint, you also need to consult the related Let’s dig deeper sections underneath most items. When using the checklist, it is recommended to have at hand both the preprint itself as a pdf and the webpage on the preprint server where the preprint was posted. You can also check online whether the preprint has already been peer-reviewed and published in a journal.

The checklist works best for studies with human subjects, using primary data (that the researchers collected themselves) or systematic reviews, meta-analyses and re-analyses of primary data. It is not ideally suited to simulation studies (where the data are computer-generated). In general, if the study sounds controversial, improbable, or too good to be true, we advise you to proceed with caution when reading it and to be especially critical.

The PRECHECK checklist

Below you find the checklist together with Why is this important? and Let’s dig deeper Sections. It can also be directly accessed in the Markdown PRECHECK checklist (without the Why is this important? Sections).

1. Research question

Is the research question/aim stated?

Why is this important?

A study cannot be done without a research question/aim. A clear and precise research question/aim is necessary for all later decisions on the design of the study. The research question/aim should ideally be part of the abstract and explained in more detail at the end of the introduction.

 

2. Study type

Is the study type mentioned in the title, abstract, introduction, or methods?

Why is this important?

For a study to be done well and to provide credible results, it has to be planned properly from the start, which includes deciding on the type of study that is best suited to address the research question/aim. There are various types of study (e.g., observational studies, randomised experiments, case studies, etc.), and knowing what type a study was can help to evaluate whether the study was good or not.

What is the study type?

Some common examples include:

  • observational studies - studies where the experimental conditions are not manipulated by the researcher and the data are collected as they become available. For example, surveying a large group of people about their symptoms is observational. So is collecting nasal swabs from all patients in a ward, without having allocated them to different pre-designed treatment groups. Analysing data from registries or records is also observational. For more information on what to look for in a preprint on a study of this type, please consult the relevant reporting guidelines: STROBE.

  • randomised experiments - studies where participants are randomly allocated to different pre-designed experimental conditions (these include Randomised controlled trials [RCTs]). For example, to test the effectiveness of a drug, patients in a ward can be randomly allocated to a group that receives the drug in question, and a group that receives standard treatment, and then followed up for signs of improvement. For more information on what to look for in a preprint on a study of this type, please consult the relevant reporting guidelines: CONSORT.

  • case studies - studies that report data from a single patient or a single group of patients. For more information on what to look for in a preprint on a study of this type, please consult the relevant reporting guidelines: CARE.

  • systematic reviews and meta-analyses - summaries of the findings of already existing, independent studies. For more information on what to look for in a preprint on a study of this type, please consult the relevant reporting guidelines: PRISMA.

Let’s dig deeper

If the study type is not explicitly stated, check whether you can identify the study type after reading the paper. Use the questions below for guidance:

  • Does the study pool the results from multiple previous studies? If yes, it falls in the category systematic review/meta-analysis.
  • Does the study compare two or more experimenter-generated conditions or interventions in a randomised manner? If yes, it is a randomised experiment.
  • Does the study explore the relationship between characteristics that were not experimenter-generated? If yes, it is an observational study.
  • Does the study document one or multiple clinical cases? If yes, it is a case study.

 

3. Transparency

a. Is a protocol, study plan, or registration of the study at hand mentioned?

b. Is data sharing mentioned? Mentioning any reasons against sharing also counts as a ‘yes’. Mentioning only that data will be shared “upon request” counts as a ‘no’.

c. Is materials sharing mentioned? Mentioning any reasons against sharing also counts as a ‘yes’. Mentioning only that materials will be shared “upon request” counts as a ‘no’.

d. Does the article contain an ethics approval statement (e.g., approval granted by institution, or no approval required)?

e. Have conflicts of interest been declared? Declaring that there were none also counts.

Why is this important?

Study protocols, plans, and registrations serve to define a study’s research question, sample, and data collection method. They are usually written before the study is conducted, thus preventing researchers from changing their hypotheses based on their results, which adds credibility. Some study types, like RCTs, must be registered.

Sharing data and materials is good scientific practice which allows people to review what was done in the study, and to try to reproduce the results. Materials refer to the tools used to conduct the study, such as code, chemicals, tests, surveys, statistical software, etc. Sometimes, authors may state that data will be “available upon request”, or during review, but that does not guarantee that they will actually share the data when asked, or after the preprint is published.

Before studies are conducted, they must get approval from an ethical review board, which ensures that no harm will come to the study participants and that their rights will not be infringed. Studies that use previously collected data do not normally need ethical approval. Ethical approval statements are normally found in the methods section.

Researchers have to declare any conflicts of interest that may have biased the way they conducted their study. For example, the research was perhaps funded by a company that produces the treatment of interest, or the researcher has received payments from that company for consultancy work. If a conflict of interest has not been declared, or if a lack of conflict of interest was declared, but a researcher’s affiliation matches with an intervention used in the study (e.g., the company that produces the drug that is found to be the most effective), that could indicate a potential conflict of interest, and a possible bias in the results. A careful check of the affiliation of the researchers can help identify potential conflicts of interest or other inconsistencies. Conflicts of interests should be declared in a dedicated section along with the contributions of each author to the paper.

Let’s dig deeper

a. Can you access the protocol/study plan (e.g., via number or hyperlink)?

b. Can you access at least part of the data (e.g., via hyperlink, or on the preprint server)? Not applicable in case of a valid reason for not sharing.

c. Can you access at least part of the materials (e.g., via hyperlink, or on the preprint server)? Not applicable in case of a valid reason for not sharing.

d. Can the ethical approval be verified (e.g., by number)? Not applicable if it is clear that no approval was needed.

By ‘access’, we mean whether you can look up and see the actual protocol, data, materials, and ethical approval. If you can, you can also look into whether it matches what is reported in the preprint.

 

4. Limitations

Are the limitations of the study addressed in the discussion/conclusion section?

Why is this important?

No research study is perfect, and it is important that researchers are transparent about the limitations of their own work. For example, many study designs cannot provide causal evidence, and some inadvertent biases in the design can skew results. Other studies are based on more or less plausible assumptions. Such issues should be discussed either in the Discussion, or even in a dedicated Limitations section.

Let’s dig deeper

Check for potential biases yourself. Here are some examples of potential sources of bias.

  1. Check the study’s sample (methods section). Do the participants represent the target population? Testing a drug only on white male British smokers over 50 is probably not going to yield useful results for everyone living in the UK, for example. How many participants were there? There is no one-size-fits-all number of participants that makes a study good, but in general, the more participants, the stronger the evidence.

  2. Was there a control group or control condition (e.g., placebo group or non-intervention condition)? If not, was there a reason? Having a control group helps to determine whether the treatment under investigation truly has an effect on an experimental group and reduces the possibility of making an erroneous conclusion. Not every study can have such controls though. Observational studies, for example, typically do not have a control group or condition, nor do case studies or reviews. If your preprint is on an observational study, case study, or review, this item may not apply.

  3. Was there randomisation? That is, was the allocation of participants or groups of participants to experimental conditions done in a random way? If not, was there a reason? Randomisation is an excellent way to ensure that differences between treatment groups are due to treatment and not confounded by other factors. For example, if different treatments are given to patients based on their disease severity, and not at random, then the results could be due to either treatment effects or disease severity effects, or an interaction - we cannot know. However, some studies, like observational studies, case studies, or reviews, do not require randomisation. If your preprint is on an observational study, case study, or review, this item may not apply.

  4. Was there blinding? Blinding means that some or all people involved in the study did not know how participants were assigned to experimental conditions. For example, if participants in a study do not know whether they are being administered a drug or a sham medication, the researchers can control for the placebo effect (people feeling better even after fake medication because of their expectation to get better). However, blinding is not always possible and cannot be applied in observational studies or reanalyses of existing non-blinded data, for example. If your preprint is on an observational study, case study, or review, this item may not apply.

Episode challenge

Use PRECHECK for two preprints

Question 1

Select two preprints in psychology and/or medicine that include human subjects. At least one of them should correspond to a preregistration. Review them using the checklist directly in the Markdown file. Create one Markdown file for the assessment of both preprints by copy pasting. Add clear sections for each preprint using Markdown syntax.

Answer at least all yes/no question for each preprint, add an explanation for your choice of answer.

Note: Use your knowledge of the IMRAD structure for smart searching in the articles, the idea is not that you read both articles in detail. Use a search strategy within the pdfs with appropriate terms. Do also look on the preprint servers, they do sometimes contain information that is not directly in the article (e.g. regarding data sharing).
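One possible skeleton for the shared assessment file, using Markdown headings and a numbered list per item (all titles and answers below are placeholders, not taken from any real preprint):

```markdown
# PRECHECK assessment

## Preprint 1: (title)

1. Research question: yes (stated at the end of the introduction).
2. Study type: no (not named explicitly; appears to be an observational study).
3. Transparency: (answer items a to e separately).
4. Limitations: yes (discussed in a dedicated Limitations section).

## Preprint 2: (title)
```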

Question 2

Find the preregistration(s) of the preprint and compare.

Question 3

Find the publication(s) of the preprint and compare.

Question 4

Look at the preprint server that has been used: what do you find on reporting guidelines? Which reporting guideline would be pertinent for this article? Do you find something on reporting guidelines in the published article? Do you find the policy of the journal regarding reporting guidelines?

Question 5

Find the journal policy on data sharing and comment.

Solution

No solution provided here.

Key Points

  • The structure of an article represents the steps of the scientific method

  • The structure of an article helps in finding information and to get started with reproduction/replication

  • There are some simple questions that can be asked when judging the quality of an article


Collaboration drives Open Science and is a challenge for reproducibility

Overview

Teaching: 45 min
Exercises: 60 min
Questions
  • Why is collaborative work especially important for Open and Reproducible Science?

  • What are tools that facilitate collaborative work?

Objectives
  • Learn about Open Science at CERN

  • Get to know OSF as one tool

  • Learn about version control systems

1. Use case: Open Science at CERN

The world wide web was invented at CERN and its leadership was visionary in making sure that the technology would be licensed under an open-source framework “with the explicit goal of preventing private companies from turning it into proprietary software”.

CERN and the particle physics community are trailblazers of the Open Science movement.

We aim to look at CERN’s approach to Open Science by reading three articles that appeared in the CERN Courier in 2019, see excerpts below and use the links to read the full articles.

Open science: a vision for collaborative, reproducible and reusable research

Image Credit: T.Simko.

“True open science demands more than simply making data available: it needs to concern itself with providing information on how to repeat or verify an analysis performed over given datasets, producing results that can be reused by others for comparison, confirmation or simply for deeper understanding and inspiration. This requires runnable examples of how the research was performed, accompanied by software, documentation, runnable scripts, notebooks, workflows and compute environments. It is often too late to try to document research in such detail once it has been published.”

S Dallmeier-Tiessen and T Simko https://cerncourier.com/a/open-science-a-vision-for-collaborative-reproducible-and-reusable-research/

Inspired by software

Image Credit: S Kulkarni.

“The underlying ideal is open collaboration: peers freely, collectively and publicly build software solutions. A second ideal is recognition, in which credit for the contributions made by individuals and organisations worldwide is openly acknowledged. A third ideal concerns rights, specifically the so-called four freedoms granted to users: to use the software for any purpose; to study the source code to understand how it works; to share and redistribute the software; and to improve the software and share the improvements with the community. Users and developers therefore contribute to a virtuous circle in which software is continuously improved and shared towards a common good, minimising vendor lock-in for users.”

G Tenaglia and T Smith https://cerncourier.com/a/inspired-by-software/

Preserving the legacy of particle physics

Image Credit: https://cerncourier.com/a/preserving-the-legacy-of-particle-physics/ with original CC-By in Phys. Lett. B 716 30.

“CMS aims to release half of each year’s level-three data three years after data taking, and 100% of the data within a ten-year window. By guaranteeing that people outside CMS can use these data, says Lassila-Perini, the collaboration can ensure that the knowledge of how to analyse the data is not lost, while allowing people outside CMS to look for things the collaboration might not have time for. To allow external re-use of the data, CMS released appropriate metadata as well as analysis examples.”

A Rao https://cerncourier.com/a/preserving-the-legacy-of-particle-physics/.

More information can be found in the article “Open is not enough” by X Chen et al. https://www.nature.com/articles/s41567-018-0342-2.

Open Science is about collaboration

  • Collaborative research is becoming increasingly important, because complex challenges require a diverse team-science approach, e.g. particle physics, drug development, or big-data projects in medicine and the social sciences.

  • Collaborative research entails specific practical issues that may affect reproducibility when different versions of files are worked on by several collaborators.

  • Collaborative tools can be used to make research accessible to the public beyond publications, e.g. protocols, code, data.

Quiz on open science at CERN

Reana

CERN’s REANA can be used to

  • publish finished analysis results
  • submit parameterised computational workflows to run on remote compute clouds
  • reinterpret preserved analyses
  • run “active” analyses before they are published

Solution

F publish finished analysis results
T submit parameterised computational workflows to run on remote compute clouds
T reinterpret preserved analyses
T run “active” analyses before they are published

Software

Having experienced first-hand its potential to connect physicists around the globe, in 1993 CERN released the web software into the:

Solution

public domain

Levels of open data at CERN

The four main LHC experiments have started to periodically release their data in an open manner, and these data can be classified into four levels. Check the correct level descriptions:

  • The first level consists of the numerical data underlying publications.
  • The second level concerns datasets in a simplified format that are suitable for “lightweight” analyses in educational or similar contexts
  • The third level are the data being used for analysis by the researchers themselves, requiring specialised code and dedicated computing resources.
  • The fourth level is the raw data generated by the detectors.

Solution

F The first level consists of the numerical data underlying publications.
T The second level concerns datasets in a simplified format that are suitable for “lightweight” analyses in educational or similar contexts
T The third level are the data being used for analysis by the researchers themselves, requiring specialised code and dedicated computing resources.
T The fourth level is the raw data generated by the detectors.

 

2. Some tools for collaboration

Open Science framework

Image Credit: https://thriv.virginia.edu/center-for-open-science-open-science-framework/

  • The framework is developed by the Center for Open Science (COS), a non-profit organisation in the USA with the mission to increase the openness, reproducibility and integrity of scientific research.

  • The main tool that they build and maintain is the Open Science Framework (OSF), which is free and open source.

  • The design principle of the tool is to make it easy to practice open and reproducible research practices at all of the many stages of the research lifecycle.

  • Researchers are encouraged through the framework to start engaging with the idea of what material to share systematically and early on. Sharing publicly but also with collaborators before the manuscript editing phase.

Introduction to the OSF

One of the best ways to learn about the OSF is through a video provided by COS. The video is long (although you may watch it at increased speed); make sure you learn about the following features of OSF:

You can find the video here.

Other tools: CRS primer

Other tools for collaboration have been summarized in the Primer “Digital collaboration” by the Center for Reproducible Science at the University of Zurich. The primer contains a few University of Zurich specific recommendations but is mostly applicable to anyone.

Quiz on the open science framework as a collaborative tool

Global unique identifiers

OSF distributes global unique identifiers

  • only at the project level
  • at the project, component and file level
  • each time you make changes to a project

Solution

F only at the project level
T at the project, component and file level
F each time you make changes to a project

Wiki syntax

Wikis on OSF can be written in a “what you see is what you get” way and using the syntax of:

Solution

Markdown

Version control in OSF

Binary file types such as Word files or pdfs are version controlled on OSF through

  • use of an online editor
  • adding version indicators to file names
  • recognition of file names in components

Solution

F use of an online editor
F adding version indicators to file names
T recognition of file names in components

 

3. What is version control and what is Git?

The purpose of Git is best explained with the cartoon below: it is a system designed to avoid situations like the one in the cartoon. Such systems are called version control systems, since they take care of versioning files without changing (and lengthening) file names. The purpose here is not to teach you Git, but to inform you in enough detail that you can decide whether or not you need to learn it. You will also learn a bit of terminology and get some links, so that getting started with Git should be easier.

Image Credit: “Piled Higher and Deeper” by Jorge Cham at https://www.phdcomics.com. (permission requested)

What is Git?

  • Git is:
    • a version control system, i.e., tracks changes incl. timestamps
    • the de facto standard
    • open source, developed by Linus Torvalds in 2005
  • Git runs on all major operating systems
  • Several IDEs (Integrated Development Environments) available:
    • RStudio
    • Eclipse https://www.eclipse.org/

Git has a reputation to be complicated

Git is a tool that originated in software development, and using it therefore requires a certain level of computer skills. As a result it has a reputation for being complicated. As with code-based analysis, there is indeed a certain learning curve in the beginning, but with just a bit of practice the advantages outweigh the initial investment.

Image Credit: Randall Munroe/xkcd at https://xkcd.com/1597/ licensed as CC BY-NC.

Why use it anyhow?

  • It provides a completely documented past
  • Collaborators have coordinated access to the same documents
  • It allows easy synchronization for local files (offline working)
  • Tools to resolve conflicts for text based files are available
  • and of course one can avoid file names like masterManuscript_v4_rf_0812_gh.doc
  • (further benefits on a code development level)

More terminology

Git is a decentralized version control system

Developers work directly with their own local repository on their computer, i.e. a folder on their computer: the local workspace in the below graphic.

By using the command “git add” they add files or folders to the local index, which is similar to a registry in a library. This step is also called staging. Then they “git commit” their staged changes to the local repository on their computer creating a version that will be kept in the system. Only with the “git push” command do they upload the changes to the remote server.

The next person working on the same repository will need to “git pull” the updated repository in order to access the changes.

Image Credit: Illustration of the most important git commands by Eva Furrer, CC-BY, https://doi.org/10.5281/zenodo.7994551.

Installation

Open Rstudio, go to the terminal tab and type git --version to check if you already have Git installed.

If you do not have it go to https://git-scm.com/downloads and choose the correct operating system for the download.

When you are ready to run Git locally on your computer you can start using it together with a remote service (see above).

Want to know more?

Quiz on Git

Advantages

Why is a version control system useful when working on analysis scripts, even if you are working alone?

  • Git allows you to review the history of your project.
  • If something breaks due to a change, you can fix the problem by reverting to a working version before the change.
  • Git retains local copies of repositories, resulting in fast operations.
  • Git automatically scans for all changes and hence tracks everything.

Solution

T Git allows you to review the history of your project.
T If something breaks due to a change, you can fix the problem by reverting to a working version before the change.
F Git retains local copies of repositories, resulting in fast operations.
F Git automatically scans for all changes and hence tracks everything.

 

Episode challenge

In-class task

First, we will add all participants to a common OSF project.

Task 1

We work on publicly available data from 13 weather stations in Switzerland: sunshine duration, precipitation, temperature and fresh snow (1931 – 2022), and ice days, frost days, summer days, heat days, tropical nights and precipitation days (1959 – 2022). We will collaboratively summarize the data into approximately 30-year averages (see below) per station for each of the 10 available characteristics. Create a corresponding csv file with 14 columns: one column identifying the characteristic and time period, and one column per station. Upload it to an OSF project in which all participants are members. Distribute the work of calculating the averages and entering them into the common csv file among the group of participants.

Note: use the approximate 30-year periods 1931 – 1958, 1959 – 1988, 1989 – 2022
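One participant's share of the work could look like the R sketch below. The station data frame, the column names and the random yearly values are made up for illustration; with real data you would read the station's yearly values from the downloaded files instead:

```r
# Fake yearly totals for one station and one characteristic (illustration only)
set.seed(1)
station <- data.frame(year     = 1931:2022,
                      sunshine = runif(92, 1500, 2000))

# Assign each year to one of the three agreed (approximate) 30-year periods
periods <- cut(station$year,
               breaks = c(1930, 1958, 1988, 2022),
               labels = c("1931-1958", "1959-1988", "1989-2022"))

# Average per period; these values would go into the station's column of the csv
averages <- tapply(station$sunshine, periods, mean)
round(averages, 1)
```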

Task 2

Add a Readme file to the project describing the content of the project, all participants agree on the wording and correct if necessary. Discuss the following questions. 1. What are the main difficulties when collaboratively editing the same file(s)? 2. What are the advantages in using text based files such as .R, .csv and .md files? and add the group’s thought in the Readme.

Key Points

  • Collaboration is fundamental for science, especially Open Science

  • Learning to use tools for collaboration is effective and helps to avoid problems


Reproducible notebooks for data analysis

Overview

Teaching: 60 min
Exercises: 120-150 min
Questions
  • Should I use a graphical user interface to analyse data or a code-based system?

  • What is literate programming and what is R Markdown?

  • How do I use R Markdown?

Objectives
  • Understand the advantages of code-based data analysis

  • Be able to create and adapt an R Markdown script

  • Adapt a YAML header

  • Use code chunks and choose the right chunk options

  • Practice R Markdown by answering simple questions

  • Bonus: add a table to an R Markdown script

1. Reproducibility of analysis code

Graphical user interface (GUI) vs. code

Statistical analysis software packages with graphical user interfaces such as SPSS (often) provide an easy solution to data analysis because

But it is often no longer clear how exactly (and with which commands) those results were obtained, and writing down all the steps can be very tedious. Even if some kind of history of the executed commands is saved, it might still be necessary to clean it up and keep only the relevant steps.

An analysis that is not reproducible can be an issue since

On the other hand doing a statistical analysis purely with code is

However, while it may seem like a big hurdle, modern programming languages designed for statistical computing, such as R, are usually quite straightforward to learn and use, and they have a lot of advantages:

R projects

Another very useful concept that enhances R’s usefulness via RStudio are R Projects. They allow you to

Connected to the use of Projects is the convention of using relative file paths (e.g. for loading csv files). So instead of writing something like read.csv("/home/user/Documents/Uni/UnderstandingORS/Topic_5/data/example.csv") you write read.csv("data/example.csv"). This is easier to write, more flexible and less prone to errors, because it keeps working as long as you keep the files in the project together. Imagine, for example, that you want to move your script (and data) to "/home/user/backup/Uni/UnderstandingORS/Topic_5/data/example.csv".
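A quick way to see this in action: the sketch below creates a small placeholder `data/example.csv` (made up for illustration) and reads it back with a relative path. Because no absolute path appears, the same two-line read would still work after moving the whole project folder elsewhere:

```r
# Create a tiny example data file inside the project (illustration only)
dir.create("data", showWarnings = FALSE)
writeLines(c("id,value", "1,3.2", "2,4.8"), "data/example.csv")

# Read it with a path relative to the project root
d <- read.csv("data/example.csv")
head(d)
```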

Quiz on R projects

File path

Suppose your current working directory is ~/project and you want to specify the relative path to the file ~/project/data/data.csv. What are possible specifications?

  • data.csv
  • project/data.csv
  • project/data/data.csv
  • data/data.csv

Solution

F data.csv
F project/data.csv
F project/data/data.csv
T data/data.csv

2. Literate programming and R Markdown

R Markdown is a realisation of the literate programming concept mixing narrative text with analysis code which is then rendered into formatted text and analysis results (numbers, tables and graphics). The concept of literate programming goes back to Donald Knuth, see e.g. from the open-science-with-r carpentries course:

*More generally, the mixture of code, documentation (conclusion, comments) and figures in a notebook is part of the so-called “literate programming” paradigm (Donald Knuth, 1984). Your code and logical steps should be understandable for human beings. In particular these four tips are related to this paradigm:*

  • Do not write your program only for R but think also of code readers (that includes you).
  • Focus on the logic of your workflow. Describe it in plain language (e.g. English) to explain the steps and why you are doing them.
  • Explain the “why” and not the “how”.
  • Create a report from your analysis using a R Markdown notebook to wrap together the data + code + text.

Parts of an R Markdown (.Rmd) file

Create a new Rmd

Execute the following steps on your computer while you read:

In Rstudio

A new .Rmd file should open with a short tutorial.

To render, or knit, the file to HTML, press Knit. The first time you run the script you will have to specify a name under which to save it; afterwards the script is always saved before rendering.

YAML header

The YAML header of an R Markdown files contains the meta data that influence the final document in different ways. See the short summary from the open-science-with-r carpentries course:

*The header of your R Markdown document allows you to personalize the resulting report. The header follows the YAML syntax (“YAML Ain’t Markup Language”), which usually uses a key: value syntax.*

For example, title: "titlename", where title is the key and "titlename" is the value.

The header itself starts with --- and stops with ---. For example:

---
title: "titlename"
output: html_document
---

More information about the YAML header can be found in the R Markdown cheat sheet.
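As a slightly fuller example, a header setting title, author, date and output format could look like this (all values here are placeholders):

```yaml
---
title: "My analysis"
author: "Jane Doe"
date: "2023-01-01"
output: html_document
---
```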

Code chunks

The narrative text of a report is written in the simple Markdown syntax. Code chunks are specific to R Markdown. They contain R code that is to be executed when rendering the chunk or the entire file, i.e. including the data analysis.

To start a chunk, write a line with three backticks directly followed by {r}, i.e. ```{r}; then place your R code on the following lines, and end the chunk with a line containing only three backticks (```). The r in ```{r} indicates that the programming language used in this chunk is R. Other options include python or bash, although we will not need these here.

Within RStudio a new code chunk can be included by either clicking on Insert a new code chunk in the toolbar or using a keyboard shortcut (Ctrl+Alt+I on Windows and Option+Command+I on Mac).

Each chunk can be run separately. To run the code in an individual chunk click on the green arrow (Run Current Chunk) on the right side of the chunk. Alternatively use the keyboard shortcut Ctrl+Alt+T (Windows) or Option+Command+T (Mac) to run the current chunk (i.e. where your cursor is located). This runs only the code in the specific chunk but does not render the entire file.

For more options see the cheat sheet in R studio: Help > Cheat Sheets > R Markdown Cheat Sheet or the link above.

The behavior of code chunks can be changed by setting chunk options. This is done in the opening of the chunk, e.g. ` ```{r, echo=FALSE}`, which hides the code of this chunk (while still evaluating it). For more options see the R Markdown cheat sheet or the R Markdown Cookbook.

  Note: inline R code, i.e. code directly within the narrative text, is written as a backtick, the letter r, a space, your code, and a closing backtick, e.g. `r 1 + 1`. This is for example useful when you mention a sample size in the text and want it to update directly from the data set you read in.
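Putting the pieces together, a minimal fragment of an .Rmd file combining narrative text, a chunk with an option, and inline code could look as follows (the object name d and its contents are made up for illustration):

````markdown
We first create a small data set; `echo=FALSE` hides the code in the report:

```{r make_data, echo=FALSE}
d <- data.frame(x = c(1, 2, 3))
```

The data set contains `r nrow(d)` observations.
````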

Quiz on literate programming and R Markdown

Literate programming

Which of the following statements about literate programming are true?

  • Literate programming combines code and text.
  • Literate programming makes your code run more efficient.
  • Literate programming makes your analysis easier to understand.
  • Code should only be shown if necessary.
  • Plots should not be included.

Solution

T Literate programming combines code and text.
F Literate programming makes your code run more efficient.
T Literate programming makes your analysis easier to understand.
F Code should only be shown if necessary.
F Plots should not be included.

YAML header: author

How do you specify the author in an R Markdown document?

  • author: “name”
  • Author: “name”
  • Author = “name”
  • Author: ‘name’
  • author: name

Solution

T author: “name”
F Author: “name”
F Author = “name”
F Author: ‘name’
T author: name

YAML header: date

How do you set the date?

  • Date: 01/01/2021
  • Date = 01/01/2021
  • datum: 01/01/2021
  • date: 01/01/2021

Solution

F Date: 01/01/2021
F Date = 01/01/2021
F datum: 01/01/2021
T date: 01/01/2021

Chunk options

How do you prevent the code in a chunk from being evaluated?

  • evaluate=FALSE
  • eval=FALSE
  • noeval=TRUE
  • hinder=TRUE
  • hind=TRUE
  • interpret=FALSE
  • inpr=FALSE

Solution

F evaluate=FALSE
T eval=FALSE
F noeval=TRUE
F hinder=TRUE
F hind=TRUE
F interpret=FALSE
F inpr=FALSE

Chunk options: figure height

How do you adjust the figure height?

  • figure_height=100
  • figureheight=100
  • heightfigure=100
  • fig_height=100
  • fig.height=100
  • height.fig=100

Solution

F figure_height=100
F figureheight=100
F heightfigure=100
F fig_height=100
T fig.height=100
F height.fig=100

R Markdown Practice

Modify the template found here by performing the following steps:

  • add author and date to YAML header
  • rename the first chunk to print_chunk_nr_1
  • set the chunk options of chunk 2 to not show the code
  • set the chunk options of chunk 3 to not evaluate the code but show it
  • set the chunk options of chunk 4 to not show the warning
  • complete the sentence at the end with appropriate information calculated in chunk 5

After these steps answer the questions:

  1. The percentage of children who survived the Titanic accident was ___ (Note: round to one decimal digit)
  2. The percentage of female survivors was ___ times as high as the percentage of male survivors. (Note: round to two decimal digits)

Solution

RMarkdown solution

  1. 52.3
  2. 3.45

 

Episode challenge

The goal of this challenge is to create a fully reproducible analysis within an R Markdown script that is easy to understand and read. For that, describe what you do, and do not forget about nice formatting. For example, try different Markdown syntax (headers, …), figure captions (hint: check the chunk options for this), a meaningful YAML header, etc.

Analysis of the palmer penguins data

Create a new R Markdown document, write the code for each of the questions below in a separate paragraph/chunk, and describe the result in a complete sentence directly in the R Markdown document. We will use the penguins dataset from the package palmerpenguins, available as the data set penguins after installing the package. To get an overview of the data, load the penguins dataset and explore the following questions:

 

Question 1

Find the source of the penguins dataset together with the URL to the data repository. Hint: run ?penguins

Solution 1

library(palmerpenguins)


The source is: Adélie penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. doi: 10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f

Gentoo penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Gentoo penguin (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. doi: 10.6073/pasta/7fca67fb28d56ee2ffa3d9370ebda689

Chinstrap penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguin (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 6. Environmental Data Initiative. doi: 10.6073/pasta/c14dfcfada8ea13a17536e73eb6fbe9e. Originally published in: Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

 

Question 2

Create a Markdown table describing the penguins data: for each numeric column in the dataset create a row in the table which should consist of the following columns: Column_Name, Mean, Variance
Hint: Check out the function knitr::kable and the chunk option results='asis'.

Solution 2

numericcols <- sapply(colnames(penguins), function(x) is.numeric(penguins[[x]]))
df <- data.frame(Column_Name = names(numericcols)[numericcols],
                 Mean = signif(apply((na.omit(penguins[numericcols])), 2, mean), 4),
                 Variance = signif(apply((na.omit(penguins[numericcols])), 2, var), 4),
                 row.names = NULL
                 )
knitr::kable(df)
Column_Name Mean Variance
bill_length_mm 43.92 2.981e+01
bill_depth_mm 17.15 3.900e+00
flipper_length_mm 200.90 1.977e+02
body_mass_g 4202.00 6.431e+05
year 2008.00 6.678e-01

 

Question 3

How many rows does the penguins dataset have?

Solution 3

result_question_3 <- dim(penguins)[1]

The data set has 344 rows.

 

Question 4

What is the first year of records in the data set?

Solution 4

result_question_4 <- min(penguins$year)

The first year of records is 2007.

 

Question 5

What is the total number of Adelie penguins?

Solution 5

result_question_5 <- sum(penguins$species == "Adelie")

The total number of Adelie penguins is 152

 

Question 6

What is the total number of missing values (NA)?

Solution 6

result_question_6 <- sum(is.na(penguins))

The total number of missing values (NA’s) is 19.

 

Question 7

What is the total number of rows with no missing values?

Solution 7

result_question_7 <- sum(apply(penguins, 1, function(x) !any(is.na(x))))

The number of complete rows (rows with no missing values, i.e. NA’s) is 333.

 

Question 8

On which islands were the Gentoo penguins found?

Solution 8

result_question_8 <- unique(penguins$island[penguins$species == "Gentoo"])

The name of islands where the Gentoo penguins were found is Biscoe.

 

Question 9

What is the proportion of Adelie penguins on Dream island (compared to all penguins on Dream island)?

Solution 9

result_question_9 <- sum(penguins$species == "Adelie" & 
                           penguins$island == "Dream") / 
                      sum(penguins$island == "Dream")

The proportion of Adelie penguins on Dream island is 0.4516129.

 

Question 10

What is the 93% quantile of the bill lengths in mm?

Solution 10

result_question_10 <- quantile(na.omit(penguins$bill_length_mm), 0.93)

The 93 % quantile of bill_length_mm is 51.3.

 

Question 11

What is the absolute mean difference of bill depth in mm between female and male penguins?

Solution 11

result_question_11 <- abs(coef(lm(bill_depth_mm ~ sex, penguins))[2])

The absolute mean difference of bill_depth_mm between female and male is 1.4656169.

 

Question 12

What is the 95% confidence interval of the slope of the linear regression with intercept of bill depth regressed on sex? Result will be a vector of two elements, e.g. c('lower_limit', 'upper_limit').

Solution 12

result_question_12 <- confint(lm(bill_depth_mm ~ sex, penguins), "sexmale" )

The 95% confidence interval of the slope of the linear regression between bill_depth_mm and sex is (1.0710254, 1.8602083).

 

Question 13

What is the proportion of Chinstrap penguins with flipper length in mm smaller than 205 and bill length in mm larger than 45 compared to all penguins with flipper length in mm smaller than 205 and bill length in mm larger than 45?

Solution 13

chins <- na.omit(penguins$species[penguins$flipper_length_mm < 205 & 
                                    penguins$bill_length_mm > 45])
result_question_13 <- sum(chins == "Chinstrap") / length(chins)

The proportion of Chinstrap penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 compared to all penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 is 0.9310345.

 

Question 14

What is the proportion of Chinstrap penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 compared to all Chinstrap penguins?

Solution 14

result_question_14 <- sum(chins == "Chinstrap") / sum(penguins$species == "Chinstrap")

The proportion of Chinstrap penguins with flipper_length_mm smaller than 205 and bill_length_mm larger than 45 compared to all Chinstrap penguins is 0.7941176.

 

Bonus challenge

R Markdown tables

For the following challenge we will use the package kableExtra which extends the base capabilities of knitr::kable to create tables. From the package vignette:

“The goal of kableExtra is to help you build common complex tables and manipulate table styles. It imports the pipe %>% symbol from magrittr and verbalize all the functions, so basically you can add “layers” to a kable output in a way that is similar with ggplot2 and plotly.”

For users who are not very familiar with the pipe operator %>% in R: it is the R version of the fluent interface. The idea is to pass the result along the chain for a more literal coding experience. Basically, when we say A %>% B, technically it means sending the results of A to B as B’s first argument.
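A minimal sketch of this semantics, using base R's native pipe |> (available from R 4.1), which behaves like %>% for this basic case:

```r
x <- c(4, 1, 3)

# A %>% B (or A |> B()) sends the result of A to B as B's first argument,
# so the following two calls are equivalent:
sort(x)      # nested/classic style
x |> sort()  # piped style

# pipes pay off when several steps are chained:
x |> sort() |> rev() |> head(2)  # same as head(rev(sort(x)), 2)
```

The piped version reads in the order the operations happen, which is the "fluent interface" idea described above.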

Simple tables can be generated as follows:

library(kableExtra)
head(penguins) %>% # the dataset; '%>%' passes the output of this command as input to the next command
 kbl() %>% # the kableExtra equivalent of knitr::kable, base table
 kable_classic() # add theme to table
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

For all options check the documentation or the vignette.

 

Task 1 Create the following table:

bill length [mm]
bill depth [mm]
species 2007 2008 2009 2007 2008 2009
Adelie 38.8 38.6 39.0 18.8 18.2 18.1
Chinstrap 48.7 48.7 49.1 18.5 18.4 18.3
Gentoo 47.0 46.9 48.5 14.7 14.9 15.3

Hint: check out the different styling functions, e.g. kable_classic.
Hint: For multiple column names use add_header_above
Hint: Use the following code to get started.

Solution

df_sum <- penguins %>% 
  dplyr::select(-sex, -island, -flipper_length_mm, -body_mass_g) %>% 
  dplyr::group_by(species, year) %>% 
  dplyr::summarise(dplyr::across(.fns = function(x) signif(mean(na.omit(x)), 3))) %>% 
  tidyr::pivot_wider(names_from = c(year), values_from = c(bill_length_mm, bill_depth_mm)) 

df_sum %>%
  kbl(col.names = c("species", rep(c("2007", "2008", "2009"), 2)))%>%
  kable_classic() %>%
  add_header_above(c(" " = 1, "bill length [mm]" = 3, "bill depth [mm]" = 3)) %>%
  kable_styling(bootstrap_options = c("hover")) %>% 
  column_spec (c(1, 4), border_right = T) 

 

Task 2 Create the following table which includes small graphs:

bill length [mm]
bill depth [mm]
species mean boxplot histogram mean boxplot histogram
Adelie 38.8 18.3
Chinstrap 48.8 18.4
Gentoo 47.5 15.0

Hint: Use column_spec for altering specific columns.

Solution

df_sum <- penguins %>% 
  dplyr::select(-island, -sex, -year, -body_mass_g, -flipper_length_mm) %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarise(dplyr::across(.cols = !contains("species"),
                                 .fns = function(x) 
                                   signif(mean(na.omit(x)), 3))) %>% 
  dplyr::mutate(bill_length_boxplot = "", bill_length_hist = "",
                bill_depth_boxplot = "", bill_depth_hist = "")

dfsum_list <- split(penguins$bill_length_mm, penguins$species)
dfsum_list2 <- split(penguins$bill_depth_mm, penguins$species)
df_sum %>%
  dplyr::select(species, 
                dplyr::starts_with("bill_length"), 
                dplyr::starts_with("bill_depth")) %>% 
  kbl(col.names = c("species", rep(c("mean", "boxplot", "histogram"), 2))) %>%
  kable_paper() %>%
  column_spec(1, border_right = TRUE) %>%
  column_spec(3, image = spec_boxplot(dfsum_list)) %>%
  column_spec(4, image = spec_hist(dfsum_list), border_right = T) %>%
  column_spec(6, image = spec_boxplot(dfsum_list2)) %>% 
  column_spec(7, image = spec_hist(dfsum_list2)) %>% 
  add_header_above(c(" " = 1, "bill length [mm]" = 3, "bill depth [mm]" = 3),
                   border_right = TRUE, border_left = TRUE)

Key Points

  • Code-based analysis is better for reproducibility.

  • Combining narrative and code-based results is even more beneficial.

  • Code chunks in R Markdown provide an easy solution.


Reproducible and honest visualizations

Overview

Teaching: 90 min
Exercises: 90-120 min
Questions
  • How to create graphs reproducibly?

  • How to transmit information truthfully in graphs?

  • What are the good practice principles for visualizations?

Objectives
  • Learn about dos and don'ts for honest graphs

  • Learn about good practice for visualizations

  • Apply the principles to concrete examples

  • Learn the necessary R code

Data visualization

Data visualization is omnipresent in science. Visualizations range from presenting raw data to illustrating analysis results or modeling outcomes. The way visualizations are constructed should, like any other part of the analysis, be reproducible and adhere to the basic principles of good scientific practice. For visualizations it is especially important to show the data honestly, without distorting the contained information towards an intended message; in other words, we present how to transmit information as truthfully as possible in graphs. You will practice reproducible data analysis skills while learning about best practice for graphs.

Good practice for visualizations

In the following sections we will have a look at different visualizations and things to be aware of when using them with the goal of transmitting information truthfully. The most important principles of good practice for visualizations are

1. Be simple, clear and to the point
2. Show the data
3. Be honest about the axes
4. Use colors sensibly

We provide code in ggplot. A short introduction to ggplot is provided in this episode of the Carpentries course: R for Social Scientists. A good reference for both systems, plots in base R and ggplot, is the book by RD Peng: Exploratory Data Analysis with R.

1. Be simple, clear and to the point

Encoding data using visual cues

As a basic principle it is useful to consider the relationship between visual cues, i.e. the type of visual encoding of quantitative data such as bars or areas, and how accurately a viewer perceives the encoded values. The graph below shows how accurately the different types of representation are perceived. Lengths (in the form of bars) represent the data most accurately, while volumes are much more difficult to perceive accurately.

Image credit: P Aldhous

The linked picture is based on Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods by William S. Cleveland and Robert McGill.

Therefore, when creating a visualization you should consider which type of visual cue represents the data best for transmitting the intended message. For good perception of a message it is clearly better to provide simple visualizations. We discuss some specific points in more detail below.

3D

Providing simple and easily perceptible visualizations implies that you should avoid 3-dimensional graphical representations in most circumstances. Consider the following visualization:

Image credit: Josh Taylor

As you can see (or rather cannot see!), some data is hidden behind the bars. Furthermore, it is difficult (and misleading) to compare bar heights across different depths.

Avoid occlusion

As a general principle we can conclude from the 3D example that you should avoid occluding parts of a visualization. An example can be found in the following plot showing multiple densities in the same panel. The different densities were colored according to group, but only the density in the front is fully visible.

An alternative is to plot lines which allows us to see all groups completely.

Pie charts

Pie charts can be considered an alternative to bar charts, although often not a good one since they use angles as visual cues. For instance look at the following three visualizations of a change of the count of three factors over time. First a barplot, second a stacked barplot and lastly three pie charts (on top of each other). Where are differences most visible?

Arrangement of plots

The arrangement of multiple plots and panels can also contribute to increasing the clarity of a visualization. Have a look at the following plot.

Two inconsistencies are present. First, the order of the samples in the top row and the bottom row is not the same. Second, in the top row var1 is on the y-axis while in the bottom row it is on the x-axis. Staying consistent, and in general choosing an arrangement that makes sense, helps to create a clear representation that transmits the desired information efficiently. A better alternative for the above plot is:

2. Show the data

Boxplots

Boxplots are used to give a rough overview of the distribution of a data set based on a few summary characteristics (quantiles). Consider the following three boxplots, each representing a different dataset. The boxplots look identical even though the underlying distributions may not be.

The code for the above plot:

ggplot(df_long) +
  geom_boxplot(aes(y = y, x = dataset))

Violin plots are an alternative to boxplots. They are based on an estimation of the underlying probability density, i.e. they use more information inherent in the data set. Have a look at the following three violin plots of the same datasets as above. Again, two of the violin plots look similar but the underlying data may not be identical.

Let’s finally have a look at the actual data. As you can see the samples x1 and x3 are in fact very distinct, or more precisely, x3 seems to have only 5 possible values.

So why did the boxplot not show the distributional differences? Since boxplots only show certain quantiles (usually the quartiles, i.e., 25%, 50% and 75%, plus “outliers”) plots of different datasets having the same or similar quantiles appear identical. The quartiles of the three data sets are

dataset q25 q50 q75
x1 -2.1 0 2.1
x2 -2.0 0 2.0
x3 -2.0 0 2.0
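This effect can be reproduced with a small sketch (hypothetical data, base R): a sample with only five distinct values has exactly the quartiles of the x3 row above, so its boxplot would be indistinguishable from that of a continuous sample with the same quartiles:

```r
# 100 observations, but only 5 distinct values (cf. dataset x3)
x_disc <- rep(c(-3, -2, 0, 2, 3), each = 20)

# the quartiles alone cannot reveal the discreteness
quantile(x_disc, c(0.25, 0.5, 0.75))  # -2, 0, 2

# tabulating the raw values exposes it immediately
table(x_disc)
```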

Violin plots show a mirrored estimate of the underlying density obtained with a smoothing technique. Loosely speaking, smoothing means that a smooth version of the underlying probability distribution is created, starting from a histogram. The degree of smoothing, ranging in this case from a histogram to a straight line, determines the actual appearance of the plot. For the violin plot the degree of smoothing is chosen automatically. Already histograms with 5 bins of the data x1 and x3 would be very similar, and hence the smoothed versions thereof as well.

Caution is furthermore advised if the datasets being compared are of very different sizes, because more data usually gives you higher confidence in the observed distribution. It is therefore good practice to always look at the actual data first, and not just at summaries (like boxplots and violin plots), in order to detect anomalies.

Another option is the use of geom_jitter (or geom_sina from the ggforce package) in combination with violin plots:

The code for the above plot:

ggplot(df_long) +
  geom_violin(aes(y = y, x = dataset)) +
  geom_jitter(aes(y = y, x = dataset), width = 0.3)

The advantage is that individual points as well as the distribution are shown.

Boxplots together with geom_jitter are another possibility.

Another possibility is to only show the jittered data:

Bar plots

The same as discussed before for boxplots also holds for barplots. If you have continuous data and see the following barplots you might conclude that the data sets are the same:

But if you also show the individual points you can see clear differences:

When using barplots with error bars, it is important to state what the error bars mean. Do they correspond to the standard deviation, the standard error or a confidence interval? There is no universal answer as to which one to use and, if possible, other types of visualizations should be preferred.
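As a reminder of how different these quantities are, here is a short base R sketch on a hypothetical sample:

```r
x <- c(4.1, 5.3, 4.8, 5.9, 5.0, 4.6)  # hypothetical measurements
n <- length(x)

sd_x <- sd(x)           # standard deviation: spread of the data
se_x <- sd_x / sqrt(n)  # standard error: uncertainty of the mean

# half-width of a 95% confidence interval for the mean
ci_half <- qt(0.975, df = n - 1) * se_x

# the three values differ substantially, so a figure caption
# must state which one the error bars show
c(sd = sd_x, se = se_x, ci_half = ci_half)
```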

3. Be honest about the axes

The axes of plots determine how much information you provide and where you put the focus. You could cut axes, blow certain parts of an axis up through transformation, or hide information on certain scales if you do not transform. You can expose or hide information by choosing the aspect ratio between the x and y axis. You can provide clear and precise information through meaningful labeling of axes and axis tick marks, or you can obscure the same information by deliberately choosing uninformative tick locations. These issues are illustrated through examples in the following sections.

Cutting axes

Let’s consider the following two barplots. The first has a shortened axis range and shows clear differences between the datasets. The second plot, on the other hand, shows the entire axis starting from zero, and the differences disappear.

See this concrete example of cutting an axis, which makes differences appear much larger than they really are:

Image credit: Helena Jambor

Axis transformation

In some cases you might have data on completely different scales, meaning that there are differences to be seen at several orders of magnitude. In these cases it can often be helpful to apply an axis transformation. For instance consider the following untransformed plot:

There seems to be some structure, but especially for the low values it is not clear what is going on. If instead you apply a log10 transformation to the x-axis, things get much clearer. Axis transformations are also something to consider if you have, for example, non-linear scales. But beware: transformations can also be used to showcase differences that do not really matter in practice.

Aspect ratio

The aspect ratio is another important parameter that can be manipulated to overstress certain patterns. For example, have a look at the following two plots. The first has a ratio of one, meaning the scales of the x and y axes are the same. The second plot has an aspect ratio of 1/4, meaning the x axis is substantially longer.

Code for the above plot:

ggplot(df) +
  geom_point(aes(x, y)) +
  coord_fixed(ratio = 1)
ggplot(df) +
  geom_point(aes(x, y)) +
  coord_fixed(ratio = 1 / 4)

Visually, the second plot implies that the variance of x is much higher than that of y, which is not the case:

summarise(df, x = var(x), y = var(y))
          x         y
1 0.8332328 0.9350631

Also consider the following real example. Where does the increase look the most dramatic?

Image credit: Demetris Christodoulou

Bin width of histograms

The appearance of a histogram is determined by the bin width used to create it. If you use a very large bin width (or a low total number of bins) you might see something like the following, and you would probably consider the distribution to be approximately uniform.

If on the other hand you decrease the binwidth (or increase the number of bins) you might see something like this:

This makes it quite obvious that the distribution is most definitely not uniform (on this scale). Choosing a good bin width is not easy and depends largely on the context.

With geom_rug you can mark the position of individual observations:

Code for the above plot:

ggplot(df, aes(x)) +
  geom_histogram(binwidth = 0.5) +
  geom_rug()

Axis scales in multiple plots

If you provide plots in multiple panels, each using the same variables, you need to pay attention to the scale of each subplot. For example have a look at the following plot.

At first glance the distribution of each of the three samples looks the same. But if you look closely you can see that the scales are not the same for each subplot. If you instead keep the scale the same across subplots you get a visualization with clear differences of the distributions between the different samples.

Code for the above plot:

ggplot(df) +
  geom_point(aes(x, y)) +
  facet_wrap( ~ sample)

4. Use colors sensibly

Color contrast

Trying to encode more than 8 categories with colors is usually not a good idea, as distinguishing between the colors can become very difficult:

In such a case it can be a better idea to directly label the points:

Code for the above plot:

mtcars %>% 
  rownames_to_column() %>% 
  ggplot() +
  geom_point(aes(mpg, disp, color = cyl)) +
  ggrepel::geom_label_repel(aes(mpg, disp, label = rowname),
                            size = 2.5, label.size = 0.1,
                            label.padding = 0.1)

See also: Common pitfalls of color use in Fundamentals of Data Visualization.

Color vision deficiency

About 1 of every 12 people is affected by some type of color vision deficiency (see, e.g., here). This is important to keep in mind when choosing colors for visualizations. For example consider the following scatter plot using a Red-Yellow-Green color palette, knowing that Red-Green colorblindness is the most frequent type of color deficiency.

To check how the plots appear to people with color vision deficiencies you can use the cvd_grid function from the colorblindr package (install instructions on the GitHub colorblindr repo).

colorblindr::cvd_grid()  # simulates color vision deficiencies for the last plot by default

Using a different color palette can help. For example the following:

Code for the above plot:

ggplot(mtcars) +
  geom_point(aes(mpg, disp, color = factor(carb))) +
  scale_color_OkabeIto()

Another option is the dichromat package (CRAN link) which features multiple palettes for people with red-green colorblindness.
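Note that an Okabe-Ito palette also ships with base R (grDevices, R >= 4.0) via palette.colors, so a colorblind-friendly palette is available even without extra packages:

```r
# base R's built-in colorblind-friendly Okabe-Ito palette
okabe <- palette.colors(n = 8, palette = "Okabe-Ito")
okabe  # a named character vector of colors
```

The resulting vector can, for example, be passed to scale_color_manual(values = unname(okabe)) in ggplot2.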

Quiz

Let’s have a look at the following graph. Image credit: we discovered the graph here: https://statmodeling.stat.columbia.edu/2008/06/06/new_candidate_f/. The picture is not from a scientific study but originates from a blog: http://www.shakesville.com/2007/09/rotundity-revealed.html.

1

What aspect of the “Be simple, clear and to the point” input has been violated?

  • 3D
  • occlusion
  • use of pie charts
  • arrangement of multiple plots

Solution

T 3D
T occlusion
F use of pie charts
F arrangement of multiple plots

2

Is the data shown appropriately by the plot through

  • the height of bars?
  • the values on bars?
  • additional values in white font?
  • the tick marks indicating to which value each bar belongs?
  • the raw data?

Solution

  • the height of bars?
  • the values on bars?
  • additional values in white font?
  • the tick marks indicating to which value each bar belongs?
  • the raw data?

3

What could be the reason that the Y-axis is shown starting at the value 55?

  • the value 55 could be the smallest possible value in the context
  • the differences between the bars are more pronounced than if the Y-axis starts at zero.
  • the values below 55 need to be hidden

Solution

T the value 55 could be the smallest possible value in the context
T the differences between the bars are more pronounced than if the Y-axis starts at zero.
T the values below 55 need to be hidden

4

When you think about the information regarding the axes, please tick which one of the following items is present in the plot

  • Sensible X-axis tick label display
  • Sensible X-axis label
  • Sensible Y-axis tick values
  • Sensible Y axis label

Solution

F Sensible X-axis tick label display
F Sensible X-axis label
F Sensible Y-axis tick values
F Sensible Y axis label

5

Does the displayed grid help to determine the height of the color sections of the bars?

  • Yes
  • No

Solution

F Yes
T No

6

Is the used color palette color-blind friendly?

  • Yes
  • No

Solution

F Yes
T No

7

Which of the following additional information items does the plot feature?

  • Informative title
  • Informative legend labels
  • Provenance of data
  • Context of data

Solution

F Informative title
F Informative legend labels
F Provenance of data
F Context of data

8

Does the course team think this is a good plot?

  • Yes
  • No

Solution

F Yes
T No

 

 

 

Episode challenge

For this challenge we will work with climate data published by the Bundesamt für Statistik (BFS), in which various climate-related variables measured at different locations in Switzerland have been put together. The data has already been wrangled into a csv file that you can download from here.

The source data was downloaded from here: https://www.bfs.admin.ch/asset/de/je-d-02.03.03.02 and here: https://www.bfs.admin.ch/asset/de/je-d-02.03.03.03

Task 1: data exploration

In this first task read in the climate_data.csv file and do a short exploration of the dataset.

1.1 First look

Show the top 3 rows of the dataset and additionally a short summary of the dataset (Hint: use summary). Describe what you observe in a few words.

Solution

head(climatedf_comp, n = 3)
  Year       Location Sunshine_duration Altitude Annual_Precipitation
1 1931 BaselBinningen          1594.317      316                816.0
2 1931 BernZollikofen          1742.500      553               1137.6
3 1931          Davos          1767.600     1594               1077.3
  Annual_temperature Annual_ice_days Annual_frost_days Annual_summer_days
1                8.5              NA                NA                 NA
2                7.2              NA                NA                 NA
3                1.7              NA                NA                 NA
  Annual_heat_days Annual_tropic_days Annual_precipitation_days
1               NA                 NA                        NA
2               NA                 NA                        NA
3               NA                 NA                        NA
summary(climatedf_comp)
      Year        Location         Sunshine_duration    Altitude     
 Min.   :1931   Length:1170        Min.   :1046      Min.   : 273.0  
 1st Qu.:1953   Class :character   1st Qu.:1557      1st Qu.: 411.0  
 Median :1976   Mode  :character   Median :1725      Median : 485.0  
 Mean   :1976                      Mean   :1759      Mean   : 805.9  
 3rd Qu.:1998                      3rd Qu.:1937      3rd Qu.: 776.0  
 Max.   :2020                      Max.   :2741      Max.   :2501.0  
                                   NA's   :129                       
 Annual_Precipitation Annual_temperature Annual_ice_days  Annual_frost_days
 Min.   : 338.9       Min.   :-3.300     Min.   :  0.00   Min.   :  1.00   
 1st Qu.: 829.9       1st Qu.: 6.925     1st Qu.:  5.00   1st Qu.: 60.25   
 Median :1050.5       Median : 9.000     Median : 17.00   Median : 87.00   
 Mean   :1212.9       Mean   : 7.755     Mean   : 33.05   Mean   :107.99   
 3rd Qu.:1411.8       3rd Qu.:10.400     3rd Qu.: 41.00   3rd Qu.:120.00   
 Max.   :3704.2       Max.   :13.900     Max.   :218.00   Max.   :289.00   
                                         NA's   :364      NA's   :364      
 Annual_summer_days Annual_heat_days Annual_tropic_days
 Min.   :  0.00     Min.   : 0.000   Min.   : 0.000    
 1st Qu.: 12.25     1st Qu.: 0.000   1st Qu.: 0.000    
 Median : 38.00     Median : 2.000   Median : 0.000    
 Mean   : 36.95     Mean   : 6.257   Mean   : 1.561    
 3rd Qu.: 56.00     3rd Qu.:10.000   3rd Qu.: 1.000    
 Max.   :125.00     Max.   :56.000   Max.   :40.000    
 NA's   :364        NA's   :364      NA's   :364       
 Annual_precipitation_days
 Min.   : 64.0            
 1st Qu.:100.0            
 Median :120.0            
 Mean   :119.8            
 3rd Qu.:136.0            
 Max.   :229.0            
 NA's   :365              

1.2 Which has been the hottest year?

Solution

climatedf_comp %>% 
  dplyr::group_by(Year) %>% 
  dplyr::summarise(mean_temp = mean(Annual_temperature)) %>% 
  dplyr::filter(mean_temp == max(mean_temp)) %>% 
  dplyr::pull(Year)
[1] 2018

1.3 Which has been the coldest year?

Solution

climatedf_comp %>% 
  dplyr::group_by(Year) %>% 
  dplyr::summarise(mean_temp = mean(Annual_temperature)) %>% 
  dplyr::filter(mean_temp == min(mean_temp)) %>% 
  dplyr::pull(Year)
[1] 1956

Task 2: visualization

2.1 Association of Annual_temperature and Year

The goal is to visualize the association between Annual_temperature and Year. To improve readability we will only look at the locations ZürichFluntern, Säntis, Samedan and LocarnoMonti.

Choose a suitable visualization (maybe consider looking at the decision tree) and plot the respective graph.

Solution

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern", 
                                "Säntis", 
                                "Samedan", 
                                "LocarnoMonti")) %>%
  ggplot() +
  geom_line(aes(Year, Annual_temperature, color = Location)) +
  labs(y = "Annual temperature")

2.2 Add information on the altitude

Based on the previous plot, update it to also include information about the altitude. Make sure that the location information is still provided.

Solution

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern", 
                                "Säntis", 
                                "Samedan", 
                                "LocarnoMonti")) %>%
  ggplot() +
  geom_line(aes(Year, Annual_temperature, color = Altitude, group = Location)) +
  geom_label(aes(Year, Annual_temperature, label = Location), 
             data = climatedf_comp %>% 
               dplyr::filter(Location %in% c("ZürichFluntern", 
                                             "Säntis", 
                                             "Samedan", 
                                             "LocarnoMonti")) %>% 
               dplyr::filter(Year == min(Year) + 5), nudge_y = 1) +
  labs(y = "Annual temperature")

2.3 Normalization

In the next step we want to normalize the annual temperature using the years before 1951 as a baseline. I.e. calculate the mean Annual_temperature over all years with Year < 1951 for each Location and subtract this value from Annual_temperature. Present a visualization that allows studying the deviation from this baseline by location.

Solution

climatedf_comp_translated <- climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern", 
                                "Säntis", 
                                "Samedan", 
                                "LocarnoMonti")) %>% 
  dplyr::group_by(Location) %>% 
  dplyr::mutate(mean_temperature = mean(Annual_temperature[Year < 1951]),
                Annual_temperature = Annual_temperature - mean_temperature)

ggplot(climatedf_comp_translated) +
  geom_line(aes(Year, Annual_temperature, color = Altitude, group = Location)) +
  facet_wrap( ~ Location) +
  labs(y = "Annual temperature deviation from pre-1951 mean")
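The group-wise baseline subtraction can be checked on a small toy data set (hypothetical values); in base R, stats::ave() computes the per-location pre-1951 mean and replicates it to every row of the group:

```r
# Hypothetical toy data (made-up values), two locations
toy <- data.frame(
  Location = rep(c("A", "B"), each = 4),
  Year = rep(c(1949, 1950, 1960, 1970), 2),
  Annual_temperature = c(5, 7, 8, 9, 1, 3, 4, 6)
)

# Per-location mean of the pre-1951 years, replicated to every row
baseline <- ave(ifelse(toy$Year < 1951, toy$Annual_temperature, NA),
                toy$Location,
                FUN = function(x) mean(x, na.rm = TRUE))
toy$deviation <- toy$Annual_temperature - baseline
```

For location A the baseline is mean(5, 7) = 6, so the deviations are -1, 1, 2 and 3; this mirrors what the grouped mutate() in the solution computes.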

2.4 Associations between Annual_Precipitation and Sunshine_duration

The next goal is to explore associations between Annual_Precipitation and Sunshine_duration for the locations ZürichFluntern, Säntis, Samedan and LocarnoMonti. Present at least two different types of plots.

Solution

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern",
                                "Säntis",
                                "Samedan",
                                "LocarnoMonti")) %>%
  tidyr::drop_na() %>%
  ggplot() +
  geom_point(aes(Annual_Precipitation, Sunshine_duration, color = Location))

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern",
                                "Säntis",
                                "Samedan",
                                "LocarnoMonti")) %>% 
  tidyr::drop_na() %>%
  ggplot() +
  geom_violin(aes(Location, Annual_Precipitation, color = Sunshine_duration)) +
  ggforce::geom_sina(aes(Location, Annual_Precipitation, color = Sunshine_duration))
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern",
                                "Säntis",
                                "Samedan",
                                "LocarnoMonti")) %>% 
  tidyr::drop_na() %>%
  ggplot() +
  geom_boxplot(aes(Location, Sunshine_duration, color = Annual_Precipitation)) +
  geom_jitter(aes(Location, Sunshine_duration, color = Annual_Precipitation))
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?

Bonus challenge

We have already briefly looked at facets, which make it easy to arrange multiple plots. But so far we have only considered the case where each subplot shows the same variables, e.g. Sunshine_duration vs. Annual_frost_days. What if instead you would like to use facets to plot multiple variables? For instance, you might want a plot containing two subplots, the first showing Annual_frost_days vs. Sunshine_duration and the second Annual_summer_days vs. Sunshine_duration.

There are basically two options:

  • Combine two separately created plots into one figure.

  • Reshape the data into long format and use facets.

We will explore both options in the following.

 

Combine plots

There are many options for combining plots. Two useful packages are cowplot (for all graphics) and ggpubr (for ggplots). In this exercise we will use ggpubr.

 

Exercise 1

Create two ggplot2 scatterplots, Annual_frost_days vs. Sunshine_duration and Annual_summer_days vs. Sunshine_duration, color by location. Combine the two plots using ggpubr::ggarrange and make sure to have only one legend. Also make sure to have the same axis range in both plots.

Solution

minmax <- c(min = min(na.omit(c(climatedf_comp$Annual_frost_days, 
                                climatedf_comp$Annual_summer_days))),
            max = max(na.omit(c(climatedf_comp$Annual_frost_days, 
                                climatedf_comp$Annual_summer_days))))
pl1 <- climatedf_comp %>% 
  ggplot() +
  geom_point(aes(Annual_frost_days, Sunshine_duration, color = Location)) +
  xlim(minmax) +
  labs(x = "Annual frost days",
       y = "Sunshine duration")

pl2 <- climatedf_comp %>% 
  ggplot() +
  geom_point(aes(Annual_summer_days, Sunshine_duration, color = Location)) +
  xlim(minmax) +
  labs(x = "Annual summer days",
       y = "Sunshine duration")

ggpubr::ggarrange(pl1, pl2, common.legend = TRUE)

 

Facets

The second option is to use facets (e.g. ggplot2::facet_wrap). Since our data is currently not in the correct format, we first have to bring it into shape. This can be done using tidyr::pivot_longer, which transforms data from wide to long format. In the wide format each row holds multiple measured values, while in the long format each row holds a single value and the remaining columns identify the observation. You can learn more about pivoting and long and wide formats by running vignette("pivot", package = "tidyr") in the console.
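The wide-to-long reshaping can be illustrated on a two-row toy table (hypothetical values); base R's reshape() does the same job as tidyr::pivot_longer here:

```r
# Hypothetical wide-format data: one row per location, two value columns
wide <- data.frame(
  Location = c("A", "B"),
  Annual_frost_days = c(60, 120),
  Annual_summer_days = c(40, 5)
)

# Stack both value columns into a single `value` column,
# with `name` recording which variable each value came from
long <- reshape(wide,
                varying = c("Annual_frost_days", "Annual_summer_days"),
                v.names = "value",
                timevar = "name",
                times = c("Annual_frost_days", "Annual_summer_days"),
                direction = "long")
```

The two value columns become 2 × 2 = 4 rows, and the `name` column is exactly what facet_wrap( ~ name) splits on in the solution below.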

 

Exercise 2

Use tidyr::pivot_longer to bring the data into long format and plot Annual_frost_days vs. Sunshine_duration and Annual_summer_days vs. Sunshine_duration in the same plot using ggplot2::facet_wrap.

Hint: The columns to pivot into longer format are Annual_frost_days and Annual_summer_days.

Solution

climatedf_comp %>% 
  dplyr::select(Location, Year, Sunshine_duration, 
                Annual_frost_days, Annual_summer_days) %>% 
  dplyr::rename(`Annual frost days` = Annual_frost_days,
                `Annual summer days` = Annual_summer_days) %>% 
  tidyr::pivot_longer(cols = c("Annual frost days", "Annual summer days")) %>% 
  ggplot() +
  geom_point(aes(value, Sunshine_duration, color = Location)) +
  facet_wrap( ~ name) +
  labs(x = "Days", y = "Sunshine duration") 

 

Optional: ggplot2 theme

Rotate axis text

In some situations where the labels on the x-axis are long, they can overlap with the default settings:

A solution can be to rotate the labels:

Reproduce the above plot.
Hint: use the argument axis.text.x in the theme function and make sure to check the expected input class for axis.text.x.

Solution

climatedf_comp %>% 
  ggplot() + 
  geom_violin(aes(Location, Sunshine_duration, color = Location)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(y = "Sunshine duration", x = "")

Custom colors

You can generate a custom color palette using RColorBrewer::brewer.pal. The generated colors can then be used in combination with scale_color_manual(values = generated_colors).

Solution

climatedf_comp_red <- climatedf_comp[climatedf_comp$Location %in% 
                                       c("Luzern","ZürichFluntern","Lugano"), ]
colors_use <- RColorBrewer::brewer.pal(length(unique(climatedf_comp_red$Location)), 
                                       "Set2")

climatedf_comp_red %>% 
  ggplot() +
  geom_violin(aes(Location, Sunshine_duration, color = Location, fill = Location)) +
  scale_color_manual(values = colors_use) +
  scale_fill_manual(values = colors_use) +
  labs(y = "Sunshine duration", x = "")
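If RColorBrewer is not installed, base R (>= 3.6) ships comparable qualitative palettes via grDevices::hcl.colors. A minimal sketch, assuming three locations as in the solution above:

```r
# Three hex colors from a built-in qualitative palette;
# "Set 2" is one of the palettes listed by grDevices::hcl.pals()
n_locations <- 3  # e.g. Luzern, ZürichFluntern, Lugano
colors_use <- grDevices::hcl.colors(n_locations, palette = "Set 2")
```

The resulting vector of "#RRGGBB" strings can be passed to scale_color_manual() and scale_fill_manual() just like the brewer.pal output.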

Key Points

  • Be simple, clear and to the point

  • Show the data

  • Be honest about the axes

  • Use colors sensibly