This lesson is still being designed and assembled (Pre-Alpha version)

Collaboration drives Open Science and is a challenge for reproducibility

Overview

Teaching: 45 min
Exercises: 60 min
Questions
  • Why is collaborative work especially important for Open and Reproducible Science?

  • What are tools that facilitate collaborative work?

Objectives
  • Learn about Open Science at CERN

  • Get to know OSF as one tool

  • Learn about version control systems

1. Use case: Open Science at CERN

The World Wide Web was invented at CERN, and its leadership was visionary in making sure that the technology would be licensed under an open-source framework “with the explicit goal of preventing private companies from turning it into proprietary software”.

CERN and the particle physics community are trailblazers of the Open Science movement.

We will look at CERN’s approach to Open Science by reading three articles that appeared in the CERN Courier in 2019. Excerpts are given below; use the links to read the full articles.

Open science: a vision for collaborative, reproducible and reusable research

Image Credit: T.Simko.

“True open science demands more than simply making data available: it needs to concern itself with providing information on how to repeat or verify an analysis performed over given datasets, producing results that can be reused by others for comparison, confirmation or simply for deeper understanding and inspiration. This requires runnable examples of how the research was performed, accompanied by software, documentation, runnable scripts, notebooks, workflows and compute environments. It is often too late to try to document research in such detail once it has been published.”

S Dallmeier-Tiessen and T Simko https://cerncourier.com/a/open-science-a-vision-for-collaborative-reproducible-and-reusable-research/

Inspired by software

Image Credit: S Kulkarni.

“The underlying ideal is open collaboration: peers freely, collectively and publicly build software solutions. A second ideal is recognition, in which credit for the contributions made by individuals and organisations worldwide is openly acknowledged. A third ideal concerns rights, specifically the so-called four freedoms granted to users: to use the software for any purpose; to study the source code to understand how it works; to share and redistribute the software; and to improve the software and share the improvements with the community. Users and developers therefore contribute to a virtuous circle in which software is continuously improved and shared towards a common good, minimising vendor lock-in for users.”

G Tenaglia and T Smith https://cerncourier.com/a/inspired-by-software/

Preserving the legacy of particle physics

Image Credit: https://cerncourier.com/a/preserving-the-legacy-of-particle-physics/ with original CC-By in Phys. Lett. B 716 30.

“CMS aims to release half of each year’s level-three data three years after data taking, and 100% of the data within a ten-year window. By guaranteeing that people outside CMS can use these data, says Lassila-Perini, the collaboration can ensure that the knowledge of how to analyse the data is not lost, while allowing people outside CMS to look for things the collaboration might not have time for. To allow external re-use of the data, CMS released appropriate metadata as well as analysis examples.”

A Rao https://cerncourier.com/a/preserving-the-legacy-of-particle-physics/.

More information can be found in the article “Open is not enough” by X Chen et al. https://www.nature.com/articles/s41567-018-0342-2.

Open Science is about collaboration

  • Collaborative research is becoming more and more important because complex challenges require a diverse team science approach, e.g. in particle physics, drug development, or big data projects in medicine and social science.

  • Collaborative research entails specific practical issues that may affect reproducibility when different versions of files are worked on by several collaborators.

  • Collaborative tools can be used to make research accessible to the public beyond publications, e.g. protocols, code, data.

Quiz on open science at CERN

REANA

CERN’s REANA can be used to

  • publish finished analysis results
  • submit parameterised computational workflows to run on remote compute clouds
  • reinterpret preserved analyses
  • run “active” analyses before they are published

Solution

F publish finished analysis results
T submit parameterised computational workflows to run on remote compute clouds
T reinterpret preserved analyses
T run “active” analyses before they are published

Software

Having experienced first-hand its potential to connect physicists around the globe, in 1993 CERN released the web software into the:

Solution

public domain

Levels of open data at CERN

The four main LHC experiments have started to periodically release their data in an open manner, and these data can be classified into four levels. Check the correct level descriptions.

  • The first level consists of the numerical data underlying publications.
  • The second level concerns datasets in a simplified format that are suitable for “lightweight” analyses in educational or similar contexts
  • The third level are the data being used for analysis by the researchers themselves, requiring specialised code and dedicated computing resources.
  • The fourth level is the raw data generated by the detectors.

Solution

F The first level consists of the numerical data underlying publications.
T The second level concerns datasets in a simplified format that are suitable for “lightweight” analyses in educational or similar contexts
T The third level are the data being used for analysis by the researchers themselves, requiring specialised code and dedicated computing resources.
T The fourth level is the raw data generated by the detectors.

 

2. Some tools for collaboration

Open Science Framework

Image Credit: https://thriv.virginia.edu/center-for-open-science-open-science-framework/

  • The framework is developed by the Center for Open Science (COS), a non-profit organisation in the USA with the mission to increase the openness, reproducibility and integrity of scientific research.

  • The main tool that they build and maintain is the Open Science Framework, called OSF, which is a free and open-source tool.

  • The design principle of the tool is to make open and reproducible research practices easy at every stage of the research lifecycle.

  • The framework encourages researchers to think systematically and early on about what material to share, both publicly and with collaborators, before the manuscript editing phase.

Introduction to the OSF

One of the best ways to learn about the OSF is through a video provided by COS. The video is long (although you may watch it at an increased speed). Make sure you learn about the following features of OSF:

You can find the video here.

Other tools: CRS primer

Other tools for collaboration have been summarized in the Primer “Digital collaboration” by the Center for Reproducible Science at the University of Zurich. The primer contains a few recommendations specific to the University of Zurich but is mostly applicable to anyone.

Quiz on the open science framework as a collaborative tool

Global unique identifiers

OSF distributes global unique identifiers

  • only at the project level
  • at the project, component and file level
  • each time you make changes to a project

Solution

F only at the project level
T at the project, component and file level
F each time you make changes to a project

Wiki syntax

Wikis on OSF can be written in a “what you see is what you get” way and using the syntax of:

Solution

Markdown

Version control in OSF

Binary file types such as Word files or PDFs are version controlled on OSF through

  • use of an online editor
  • adding version indicators to file names
  • recognition of file names in components

Solution

F use of an online editor
F adding version indicators to file names
T recognition of file names in components

 

3. What is version control and what is Git?

The purpose of Git is best explained with the cartoon below: it is a system that helps you avoid situations like the one shown. Such systems are called version control systems because they take care of versioning files without changing (and lengthening) file names. The aim here is not to teach you Git but to give you enough detail to decide whether or not you need to learn it. You will also learn a bit of terminology and get some links that should make getting started with Git easier.

Image Credit: “Piled Higher and Deeper” by Jorge Cham at https://www.phdcomics.com. (permission requested)

What is Git?

  • Git is:
    • a version control system, i.e., tracks changes incl. timestamps
    • the de facto standard
    • open source, developed by Linus Torvalds in 2005
  • Git runs on all major operating systems
  • Several IDEs (Integrated Development Environments) with Git support are available:
    • RStudio
    • Eclipse https://www.eclipse.org/

Git has a reputation for being complicated

Git originated in software development, so using it requires a certain level of computer skills. As a result, it has a reputation for being complicated. As with code-based analysis, there is indeed a learning curve in the beginning, but with just a bit of practice the advantages outweigh the initial investment.

Image Credit: Randall Munroe/xkcd at https://xkcd.com/1597/ licensed as CC BY-NC.

Why use it anyhow?

  • It provides a completely documented past
  • Collaborators have coordinated access to the same documents
  • It allows easy synchronization for local files (offline working)
  • Tools to resolve conflicts for text based files are available
  • and of course one can avoid file names like masterManuscript_v4_rf_0812_gh.doc
  • (further benefits on a code development level)

More terminology

Git is a decentralized version control system

Developers work directly with their own local repository, i.e. a folder on their computer. This is the local workspace in the graphic below.

With the command “git add” they add files or folders to the local index, which is similar to a registry in a library; this step is also called staging. They then “git commit” their staged changes to the local repository on their computer, creating a version that is kept in the system. Only with the “git push” command do they upload the changes to the remote server.

The next person working on the same repository will need to “git pull” the updated repository in order to access the changes.
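The add, commit, push and pull cycle described above can be tried out without any server. The following is a minimal sketch, assuming Git is installed; a bare repository in a temporary folder stands in for the remote server, and the collaborator names (alice, bob) and file names are hypothetical.

```shell
# Temporary playground; a bare repository stands in for the remote server (assumption).
demo=$(mktemp -d)
git init --quiet --bare "$demo/remote.git"

# First collaborator ("alice", hypothetical) clones the still empty remote.
git clone --quiet "$demo/remote.git" "$demo/alice" 2>/dev/null
git -C "$demo/alice" config user.name "Alice Example"
git -C "$demo/alice" config user.email "alice@example.org"

# Stage, commit and push a first file.
echo "Swiss weather station averages" > "$demo/alice/README.md"
git -C "$demo/alice" add README.md                  # staging: add to the local index
git -C "$demo/alice" commit --quiet -m "Add README" # new version in the local repository
git -C "$demo/alice" push --quiet origin HEAD       # upload to the remote

# Second collaborator ("bob", hypothetical) gets a copy of the repository.
git clone --quiet "$demo/remote.git" "$demo/bob"

# Alice adds another file and pushes again.
echo "# placeholder analysis script" > "$demo/alice/analysis.R"
git -C "$demo/alice" add analysis.R
git -C "$demo/alice" commit --quiet -m "Add analysis script"
git -C "$demo/alice" push --quiet origin HEAD

# Bob only sees the new file after pulling.
git -C "$demo/bob" pull --quiet --ff-only
git -C "$demo/bob" log --oneline
```

In real projects the remote usually lives on a hosting service rather than in a local folder, but the sequence of commands is the same.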

Image Credit: Illustration of the most important git commands by Eva Furrer, CC-BY, https://doi.org/10.5281/zenodo.7994551.

Installation

Open RStudio, go to the Terminal tab and type git --version to check whether you already have Git installed.

If you do not have it, go to https://git-scm.com/downloads and choose the download for your operating system.
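The check described above can also be scripted in any terminal. This is a small sketch; the variable name git_status is only illustrative.

```shell
# Report whether the git executable can be found on the PATH.
if command -v git >/dev/null 2>&1; then
  git_status="installed: $(git --version)"
else
  git_status="not installed"
fi
echo "$git_status"
```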

When you are ready to run Git locally on your computer you can start using it together with a remote service (see above).

Want to know more?

Quiz on Git

Advantages

Why is a version control system useful when working on analysis scripts, even if working alone?

  • Git allows you to review the history of your project.
  • If something breaks due to a change, you can fix the problem by reverting to a working version before the change.
  • Git retains local copies of repositories, resulting in fast operations.
  • Git automatically scans for all changes and hence tracks everything.

Solution

T Git allows you to review the history of your project.
T If something breaks due to a change, you can fix the problem by reverting to a working version before the change.
F Git retains local copies of repositories, resulting in fast operations.
F Git automatically scans for all changes and hence tracks everything.

 

Episode challenge

In-class task

First, we will add all participants to a common OSF project.

Task 1

We work on publicly available data from 13 weather stations in Switzerland: sunshine duration, precipitation, temperature and fresh snow (1931 – 2022), and ice days, frost days, summer days, heat days, tropical nights and precipitation days (1959 – 2022). We will collaboratively summarize the data into (approximately) 30-year averages (see below) per station for each of the 10 available characteristics. Create a corresponding CSV file with 14 columns: one column identifying the characteristic and time period, and one column per station. Upload it to an OSF project in which all participants are members. Distribute the work of calculating averages and entering them into the common CSV file among the participants.

Note: use the approximate 30-year periods 1931 – 1958, 1959 – 1988, 1989 – 2022
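One possible way to compute such period averages on the command line is sketched below for a single station. The input layout (a year column followed by one value column), the station name and all values are invented for illustration and are not taken from the real data set; the actual files may be organised differently.

```shell
# Hypothetical demo input: the values are invented, not real measurements.
workdir=$(mktemp -d)
cat > "$workdir/demo_temperature.csv" <<'EOF'
year,station_a
1931,8.0
1958,9.0
1959,9.5
1988,10.5
1989,11.0
2022,12.0
EOF

# Assign each year to one of the three periods and average per period.
period_means=$(awk -F, '
  NR == 1 { next }                      # skip the header row
  {
    if ($1 <= 1958)      { p = "1931-1958" }
    else if ($1 <= 1988) { p = "1959-1988" }
    else                 { p = "1989-2022" }
    sum[p] += $2                        # running sum per period
    n[p]++                              # number of years per period
  }
  END {
    split("1931-1958 1959-1988 1989-2022", order, " ")
    for (i = 1; i <= 3; i++) {
      if (n[order[i]] > 0) {
        printf "%s,%.2f\n", order[i], sum[order[i]] / n[order[i]]
      }
    }
  }' "$workdir/demo_temperature.csv")
echo "$period_means"
```

The same grouping logic carries over to R or a spreadsheet; what matters for the shared CSV file is that every participant uses identical period boundaries.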

Task 2

Add a Readme file to the project that describes its content. All participants should agree on the wording and correct it if necessary. Then discuss the following questions and add the group’s thoughts to the Readme:

  1. What are the main difficulties when collaboratively editing the same file(s)?
  2. What are the advantages of using text-based files such as .R, .csv and .md files?

Key Points

  • Collaboration is fundamental for science, especially Open Science

  • Learning to use tools for collaboration is effective and helps to avoid problems