Collaboration drives Open Science and is a challenge for reproducibility
Overview
Teaching: 45 min
Exercises: 60 min
Questions
Why is collaborative work especially important for Open and Reproducible Science?
Which tools facilitate collaborative work?
Objectives
Learn about Open Science at CERN
Get to know OSF as one tool
Learn about version control systems
1. Use case: Open Science at CERN
The World Wide Web was invented at CERN, and its leadership was visionary in making sure that the technology would be licensed under an open-source framework “with the explicit goal of preventing private companies from turning it into proprietary software”.
CERN and the particle physics community are trailblazers of the Open Science movement.
We look at CERN’s approach to Open Science by reading three articles that appeared in the CERN Courier in 2019. Excerpts are given below; use the links to read the full articles.
Open science: a vision for collaborative, reproducible and reusable research
Image Credit: T.Simko.
“True open science demands more than simply making data available: it needs to concern itself with providing information on how to repeat or verify an analysis performed over given datasets, producing results that can be reused by others for comparison, confirmation or simply for deeper understanding and inspiration. This requires runnable examples of how the research was performed, accompanied by software, documentation, runnable scripts, notebooks, workflows and compute environments. It is often too late to try to document research in such detail once it has been published.”
S Dallmeier-Tiessen and T Simko https://cerncourier.com/a/open-science-a-vision-for-collaborative-reproducible-and-reusable-research/
Inspired by software
Image Credit: S Kulkarni.
“The underlying ideal is open collaboration: peers freely, collectively and publicly build software solutions. A second ideal is recognition, in which credit for the contributions made by individuals and organisations worldwide is openly acknowledged. A third ideal concerns rights, specifically the so-called four freedoms granted to users: to use the software for any purpose; to study the source code to understand how it works; to share and redistribute the software; and to improve the software and share the improvements with the community. Users and developers therefore contribute to a virtuous circle in which software is continuously improved and shared towards a common good, minimising vendor lock-in for users.”
G Tenaglia and T Smith https://cerncourier.com/a/inspired-by-software/
Preserving the legacy of particle physics
Image Credit: https://cerncourier.com/a/preserving-the-legacy-of-particle-physics/ with original CC-BY in Phys. Lett. B 716 30.
“CMS aims to release half of each year’s level-three data three years after data taking, and 100% of the data within a ten-year window. By guaranteeing that people outside CMS can use these data, says Lassila-Perini, the collaboration can ensure that the knowledge of how to analyse the data is not lost, while allowing people outside CMS to look for things the collaboration might not have time for. To allow external re-use of the data, CMS released appropriate metadata as well as analysis examples.”
A Rao https://cerncourier.com/a/preserving-the-legacy-of-particle-physics/.
More information can be found in the article “Open is not enough” by X Chen et al. https://www.nature.com/articles/s41567-018-0342-2.
Open Science is about collaboration
Collaborative research is becoming more and more important because complex challenges require a diverse team science approach, e.g. in particle physics, drug development, or big data projects in medicine or social science.
Collaborative research entails specific practical issues that may affect reproducibility when different versions of files are worked on by several collaborators.
Collaborative tools can be used to make research accessible to the public beyond publications, e.g. protocols, code, data.
Quiz on open science at CERN
Reana
CERN’s REANA can be used to
- publish finished analysis results
- submit parameterised computational workflows to run on remote compute clouds
- reinterpret preserved analyses
- run “active” analyses before they are published
Solution
F publish finished analysis results
T submit parameterised computational workflows to run on remote compute clouds
T reinterpret preserved analyses
T run “active” analyses before they are published
Software
Having experienced first-hand its potential to connect physicists around the globe, in 1993 CERN released the web software into the:
Solution
public domain
Levels of open data at CERN
The four main LHC experiments have started to periodically release their data in an open manner, and these data can be classified into four levels. Check the correct level descriptions.
- The first level consists of the numerical data underlying publications.
- The second level concerns datasets in a simplified format that are suitable for “lightweight” analyses in educational or similar contexts
- The third level are the data being used for analysis by the researchers themselves, requiring specialised code and dedicated computing resources.
- The fourth level is the raw data generated by the detectors.
Solution
F The first level consists of the numerical data underlying publications.
T The second level concerns datasets in a simplified format that are suitable for “lightweight” analyses in educational or similar contexts
T The third level are the data being used for analysis by the researchers themselves, requiring specialised code and dedicated computing resources.
T The fourth level is the raw data generated by the detectors.
2. Some tools for collaboration
Open Science Framework
Image Credit: https://thriv.virginia.edu/center-for-open-science-open-science-framework/
The framework is developed by the Center for Open Science (COS), a non-profit organisation in the USA with the mission to increase the openness, reproducibility and integrity of scientific research.
The main tool that they build and maintain is the Open Science Framework (OSF), which is free and open source.
The design principle of the tool is to make it easy to follow open and reproducible research practices at every stage of the research lifecycle.
Through the framework, researchers are encouraged to think systematically and early on about what material to share, both publicly and with collaborators before the manuscript editing phase.
Introduction to the OSF
One of the best ways to learn about the OSF is through a video provided by COS. The video is long (although you may watch it at an increased speed). Make sure you learn about the following features of OSF:
- Dashboard
- Create a project
- Give your project a structure
- How to add contributors and their roles and permissions
- Global unique identifiers
- Wiki and how to edit it
- Adding files and moving them within a project
- Version control that is embedded in OSF
- OSF and DOI
You can find the video here.
Other tools: CRS primer
Other tools for collaboration are summarized in the primer “Digital collaboration” by the Center for Reproducible Science at the University of Zurich. The primer contains a few recommendations specific to the University of Zurich, but most of it applies to anyone.
Quiz on the Open Science Framework as a collaborative tool
Global unique identifiers
OSF distributes global unique identifiers
- only at the project level
- at the project, component and file level
- each time you make changes to a project
Solution
F only at the project level
T at the project, component and file level
F each time you make changes to a project
Wiki syntax
Wikis on OSF can be written in a “what you see is what you get” way and using the syntax of:
Solution
Markdown
Version control in OSF
Binary file types such as Word files or PDFs are version controlled on OSF through
- use of an online editor
- adding version indicators to file names
- recognition of file names in components
Solution
F use of an online editor
F adding version indicators to file names
T recognition of file names in components
3. What is version control and what is Git?
The purpose of Git is best explained with the cartoon below: Git is a system that helps to avoid situations like the one shown. Such systems are called version control systems because they take care of the versioning of files without changing (and lengthening) file names. The purpose here is not to teach you Git but to give you enough detail to decide whether you need to learn it. You will also learn some terminology and get some links that should make it easier to start with Git.
Image Credit: “Piled Higher and Deeper” by Jorge Cham at https://www.phdcomics.com. (permission requested)
What is Git?
- Git is:
  - a version control system, i.e. it tracks changes including timestamps
  - the de facto standard
  - open source, developed by Linus Torvalds in 2005
- Git runs on all major operating systems
- Several IDEs (Integrated Development Environments) integrate Git, for example:
  - RStudio
  - Eclipse https://www.eclipse.org/
Git has a reputation for being complicated
Git originated in software development, so using it requires a certain level of computer skills. As a result, it has a reputation for being complicated. As with code-based analysis, there is a learning curve in the beginning, but with just a bit of practice the advantages outweigh the initial investment.
Image Credit: Randall Munroe/xkcd at https://xkcd.com/1597/ licensed as CC BY-NC.
Why use it anyhow?
- It provides a completely documented past
- Collaborators have coordinated access to the same documents
- It allows easy synchronization of local files (offline working)
- Tools to resolve conflicts for text based files are available
- and of course one can avoid file names like
masterManuscript_v4_rf_0812_gh.doc
- (further benefits on a code development level)
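The first point, a completely documented past, can be tried out with a minimal sketch like the one below. It assumes Git is installed and works in a throwaway directory. The file name manuscript.txt, the author identity and the commit messages are made up for illustration.

```shell
set -e                                   # stop at the first error

# Create a throwaway repository so nothing on your machine is touched.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example Author"    # hypothetical identity, stored in this repository only
git config user.email "author@example.org"

# Version 1 of the file.
echo "first draft" > manuscript.txt
git add manuscript.txt
git commit --quiet -m "Write first draft"

# Version 2 overwrites the same file; the file name does not change.
echo "second draft with a mistake" > manuscript.txt
git commit --quiet -am "Revise draft"

# The documented past: one line per recorded version.
git log --oneline

# Undo the last change by recording a new commit that reverses it.
git revert --no-edit HEAD > /dev/null
cat manuscript.txt
```

The history keeps both drafts and the revert itself, so nothing is lost even though only one file name is ever used.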
More terminology
- A Git repository is a collection of files, typically organized as a project, managed with a version control system.
- GitLab is a web-based tool that provides a Git repository manager, see e.g. the commercial https://www.gitlab.com. At the University of Zurich, https://gitlab.uzh.ch/ and https://git.math.uzh.ch are available; your institution may offer an instance as well.
- GitHub is a commercial provider of Internet hosting for software development and version control using Git: https://www.github.com
- The remote version of a repository can be “cloned” to the computers of all collaborators of the project.
Git is a decentralized version control system
Developers work directly with their own local repository, i.e. a folder on their computer: the local workspace in the graphic below.
By using the command “git add” they add files or folders to the local index, which is similar to a registry in a library. This step is also called staging. Then they “git commit” their staged changes to the local repository on their computer creating a version that will be kept in the system. Only with the “git push” command do they upload the changes to the remote server.
The next person working on the same repository will need to “git pull” the updated repository in order to access the changes.
Image Credit: Illustration of the most important git commands by Eva Furrer, CC-BY, https://doi.org/10.5281/zenodo.7994551.
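The add, commit, push and pull cycle described above can be sketched as follows. This is a minimal illustration, assuming Git is installed. A bare repository in a temporary directory stands in for the remote server, and the collaborator names, the file name averages.csv and its placeholder content are hypothetical.

```shell
set -e                                        # stop at the first error

# Work in a throwaway directory.
demo=$(mktemp -d)
cd "$demo"

# A bare repository plays the role of the remote server (e.g. GitLab or GitHub).
git init --quiet --bare remote.git

# Collaborator A clones the (still empty) remote; Git warns that it is empty.
git clone --quiet remote.git alice
cd alice
git config user.name "Alice Example"          # hypothetical identity
git config user.email "alice@example.org"

echo "characteristic_period,station_1" > averages.csv
git add averages.csv                          # stage: register the file in the local index
git commit --quiet -m "Add table header"      # commit: record a version in the local repository
git push --quiet origin HEAD                  # push: upload the commit to the remote
cd ..

# Collaborator B clones the remote and receives the first version.
git clone --quiet remote.git bob

# Collaborator A adds a placeholder row and pushes again.
cd alice
echo "temperature_1931-1958,NA" >> averages.csv
git add averages.csv
git commit --quiet -m "Add placeholder row"
git push --quiet origin HEAD
cd ..

# Collaborator B only sees the new row after pulling.
cd bob
git pull --quiet --ff-only
cat averages.csv
```

Until the final pull, the copy in bob still contains only the header line, which is why collaborators need to pull before they start working.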
Installation
Open RStudio, go to the Terminal tab and type “git --version” to check if you already have Git installed.
If you do not have it, go to https://git-scm.com/downloads and choose the download for your operating system.
When you are ready to run Git locally on your computer you can start using it together with a remote service (see above).
Want to know more?
Using Git with RStudio: http://r-bio.github.io/intro-git-rstudio/ and https://jennybc.github.io/2014-05-12-ubc/ubc-r/session03_git.html
Git book: https://git-scm.com/book/en/v2/
Tutorial: https://doi.org/10.1177/2515245918754826
Point-by-point instructions to connect with ssh https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh
Quiz on Git
Advantages
Why is a version control system useful when working on analysis scripts, even when working alone?
- Git allows you to review the history of your project.
- If something breaks due to a change, you can fix the problem by reverting to a working version before the change.
- Git retains local copies of repositories, resulting in fast operations.
- Git automatically scans for all changes and hence tracks everything.
Solution
T Git allows you to review the history of your project.
T If something breaks due to a change, you can fix the problem by reverting to a working version before the change.
F Git retains local copies of repositories, resulting in fast operations.
F Git automatically scans for all changes and hence tracks everything.
Episode challenge
In-class task
First, we will add all participants to a common OSF project.
Task 1
We work on publicly available data from 13 weather stations in Switzerland: sunshine duration, precipitation, temperature and fresh snow (1931 – 2022), and ice days, frost days, summer days, heat days, tropical nights and precipitation days (1959 – 2022). We will collaboratively summarize the data into approximately 30-year averages (see below) per station for each of the 10 available characteristics. Create a corresponding csv file with 14 columns: one column identifying the characteristic and time period, and one column per station. Upload it to an OSF project in which all participants are members. Distribute the work of calculating the averages and entering them into the common csv file among the participants.
Note: use the approximate 30-year periods 1931 – 1958, 1959 – 1988, 1989 – 2022.
Task 2
Add a Readme file to the project describing its content; all participants should agree on the wording and correct it if necessary. Discuss the following questions and add the group’s thoughts to the Readme: 1. What are the main difficulties when collaboratively editing the same file(s)? 2. What are the advantages of using text-based files such as .R, .csv and .md files?
Key Points
Collaboration is fundamental for science, especially Open Science
Learning to use tools for collaboration is effective and helps to avoid problems