UZH CRS Workshop: Design and Analysis of Replication Studies

Program Info

22 Jan, 2020

12 : 45 - 13 : 00

Welcome

13 : 00 - 16 : 00

Tutorial on the R package ReplicationSuccess: Design of Replication Experiments

By Charlotte Micheloud, Samuel Pawel, Leonhard Held

Statistical power is of central importance in assessing the reliability of science. Appropriate design of a replication study is key to tackling the replication crisis as many such studies are currently severely under-powered. The workshop will describe standard and more advanced methods to calculate the required sample size of a replication study taking into account the results of an original discovery study. Participants will learn how to use the R-package ReplicationSuccess. Prerequisites include basic R-knowledge and familiarity with concepts of statistical inference.
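As a rough illustration of the standard calculation mentioned in the abstract, the following Python sketch computes the relative sample size of a replication study from the z-value of the original study. It assumes normally distributed effect estimates and treats the original estimate as the true effect (the textbook "conditional power" approach). The function names are hypothetical and are not the API of the ReplicationSuccess package.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def relative_sample_size(z_original, power=0.8, alpha=0.025):
    """Relative sample size c = n_replication / n_original.

    Assumption: the true effect equals the original estimate, so the
    replication z-value is N(z_original * sqrt(c), 1). alpha is one-sided.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ((z_alpha + z_beta) / z_original) ** 2


def conditional_power(z_original, c, alpha=0.025):
    """Power of the replication for a given relative sample size c."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha)
    return _STD_NORMAL.cdf(z_original * c ** 0.5 - z_alpha)


if __name__ == "__main__":
    for z in (2.0, 2.5, 3.0):
        c = relative_sample_size(z)
        print(f"z_o = {z}: c = {c:.2f}, power = {conditional_power(z, c):.2f}")
```

Because this calculation ignores the uncertainty of the original estimate, it tends to be optimistic for barely significant original results; accounting for that uncertainty is one reason more advanced methods exist.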

18 : 00 -

Get-together

The Alehouse 2017, Universitätstrasse 23, 8006 Zürich

23 Jan, 2020

08 : 30 - 09 : 00

Registration

09 : 00 - 09 : 05

Welcome

By Leonhard Held, CRS Director

09 : 05 - 09 : 50

What should we "expect" from reproducibility?

By Stephen Senn, Consultant Statistician, Edinburgh

Is there really a reproducibility crisis and if so are P-values to blame? Choose any statistic you like and carry out two identical independent studies and report this statistic for each. In advance of collecting any data, you ought to expect that it is just as likely that statistic 1 will be smaller than statistic 2 as vice versa. Once you have seen statistic 1, things are not so simple, but if they are not so simple, it is because you have other information in some form. However, it is at least instructive that you need to be careful in jumping to conclusions about what to expect from reproducibility. Furthermore, the forecasts of good Bayesians ought to obey a Martingale property. On average you should be in the future where you are now but, of course, your inferential random walk may lead to some peregrination before it homes in on “the truth”. But you certainly can’t generally expect that a probability will get smaller as you continue. P-values, like other statistics, are a position, not a movement. Although often claimed, there is no such thing as a trend towards significance. Using these and other philosophical considerations I shall try to establish what it is we want from reproducibility. I shall conclude that we statisticians should probably be paying more attention to checking that standard errors are being calculated appropriately and rather less to inferential framework.
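The symmetry argument for two identical independent studies can be checked with a small simulation. The sketch below is only an illustration under assumed settings (normal data, the sample mean as the statistic, arbitrary effect and sample size); it is not taken from the talk.

```python
import random


def fraction_first_smaller(n_pairs=5000, n_per_study=20, effect=0.3, seed=1):
    """Share of study pairs in which statistic 1 is smaller than statistic 2.

    Both studies are generated by the same mechanism, so before any data
    are seen the two orderings are equally likely.
    """
    rng = random.Random(seed)

    def study_mean():
        # The statistic here is the sample mean of one simulated study.
        return sum(rng.gauss(effect, 1.0) for _ in range(n_per_study)) / n_per_study

    smaller = sum(study_mean() < study_mean() for _ in range(n_pairs))
    return smaller / n_pairs


if __name__ == "__main__":
    print(f"fraction with statistic 1 < statistic 2: {fraction_first_smaller():.3f}")
```

The fraction should be close to one half whatever the assumed effect is, since the argument relies only on the two studies being exchangeable.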

09 : 50 - 10 : 20

When is a replication successful? No p-values, please.

By Werner Stahel, Seminar für Statistik, Swiss Federal Institute of Technology, Zürich

When are the results of a replication study similar enough to those of the original to say that the original claim was confirmed or that the replication was successful? Often, this decision is based on statistical significance: The original found an effect to be significant, and the replication is called successful if it has again produced a significant estimate of the effect. It is desirable to devise measures of the degree of success, avoiding a simple dichotomy.

It is now widely propagated that null hypothesis testing be replaced by estimation with confidence intervals. In fact, science is not interested in tiny effects, and therefore, a scientific question asks if an effect is relevantly different from zero, and a threshold for "relevance" is needed. This has consequences for the interpretation of both an "original" study and its replication. Nevertheless, the literature still focusses on p-values, and even in this workshop, we hear about quite sophisticated methods using them.

The simplest measure of (dis-) similarity between an original and a replication study is the difference between effect sizes. Some care regarding standardization is needed to make sure that it is a parameter of the model that can be estimated. Based on an estimate of the similarity and of the effect in the replication, I propose a classification of results that characterizes the "success of replication".

In random effects meta-analysis, an important characteristic is the ratio of between-study and within-study variation. Surprisingly, the usual definition fails to be a parameter of the model. When an "original" study should be substantiated by replication, we should expect both a selective reporting bias in the original and a variability component between different potential replication studies. Since these two components cannot be separated on the basis of a single replication, more than one replication is usually needed, and a strategy for a "replication process" is required.

Break

10 : 20 - 10 : 50

Coffee break

10 : 50 - 11 : 35

Shrinkage for reproducible research

By E.W. van Zwet, Leiden University Medical Center

The pressure to publish or perish undoubtedly leads to the publication of much poor research. However, the fact that significant effects tend to be smaller and less significant upon attempts to reproduce them is also due to selection bias. I will discuss this "winner's curse" in some detail and show that it is largest in low-powered studies. To correct for it, it is necessary to apply some shrinkage. To determine the appropriate amount of shrinkage, I propose to embed the study of interest into a large area of research, and then to estimate the distribution of effect sizes across that area of research. Using this estimated distribution as a prior, Bayes' rule provides the amount of shrinkage that is well-calibrated to the chosen area of research. I demonstrate the approach with data from the OSC project on reproducibility in psychology, and with data from 100 phase 3 clinical trials.
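The winner's curse and the effect of shrinkage can be illustrated with a simulation. The sketch below assumes a normal distribution of true effects with known standard deviation `tau` and a common standard error `se`; both values are made up, and the normal prior stands in for the estimated effect-size distribution described in the abstract. It compares the mean squared error of the raw and the shrunken estimate among significant studies only.

```python
import random


def winners_curse(tau=0.5, se=1.0, n_studies=50000, seed=7):
    """Mean squared error of raw vs. shrunken estimates among significant studies.

    Assumptions: true effects ~ N(0, tau^2), estimates ~ N(effect, se^2),
    two-sided significance at |z| > 1.96. With this prior the posterior
    mean is the estimate multiplied by tau^2 / (tau^2 + se^2).
    """
    rng = random.Random(seed)
    shrinkage = tau ** 2 / (tau ** 2 + se ** 2)
    raw_sq_err = shrunk_sq_err = 0.0
    n_significant = 0
    for _ in range(n_studies):
        effect = rng.gauss(0.0, tau)
        estimate = rng.gauss(effect, se)
        if abs(estimate) / se > 1.96:  # keep only "significant" studies
            n_significant += 1
            raw_sq_err += (estimate - effect) ** 2
            shrunk_sq_err += (shrinkage * estimate - effect) ** 2
    return raw_sq_err / n_significant, shrunk_sq_err / n_significant, n_significant


if __name__ == "__main__":
    raw, shrunk, n_sig = winners_curse()
    print(f"{n_sig} significant studies: MSE raw = {raw:.2f}, MSE shrunken = {shrunk:.2f}")
```

Lowering `se` relative to `tau` (higher power) moves the shrinkage factor towards one and reduces the gap between the two errors, in line with the claim that the curse is largest in low-powered studies.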

11 : 35 - 12 : 05

The role of replication studies in navigating epistemic landscapes

By Filip Melinščak, University of Zurich

There has been a great deal of debate on the role of replication studies in improving the reliability of science. Most of these debates have revolved around research that can be conceptualized as either testing null hypotheses or doing comparisons between multiple theoretical models. For these types of research, formal models describing the scientific process and illuminating the role of reproducibility within it have recently been proposed (McElreath & Smaldino 2015, Devezer et al. 2019). However, many domains of science - especially applied ones - investigate rather different types of questions. For example: “how to design an industrial process for maximal efficiency?”, “what is the optimal treatment plan for a given disease?”, “what educational policy would maximize student outcomes?”. We can fruitfully conceptualize such research programs as trying to empirically find optimal solutions to various problems, which has been likened to exploring “epistemic landscapes” in search of “peaks” (Weisberg & Muldoon, 2009). Here we investigate the role of replication studies - and of experimental design generally - in such optimization-centric research programs. Using agent-based modeling, we evaluate how efficiently different research strategies can find peaks in multidimensional epistemic landscapes.

Break

12 : 05 - 13 : 30

Flying lunch

13 : 30 - 14 : 00

Replicability as generalizability: Revisiting external validity with specification curve analysis

By Johannes Ullrich, University of Zurich

Healthy scientific discourse involves scientists questioning one another’s findings. The question “Why should I believe you?” is essentially a question about internal validity. As originally defined by psychologist Don Campbell, internal validity deals with the question, “did in fact the experimental stimulus make some significant difference in this specific instance?” (Campbell, 1957). In recent years, the revival of good scientific practices such as replication and hypothesizing before the results are known has arguably improved internal validity. However, scientists should not only be wondering if a given effect exists at all, but also to what extent the effect can be generalized across populations, settings, and variables. In other words, they should be concerned about external validity as well. We reinterpret the old terms of internal and external validity by drawing on the notions of the ‘garden of forking paths’ and ‘researcher-degrees-of-freedom’, i.e., the fact that for one and the same dataset there often exist hundreds or thousands of different ways of testing the same hypothesis. We illustrate the use of specification curve analysis (Simonsohn, Simmons, & Nelson, 2015) for assessing internal and external validity in a common framework. Specification curve analysis in the service of validity testing involves estimating a test statistic (e.g., effect size) conditional on variations in model specification (e.g., analytic decisions, populations, settings, and variables). We conclude with a discussion of the value of specification curve analysis for the evolution of a line of research.

14 : 00 - 14 : 30

Experimental replications in animal trials

By Florian Frommlet, Medical University Vienna

The recent discussion on reproducibility of scientific results is particularly relevant for preclinical research with animal models. Within that research community there exists some tradition to repeat an experiment three times to demonstrate replicability. However, there are hardly any guidelines about how to plan for such an experimental design and also how to report the results obtained. This article provides a thorough statistical analysis of the 'three-times' rule as it is currently often applied in practice and gives some recommendations on how to improve the study design and statistical analysis of replicated animal experiments.

14 : 30 - 15 : 00

Identifying boundary conditions in confirmatory preclinical animal studies to increase value and foster translation

By Meggie Danziger, QUEST Center for Transforming Biomedical Research at the Berlin Institute of Health

Background: Low statistical power in preclinical animal experiments has been repeatedly pointed out as a roadblock to successful replication and translation. If only a small number of tested interventions is effective (i.e., low pre-study odds), researchers should increase the power of their experiments to detect those true effects. This, however, contradicts ethical and budget constraints. To increase the scientific value of preclinical experiments under these constraints, it is necessary to devise strategies that result in maximally efficient confirmatory studies.

Methods: To this end, we explore different approaches to perform preclinical animal experiments via simulations. We model the preclinical research trajectory from the exploratory stage to the results of a within-lab confirmatory study. Critically, we employ different decision criteria that indicate when one should move from the exploratory stage to the confirmatory stage as well as various approaches to determine the sample size for a confirmatory study (smallest effect size of interest (SESOI), safeguard, and standard power analysis). At the confirmatory stage, different experimental designs (fixed-N and sequential with and without futility criterion) and types of analyses (two sample t-test and Bayes factor) are explored. The different trajectories of the research chain are compared regarding the number of experiments proceeding to the confirmatory stage, number of animals needed, positive predictive value (PPV), and statistical power.
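The link between pre-study odds, power, and positive predictive value can be made concrete with the standard closed-form relationship for a single study without bias. The sketch below is only that textbook formula, not the simulation model described in the abstract, and the function name and example values are hypothetical.

```python
def positive_predictive_value(pre_study_odds, power, alpha=0.05):
    """Share of significant findings that reflect a true effect.

    pre_study_odds: ratio of true to null effects among tested interventions.
    Assumes a single study per intervention and no bias.
    """
    true_positives = power * pre_study_odds  # proportional to true effects detected
    false_positives = alpha                  # proportional to nulls declared significant
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for power in (0.2, 0.5, 0.8):
        ppv = positive_predictive_value(pre_study_odds=0.1, power=power)
        print(f"power = {power}: PPV = {ppv:.2f}")
```

With low pre-study odds the PPV stays modest even at high power, which is why the choice of decision criteria before the confirmatory stage matters as much as the sample size.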

Break

15 : 00 - 15 : 30

Coffee break

15 : 30 - 16 : 15

Evaluating statistical evidence in biomedical research, meta-studies, and radical randomization

By Don van Ravenzwaaij, University of Groningen

For the endorsement of new medications, the US Food and Drug Administration requires replication of the main effect in randomized clinical trials. Typically, this replication comes down to observing two trials, each with a p-value below 0.05. In the first part of this talk, I discuss work from a simulation study (van Ravenzwaaij & Ioannidis, 2017) that shows what it means to have exactly two trials with a p-value below 0.05 in terms of the actual strength of evidence quantified by Bayes factors. Our results show that different cases where two trials have a p-value below 0.05 have wildly differing Bayes factors. In a non-trivial number of cases, evidence actually points to the null hypothesis. We recommend use of Bayes factors as a routine tool to assess endorsement of new medications, because Bayes factors consistently quantify strength of evidence. In the second part of this talk, I will propose a different way to go about replication: the use of meta-studies with radical randomization (Baribault et al., 2018).

16 : 15 - 17 : 00

The harmonic mean chi-squared test to substantiate scientific findings

By Leonhard Held, University of Zurich

A new significance test is proposed to substantiate scientific findings from multiple primary studies investigating the same research hypothesis. The test statistic is based on the harmonic mean of the squared study-specific test statistics and can also include weights. Appropriate scaling ensures that, for any number of studies, the null distribution is a chi-squared distribution with one degree of freedom. The null distribution can be used to compute a one-sided p-value or to ensure Type-I error control at a pre-specified level. Further properties are discussed and a comparison with FDA's two-trials rule for drug approval is made, as well as with alternative research synthesis methods. As a by-product, the approach provides a calibration of the sceptical p-value recently proposed for the analysis of replication studies.
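A minimal sketch of the unweighted statistic described above: the harmonic mean of the squared z-values, scaled by the number of studies, which gives n^2 divided by the sum of the inverse squared z-values. The code also evaluates the upper tail of the chi-squared distribution with one degree of freedom. The weighted version and the one-sided p-value, which additionally requires all effects to point in the same direction, are not covered here; function names are hypothetical.

```python
import math


def harmonic_mean_chisq(z_values):
    """Unweighted statistic: n^2 / sum(1 / z_i^2).

    Equals n times the harmonic mean of the squared z-values.
    All z-values must be non-zero.
    """
    n = len(z_values)
    return n ** 2 / sum(1.0 / z ** 2 for z in z_values)


def chisq1_upper_tail(x):
    """P(X >= x) for a chi-squared variable X with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


if __name__ == "__main__":
    z = [2.2, 2.5, 1.9]  # made-up z-values from three studies
    stat = harmonic_mean_chisq(z)
    print(f"statistic = {stat:.2f}, upper tail = {chisq1_upper_tail(stat):.4f}")
```

Because the harmonic mean is dominated by the smallest squared z-value, a single unconvincing study pulls the statistic down strongly, which is the property that makes the test suitable for requiring support from every study.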

Dinner

19 : 00 - 22 : 00

Conference dinner

z. Alten Löwen, Universitätstrasse 111, 8006 Zürich

24 Jan, 2020

09 : 00 - 09 : 45

The replication Bayes factor and beyond

By E.J. Wagenmakers, University of Amsterdam

In this presentation I outline Bayesian answers to statistical questions surrounding replication success. The key object of interest is the posterior distribution for effect size based on data from an original study. The predictive performance of this posterior distribution can then be examined in light of data from a replication study. Specifically, the "replication Bayes factor" compares the predictive performance of the posterior distribution (which quantifies the opinion of an idealized proponent after seeing data from the original study) to that of the point null hypothesis (which quantifies the opinion of a hardened skeptic). However, we may also compare the predictive performance of the posterior distribution to that of the initial prior distribution (which quantifies the opinion of an unaware proponent who does not know the original study). Finally, the predictive performance of the posterior distribution may also be compared to that of alternative distributions that have a different mean but contain the same amount of information. Together, these methods allow a comprehensive and coherent assessment of the issues that surround the overly general question "did it replicate?".
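The comparison between the proponent and the skeptic can be sketched under a simple normal approximation. The code below assumes that both studies report an effect estimate with a known standard error and that the proponent's posterior after the original study is normal, centred at the original estimate (as it would be under a flat initial prior). This is a simplification for illustration, not the speaker's implementation, and the function name is hypothetical.

```python
from statistics import NormalDist


def replication_bf_normal(est_orig, se_orig, est_rep, se_rep):
    """Bayes factor of the proponent's posterior vs. the point null.

    Proponent: effect ~ N(est_orig, se_orig^2), so the replication estimate
    is predicted as N(est_orig, se_orig^2 + se_rep^2).
    Skeptic: effect = 0, so the replication estimate is N(0, se_rep^2).
    Values above 1 favour the proponent.
    """
    predictive_sd = (se_orig ** 2 + se_rep ** 2) ** 0.5
    proponent = NormalDist(mu=est_orig, sigma=predictive_sd).pdf(est_rep)
    skeptic = NormalDist(mu=0.0, sigma=se_rep).pdf(est_rep)
    return proponent / skeptic


if __name__ == "__main__":
    # Made-up numbers: original estimate 0.5 (SE 0.2), two possible replications.
    print(replication_bf_normal(0.5, 0.2, est_rep=0.45, se_rep=0.15))
    print(replication_bf_normal(0.5, 0.2, est_rep=0.05, se_rep=0.15))
```

Replacing the point null by another predictive distribution, for example one based on the initial prior, gives the other comparisons mentioned in the abstract with the same ratio-of-densities structure.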

09 : 45 - 10 : 15

Sceptical Bayes factor priors for the analysis of replication studies

By Guido Consonni, Università Cattolica del Sacro Cuore, Milan

Replication studies have been recently analyzed using a reverse-Bayes approach. In particular, Held (2020; J Roy Stat Soc A) proposes the sceptical p-value. Consonni, Held and Pawel approach the problem using the Bayes factor (BF). Here we present the sceptical Bayes factor (SBF) prior; this prior is meant to challenge the significant findings of the original study through the BF of the point null versus the alternative. Once replication data are made available, replication success can be assessed in terms of compatibility of the SBF prior with the data based on the notion of prior-data-conflict. Another way to evaluate replication success is to test the null versus a variety of alternative hypotheses specified by distinct priors, such as: the SBF prior, the optimistic prior and an objective (power expected posterior) prior. This is ongoing work with Leo Held and Samuel Pawel (University of Zurich).

Break

10 : 15 - 10 : 45

Coffee break

10 : 45 - 11 : 30

Predicting scientific results using surveys and prediction markets

By Anna Dreber, Stockholm School of Economics

Is there some wisdom of crowds in terms of predicting the outcomes of replication results? In a number of projects we have asked researchers to predict scientific results. In this talk I will discuss some recent work on prediction markets and forecasting surveys in psychology, economics and other fields.

11 : 30 - 12 : 00

Probabilistic forecasting of replication studies

By Samuel Pawel, University of Zurich

Throughout the last decade, the so-called replication crisis has stimulated many researchers to conduct large-scale replication projects. With data from four of these projects, we computed probabilistic forecasts of the replication outcomes, which we then evaluated regarding discrimination, calibration and sharpness. A novel model, which can take into account both inflation and heterogeneity of effects, was used and predicted the effect estimate of the replication study with good performance in two of the four data sets. In the other two data sets, predictive performance was still substantially improved compared to the naive model which does not consider inflation and heterogeneity of effects. The results suggest that many of the estimates from the original studies were too optimistic, possibly caused by publication bias or questionable research practices, and also that some degree of heterogeneity should be expected. Moreover, the results indicate that the use of statistical significance as the only criterion for replication success may be questionable, since from a predictive viewpoint, non-significant replication results are often compatible with significant results from the original study.

Lunch

12 : 00 - 13 : 30

Flying lunch

13 : 30 - 14 : 00

A novel approach to meta-analysis testing under heterogeneity

By Judith ter Schure, PhD student, Centrum Wiskunde & Informatica Amsterdam

Scientific knowledge accumulates and therefore always has a (partly) sequential nature. As a result, the exchangeability assumption in conventional meta-analysis cannot be met if the existence of a replication (or, more generally, of later studies in a series) depends on earlier results. Such dependencies arise at the study level but also at the meta-analysis level, if new studies are informed by a systematic review of existing results in order to reduce research waste. Fortunately, study series with such dependencies can be meta-analyzed with Safe Tests. These tests preserve type I error control, even if the analysis is updated after each new study. Moreover, they introduce a novel approach to handling heterogeneity, a bottleneck in sequential meta-analysis. This strength of Safe Tests for composite null hypotheses lies in controlling type I errors over the entire set of null distributions by specifying the test statistic for a worst-case prior on the null. If for each study such a (study-specific) test statistic is provided, the combined test controls type I error even if each study is generated by a different null distribution. These properties are optimized in so-called GROW Safe Tests. Hence, they optimize the ability to reject the null hypothesis and make intermediate decisions in a growing series, without the need to model heterogeneity.

14 : 00 - 14 : 45

Efficient designs under uncertainty: Guarantee compelling evidence with sequential Bayes factor designs

By Felix Schönbrodt, Ludwig-Maximilians-Universität München

Unplanned optional stopping rules have been criticized for inflating Type I error rates under the null hypothesis significance testing (NHST) paradigm. Despite these criticisms, this research practice is not uncommon, probably because it appeals to researchers’ intuition to collect more data to push an indecisive result into a decisive region. Optionally increasing the sample size is one of the most common "questionable research practices" (John, Loewenstein, & Prelec, 2012). In my talk, I will present the "Sequential Bayes Factor" (SBF) design, which allows unlimited multiple testing for the presence or absence of an effect, even after each participant. Sampling is stopped as soon as a pre-defined evidential threshold for H1 support or for H0 support is exceeded. Compared to an optimal NHST design, this leads on average to 50-70% smaller sample sizes, while having the same error rates. Furthermore, in contrast to NHST, its success is not dependent on a priori guesses of the true effect size. Finally, I give a quick overview of a priori Bayes factor design analysis (BFDA), which allows one to envisage the expected sample size (given an assumed true effect size) and to set a reasonable maximal sample size for the sequential procedure.
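The stopping rule described above can be sketched in a simplified setting. The code below assumes normally distributed observations with known unit variance and a normal prior with variance `g` on the effect under H1, which gives a closed-form Bayes factor. The Bayes factor used in the talk may differ, and all names, thresholds, and sample-size limits here are illustrative assumptions.

```python
import math
import random


def bf10_normal(sample_mean, n, g=1.0):
    """Bayes factor H1 vs. H0 for n observations from N(mu, 1).

    H0: mu = 0; H1: mu ~ N(0, g). Derived from the marginal distributions
    of the sample mean: N(0, 1/n) under H0 and N(0, g + 1/n) under H1.
    """
    z_squared = n * sample_mean ** 2
    weight = n * g / (1.0 + n * g)
    return math.sqrt(1.0 - weight) * math.exp(0.5 * z_squared * weight)


def sequential_bf(true_effect, threshold=10.0, n_min=10, n_max=1000, g=1.0, seed=0):
    """Add one observation at a time and stop once the evidence is decisive.

    Returns (sample size, decision, final Bayes factor). Requires n_max >= n_min.
    """
    rng = random.Random(seed)
    total = 0.0
    bf = 1.0
    for n in range(1, n_max + 1):
        total += rng.gauss(true_effect, 1.0)
        if n < n_min:
            continue  # collect a minimum sample before the first look
        bf = bf10_normal(total / n, n, g)
        if bf >= threshold:
            return n, "H1", bf
        if bf <= 1.0 / threshold:
            return n, "H0", bf
    return n_max, "undecided", bf


if __name__ == "__main__":
    for effect in (0.0, 0.5):
        n, decision, bf = sequential_bf(effect, seed=42)
        print(f"true effect {effect}: stopped at n = {n} with decision {decision}")
```

Repeating `sequential_bf` over many seeds for an assumed true effect gives the distribution of stopping sample sizes, which is the kind of quantity a Bayes factor design analysis examines before data collection.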

Break

14 : 45 - 15 : 15

Coffee break

15 : 15 - 16 : 00

You can lead horses to water - but how do you make them drink?

By Robert Matthews, Aston University Birmingham

The replication crisis has highlighted profound problems with standard inferential practice. It has also led to novel techniques for dealing with the challenges. However, unless widely adopted by the scientific community, these techniques will not achieve the goal of improving the reliability of research findings. I argue that the principal barrier to success lies in a misconception about the nature of knowledge that has persisted for millennia. As such, strong incentives must be offered to both researchers and journals to change their ways. I give some real-life examples of how such changes are being encouraged, and the price of failing to do so.

Wrap-up

16 : 00 - 16 : 30