Many analyses
A common concern researchers have – about our own work as well as others’ – is that decisions we make about the analysis will impact on our eventual inference. Lots of people try to be explicit and argue for the decisions they made, but the fact we argue for them suggests that others might have made different choices. We then worry about how sensitive our inference is to our particular choices.
I recently travelled to Germany for the first time to attend an excellent symposium, ‘Navigating the Landscape of Scientific Integrity in the Field of Medicine’, at LMU CAS in Munich. One of the first talks, by Balazs Aczel, was on the topic of so-called ‘multi-analyst studies’.
Disclaimer: I have a pretty superficial understanding of some multiple analysis studies, so may be caricaturing/misrepresenting them below.
So Balazs Aczel’s talk began by describing analyses that involve trying every possible choice and presenting all the results. I believe he referred to this as ‘multiverse analysis’, though looking again at papers about multiverse analysis, every possible choice doesn’t seem quite accurate. He said ‘just because you took all the combinatorically possible choices, doesn’t mean they make sense’.
Good point. What people tend to do is then exclude choices that make no sense. I think this is actually implicit or explicit in multiverse papers. Also in specification curve analysis, e.g. this paper, the authors reasonably say you should exclude ‘combinations that are invalid or redundant’.
The bar for an analysis to remain ‘in’ – that it not be invalid or redundant – is really low. Clearly the ‘in’ analyses are not alike: some will make stronger assumptions, less plausible assumptions, ‘quite silly’ but not necessarily invalid choices, etc. Ideally we would give more weight to analyses that make weaker and/or more plausible assumptions. Trying to quantify weakness and plausibility would start a never-ending argument; let’s just agree that it could not be done satisfactorily. It seems more feasible to weight each ‘in’ analysis according to the probability of researchers making that particular set of choices.
You could view multi-analyst studies as doing this empirically, by forcing each research team to make their set of choices. This is neat, but there are a few issues. First, the sample size (in terms of the number of teams) means we can only hope for a coarse approximation to the distribution of choices that might be made by all qualified[1] researchers. Suppose there were just 10 binary decisions, meaning 2^10 = 1,024 possible analyses, and each choice might be made by some analyst team, but not uniformly… I think you see the problem. At best we might hope for tens of teams. The second issue is that not every team has the same expertise. You might argue that it doesn’t matter because we want some idea about the results a spectrum of researchers might produce given the same data. But we got to this point because we don’t believe every non-invalid, non-redundant analysis is equal.
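To put rough numbers on the first issue, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the 30 teams, the made-up per-decision probabilities, and the idea that teams make each binary decision independently. It simply counts how many of the 2^10 possible analyses a few dozen teams could actually land on.

```python
import random


def simulate_coverage(n_decisions=10, n_teams=30, seed=0):
    """Count how many distinct analyses a set of teams ends up choosing.

    Hypothetical setup: each of the n_decisions binary choices has its own
    (made-up) probability of going one way, and teams choose independently.
    """
    rng = random.Random(seed)
    # Non-uniform, purely illustrative probabilities for each binary decision.
    probs = [rng.uniform(0.1, 0.9) for _ in range(n_decisions)]
    n_possible = 2 ** n_decisions

    chosen = set()
    for _ in range(n_teams):
        # One team's analysis = one combination of the binary decisions.
        analysis = tuple(int(rng.random() < p) for p in probs)
        chosen.add(analysis)
    return n_possible, len(chosen)


n_possible, n_distinct = simulate_coverage()
print(f"possible analyses: {n_possible}")
print(f"distinct analyses chosen by teams: {n_distinct}")
print(f"share of the analysis space observed: {n_distinct / n_possible:.1%}")
```

Whatever the seed, 30 teams can cover at most 30 of the 1,024 combinations, i.e. under 3% of the space, so any estimate of how likely each combination is to be chosen is bound to be coarse.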
The third issue seems like an elephant in the room with multi-analyst studies: nominally, they only focus on analysis decisions. The group I work in currently has four programmes of methodology research, broadly labelled Design, Conduct, Analysis, and Meta-analysis. My role is spread across the Analysis, Meta-analysis and Design programmes. Loads of big decisions are made during study design and conduct. Sensitivity to those choices might dwarf sensitivity to analysis choices, but multi-analyst studies only focus on analysis.
I spent a moment thinking: what if we wished to incorporate variability due to design and data collection decisions… imagine if we asked groups of people to design and analyse their own studies. Then I realised I was sort of re-inventing meta-analysis! The individual study results and the estimation of heterogeneity in meta-analysis tell us something about this (the actual pooling does not). Interestingly, an often-touted advantage of IPD[2] meta-analysis is the ability to harmonise the analysis of each contributing study (see e.g. here), which makes it a contrast to multi-analyst studies rather than an extension of them. Of course, you could go further and do multi-meta-analyst studies.
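For anyone who wants to see what ‘estimation of heterogeneity’ looks like in practice, here is a minimal sketch in Python. It uses the DerSimonian-Laird moment estimator, which is just one common choice that I am assuming for illustration; the function name and the study estimates and variances are made up, not taken from any real meta-analysis.

```python
def dersimonian_laird(estimates, variances):
    """Cochran's Q, DerSimonian-Laird tau^2 and I^2 for a set of studies.

    estimates: per-study effect estimates
    variances: per-study within-study variances (squared standard errors)
    """
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    weights = [1.0 / v for v in variances]  # inverse-variance weights
    sum_w = sum(weights)
    # Fixed-effect pooled estimate, used only as the centre for Q.
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum_w

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = k - 1
    scale = sum_w - sum(w ** 2 for w in weights) / sum_w

    tau2 = max(0.0, (q - df) / scale)  # between-study variance
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0  # share of variation beyond chance
    return {"Q": q, "tau2": tau2, "I2": i2}


# Made-up estimates and variances from five hypothetical studies.
example = dersimonian_laird(
    estimates=[0.10, 0.35, -0.05, 0.50, 0.20],
    variances=[0.01, 0.02, 0.015, 0.03, 0.01],
)
print(example)
```

The between-study variance (tau2) is the part that speaks to variability from design, conduct and analysis decisions taken together; the pooled estimate on its own says nothing about it.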

Vaguely related to the above, my talk at the LMU CAS workshop was on ‘Analysis plans, estimands, sense and sensitivity’. I mentioned our sense-check definition of relevant sensitivity analyses and pointed out that multi-analyst studies do not necessarily follow requirements 1 or 3 of a sensitivity analysis.
1. Must target the same estimand
2. Must be able to give different answers (i.e. you’re not doing the same analysis by two names)
3. Must have some uncertainty about which you would bet on if they give different answers
A paper by Katrin Auspurg and Josef Brüderl makes the first point with respect to a many-analysts study. Their criticism is that the analyst teams were given a vague question, so each team effectively inferred its own estimand. The distribution of results seen in that study, Auspurg and Brüderl argue, was not a reflection of analysis decisions, but of different teams picking different estimands (whether intentionally or unwittingly[3]). They show that, once we settle on an estimand, the amount of variation goes right down.
I think Auspurg and Brüderl’s main point is absolutely right. Sources of variation tend to matter[4], and ultimately it is dishonest to describe variation in the choice of estimand as variation in analysis choices, even if the analyst teams weren’t expert enough to realise.
Sabine Hoffmann had an interesting take: we should judge variation in results that would lead to the same headline interpretation, rather than results that have the same estimand. Using the example discussed in the papers above, many different estimands may lead to the same blunt headline ‘skin tone affects red cards’. This argument doesn’t change my mind but I really appreciated this fresh perspective – thought-provoking enough that I wanted to record it here (with consent)! I should say that I don’t know if this is a strongly held view of Sabine’s; it may have just been musing or playing devil’s advocate.
All in all, I’m really grateful to Anne-Laure Boulesteix and Sabine Hoffmann for organising such an excellent workshop! Loads of superb talks and discussions.
[1] In some vague sense.
[2] IPD = Individual Participant Data.
[3] lol just kidding it was definitely unwitting if you look at what the various teams did
[4] If you don’t believe me, ask Stephen Senn

