Simulation studies: correlation when methods are applied to the same dataset
I will get to the point by the end
Our thinking on simulation studies has come a long way in the last couple of decades. A few years ago I gave a talk on simulation and Hans van Houwelingen asked if we should worry about issues with random-number generators. I was surprised by the question. It’s interesting reading Brian Ripley’s book, where early on he discusses some simple algorithms for pseudo-random number generation (and their flaws)1. I’m not a good enough mathematician to critique modern RNG algorithms, but my understanding is that the fundamental concerns have been addressed, and we can trust that the RNGs implemented in most statistical software are fit for purpose. This permits us to focus on more conceptual issues, which suits me well.
Last week I visited Amsterdam and attended an excellent symposium (‘Time to Get Real! Using real-world data for health-economic outcomes research: are methods up to the task?’). I started my talk by describing how I got into working on simulation studies. As a PhD student in 2010–13, I needed to use simulation for various projects. Realising that I knew very little about simulation studies, I wanted to attend a course and several times asked my supervisors, Ian White and Patrick Royston, if they could recommend one. They always came up blank. At the time, there weren’t any. Given how many statisticians used simulation studies, and how many admitted they had just learned blind (for better or worse), this surprised me. I eventually realised how lucky I was to be working with Ian and Patrick, given their expertise in simulation studies, and Ian and I started coming up with ideas and principles about simulation studies that I hadn’t seen written down anywhere.
Sidenote: advice for scaling up an interactive, online course?
After finishing my PhD, I approached Ian White and Michael Crowther to suggest that we run a short course, which is what our tutorial paper was based on2. The course has been popular – this year it sold out within five days. A few weeks later, we have a waiting list of over half the course capacity (40 people).
I’m writing this section because I’d like advice on how to scale it up. The obvious solution is to run it twice a year, which I’m not keen to do: I’m working on a lot of other stuff, and though running a short course is fun, the prep is time-consuming and the course itself tiring. But I do feel bad for people who email saying they urgently need it and didn’t get a place, especially PhD students who then have to learn some of this the way I did (see above), though admittedly they have our tutorial now. Our current course is interactive. Attendees work together on their own simulation study in breakout rooms, which they present at a final session. We pop in and out of breakout rooms and there’s plenty of space for general and specific questions. I don’t want to lose these aspects but don’t know how we could squeeze more people into the same time and keep all of this.
I imagine there are lots of ways to make the course better for online learning that I don’t know about, and that we could probably run it on a larger scale. So I’d love to hear how you have managed this, or to get links to ideas, resources, etc.
Anyway, back to the programme…
Dependence between results on the same simulated datasets
When I asked around for resources on simulation studies, people would often recommend the paper by Burton, Altman, Royston and Holder3. I found it good for some concepts, but got frustrated with various things4, which was in part why we started the course and wrote our tutorial paper. Their main structure, in figure 1, was IMO more like a brainstorm… like a really tangled-up version of ADEMP, though with a few different points. Patrick Royston, one of my supervisors, was a co-author. At one point I told him I didn’t like the paper much, and explained why. To my surprise, rather than get defensive, he agreed!
One point I think is misdirected in Burton et al. was the emphasis on dependence between simulated datasets. My view is that this is both 1) useful for the design and 2) ignorable in the analysis. Well, usually ignorable at least.
Why do we see dependence?
Here’s how it works. We analyse each simulated dataset using two or more methods. People often do this without a second thought – it would seem a waste not to analyse each dataset with each method! However, it’s worth understanding that this does induce a correlation between results from different methods in the ‘estimates dataset’. An example from a simulation study I did ages ago is given in the figure here. You can see the positive correlation when the two methods are applied to the same simulated datasets.
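To make this concrete, here is a minimal sketch in Python (the data-generating mechanism and the two estimators are invented for illustration, not taken from that study): two estimators of the same mean, applied to the same simulated datasets, produce clearly correlated columns in the estimates dataset.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_sim, n_obs, theta = 1000, 50, 0.5

est_a = np.empty(n_sim)  # method A: sample mean
est_b = np.empty(n_sim)  # method B: trimmed mean (10% from each tail)
for i in range(n_sim):
    y = rng.normal(loc=theta, scale=1.0, size=n_obs)  # one simulated dataset
    est_a[i] = y.mean()                               # both methods analyse the *same* data
    est_b[i] = np.sort(y)[5:-5].mean()                # trim 5 observations from each tail

# Correlation between the two columns of the 'estimates dataset'
print(np.corrcoef(est_a, est_b)[0, 1])  # clearly positive
```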
Burton et al. give this great prominence: it’s point 2a in their figure 1, appearing (bizarrely) before even stating how data are to be generated.
By the way, if we wanted no correlation between the results of methods, we could easily remove it. Instead of applying two or more methods of analysis to each simulated dataset, we could simulate more datasets and apply only one method to each. If you think this is a waste of computational time, I disagree: generating datasets typically takes up very little of the computational resource of a simulation study (my memory is poor, but I think Tra My Pham and Alessandro Gasparini have pointed this out to me on different occasions). I’ve seen people use independent datasets for each method and you absolutely can. I don’t think you should, at least in general.
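Continuing the invented set-up from the sketch above, this is roughly what the ‘one method per dataset’ alternative looks like: the correlation in the estimates dataset disappears, at the price of generating twice as many datasets.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_sim, n_obs, theta = 1000, 50, 0.5

est_a = np.empty(n_sim)
est_b = np.empty(n_sim)
for i in range(n_sim):
    y_a = rng.normal(theta, 1.0, n_obs)   # dataset analysed only by method A
    y_b = rng.normal(theta, 1.0, n_obs)   # a different dataset, analysed only by method B
    est_a[i] = y_a.mean()
    est_b[i] = np.sort(y_b)[5:-5].mean()

print(np.corrcoef(est_a, est_b)[0, 1])  # close to zero
```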
Why is dependence a good thing?
Correlation arising from applying multiple methods to the same simulated datasets is often a good thing. Simulation studies are subject to Monte Carlo error: variation due to using a finite number of simulated datasets (you might be interested in this post about Monte Carlo error in multiple imputation). Why is the correlation good? Well, suppose we simulate a particular bunch of datasets that lead to a truly unbiased method appearing biased. This apparent bias may concern us. If all our methods were applied to the same simulated datasets, they would likely be affected in the same way, and this would hint that the apparent bias may be (partly or wholly) due to Monte Carlo error. Sometimes a simulation study includes one or more methods for which we have theoretical knowledge (e.g. this method is unbiased or has bias of known magnitude), meaning we can check the bias of other methods against it: ‘Method A is known to be unbiased but here appears slightly biased, so perhaps we need to take the apparent bias of some other methods with a pinch of salt’ (note: this is a hint that you should probably add more repetitions… not replace the existing ones).
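A small sketch of the mechanism, again using the invented estimators from above: re-running a toy simulation study many times shows that, when two methods share datasets, their apparent biases move together, which is exactly what lets a method with known properties act as a sanity check for the others.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_meta, n_sim, n_obs, theta = 500, 200, 50, 0.5

bias_a = np.empty(n_meta)  # apparent bias of method A in each re-run
bias_b = np.empty(n_meta)  # apparent bias of method B in each re-run
for m in range(n_meta):
    est_a = np.empty(n_sim)
    est_b = np.empty(n_sim)
    for i in range(n_sim):
        y = rng.normal(theta, 1.0, n_obs)   # dataset shared by both methods
        est_a[i] = y.mean()
        est_b[i] = np.sort(y)[5:-5].mean()
    bias_a[m] = est_a.mean() - theta
    bias_b[m] = est_b.mean() - theta

# Apparent biases are strongly positively correlated across re-runs:
# an unlucky batch of datasets pushes both methods in the same direction.
print(np.corrcoef(bias_a, bias_b)[0, 1])
```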
Why can we usually ignore the dependence?
We would need to account for the dependence of our estimates if the analysis of our simulation study involved contrasting the performance of methods: for example, ‘what is the difference in bias of method A vs. method B?’, or ‘does method A have significantly more bias than method B?’. Such contrasts are not usually of much interest in simulation studies. Why not? Because for most performance measures we quantify absolute rather than relative performance. For example (a rough sketch of computing these for a single method follows the list):
Bias: aim for bias to be zero and compare estimated bias with 0
Coverage: aim for nominal level (e.g. 95%) and compare estimated coverage with 95%
Empirical SE: aim for lowest possible (at least for a given bias; see tutorial paper)
Power: aim for as high as possible (at least for given type I error; again see tutorial)
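For concreteness, here is a rough sketch (one invented method, a normal-mean example) of how these absolute measures and their Monte Carlo SEs are computed from a single method’s column of the estimates dataset; results from other methods never enter.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_sim, n_obs, theta = 1000, 50, 0.5

est = np.empty(n_sim)   # point estimates from one method
se = np.empty(n_sim)    # their model-based standard errors
for i in range(n_sim):
    y = rng.normal(theta, 1.0, n_obs)
    est[i] = y.mean()
    se[i] = y.std(ddof=1) / np.sqrt(n_obs)

# Bias: compare with 0
bias = est.mean() - theta
bias_mcse = est.std(ddof=1) / np.sqrt(n_sim)

# Coverage of a normal-based 95% confidence interval: compare with 95%
covered = (est - 1.96 * se <= theta) & (theta <= est + 1.96 * se)
coverage = covered.mean()
coverage_mcse = np.sqrt(coverage * (1 - coverage) / n_sim)

# Empirical SE: lower is better, for a given bias
emp_se = est.std(ddof=1)
emp_se_mcse = emp_se / np.sqrt(2 * (n_sim - 1))

print(f"bias {bias:.4f} (MCSE {bias_mcse:.4f}); "
      f"coverage {coverage:.3f} (MCSE {coverage_mcse:.3f}); "
      f"empirical SE {emp_se:.4f} (MCSE {emp_se_mcse:.4f})")
```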
If we are interested in relative performance (difference in bias between methods, etc.), the dependence between results that come from analyses of the same datasets does not even matter for the estimated difference/contrast itself! It matters when we come to estimating its Monte Carlo error5. So if you ignore the dependence when computing the (Monte Carlo) standard error of the contrast, you will be slightly conservative. That is, you should calculate the MCSE using something that looks like
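$$\widehat{\mathrm{MCSE}}\!\left(\widehat{\mathrm{bias}}_A - \widehat{\mathrm{bias}}_B\right) = \sqrt{\frac{\widehat{\mathrm{Var}}\!\left(\hat{\theta}_A\right) + \widehat{\mathrm{Var}}\!\left(\hat{\theta}_B\right) - 2\,\widehat{\mathrm{Cov}}\!\left(\hat{\theta}_A,\hat{\theta}_B\right)}{n_{\mathrm{sim}}}},$$

where $\widehat{\mathrm{Var}}$ and $\widehat{\mathrm{Cov}}$ denote the sample variance and covariance of the two methods’ estimates across the $n_{\mathrm{sim}}$ repetitions.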
Ignoring the final term will slightly overestimate the MCSE. This could be a problem if it leads you to conclude that the bias of method A is no worse than that of method B, when accounting for their correlation would have reduced the noise in this comparison.
Summary
This post was prompted by hearing someone say how important it is to account for the dependence between methods’ results on the same simulated datasets when we come to the analysis. I suspect this came from reading Burton et al. I hope the above clarifies two points:
The dependence induced by applying methods to the same simulated datasets is a good thing;
The situations where you actually have to account for it in the analysis are very niche, and the consequence of not doing so is that you get slightly conservative estimates of Monte Carlo SE.
1. BD Ripley. Stochastic Simulation. Wiley, 1987. doi:10.1002/9780470316726
2. TP Morris, IR White, MJ Crowther. Using simulation studies to evaluate statistical methods. Statistics in Medicine. 2019; 38(11): 2074–2102. doi:10.1002/sim.8086
3. A Burton, DG Altman, P Royston, RL Holder. The design of simulation studies in medical statistics. Statistics in Medicine. 2006; 25(24): 4279–4292. doi:10.1002/sim.2673
4. If there is interest I could write a longer critique. I don’t think that’s particularly needed or fair – it was a useful paper for the time – but I know some people can’t see the issues and so it might be helpful to explain.
5. If this isn’t obvious, have a look at the expression for the variance of the A-minus-B contrast. Suppose you are interested in E(A); then it’s clear that when you estimate it, you would use Var(A) to quantify Monte Carlo error. Cov(A,B) doesn’t carry information about it unless there are missing data on A. But even then it’s inappropriate to use this information in simulation studies! I was going to say that’s another post for another time, but it’s going to feature in a paper I’ve been working on for way too long.
Edit: thanks to Rick Wicklin for gently pointing out the sloppy language in ‘dependence between simulated datasets’. I’ve now edited (I think) to clarify.
Also, Björn Siepe noted a pre-print from Joshua B. Gilbert (https://arxiv.org/abs/2401.07294), which I must read, as this post seems (superficially) to contradict it.