On ‘meta-models’ for the analysis of simulation studies
Justifying the ‘conventional wisdom’ seems worthwhile
Note: This post assumes you’re familiar with the ideas of simulation studies and the language in our 2019 tutorial1. If you’re not, but you do simulation studies, read that paper rather than this post!
Nearly a quarter of a century ago, Anders Skrondal wrote a paper titled ‘Design and Analysis of Monte Carlo Experiments: Attacking the Conventional Wisdom’2. A compelling title.
The paper describes how the author typically saw simulation studies (‘Monte Carlo experiments’) being done and outlines three problems:
Monte Carlo error (‘statistical precision’ of the obtained results);
Generalisability (‘external validity’) of simulation results beyond scenarios explored;
Computational resource constraints.
This is really one constraint (the third bullet) that creates two problems. Essentially, we can’t always include all the data-generating mechanisms we might like to, and we can’t use as many repetitions as we’d like to achieve suitably small Monte Carlo error. Skrondal constructively proposed two solutions:
Specify a model (‘meta-model’) for the analysis of simulation results;
Pick your DGMs according to a fractional factorial design that makes the meta-model efficient.
The paper is an enjoyable read. Unfortunately I have not yet found a compelling use-case for its solutions. Despite mentioning it on many, many courses, I’ve seen almost no enthusiasm for fractional factorial designs in simulation studies (I know of two people who have used the idea and both reported regrets). People do seem far more enthusiastic about using models for the analysis of simulation results.
In defence of restriction/categorisation
Ok let’s go. Suppose we run a simulation study. We vary the sample size n_obs and compare two methods of analysis. We obtain the following results for bias and coverage (literally plucked out of thin air).
Like all simulation results, these are not the exact bias and coverage of methods A and B, but empirical estimates of them. The variation due to simulation is termed Monte Carlo error (typically quantified as Monte Carlo standard error or confidence interval; I’ve intentionally used neither here). If we saw the above results after five repetitions for each setting, we would anticipate far higher Monte Carlo error than after a billion reps.
Consider the accuracy of our estimated bias or coverage. The performance estimates depicted above, which have been calculated for each method at each sample size, i) are unbiased but ii) may have high Monte Carlo error. Can we reduce the Monte Carlo error? Perhaps. One option is simply to increase the number of repetitions (the Monte Carlo SE of bias shrinks as a function of √1/reps). Another is to use models.
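To make the repetitions point concrete, here is a minimal Python sketch with a toy DGM of my own devising (estimating a mean whose true value is 0, with a normal-based 95% interval); none of the numbers correspond to the figure above. It just shows how bias, coverage and the Monte Carlo SE of bias are computed, and that the last of these shrinks like √1/reps.

```python
import numpy as np

rng = np.random.default_rng(2019)

def one_rep(n_obs, rng):
    """One repetition of a toy DGM: estimate a mean whose true value is 0."""
    y = rng.normal(loc=0.0, scale=1.0, size=n_obs)
    est = y.mean()
    se = y.std(ddof=1) / np.sqrt(n_obs)
    return est, (est - 1.96 * se <= 0.0 <= est + 1.96 * se)

def performance(n_obs, n_reps, rng):
    """Empirical bias and coverage, plus the Monte Carlo SE of the bias estimate."""
    ests, covered = zip(*(one_rep(n_obs, rng) for _ in range(n_reps)))
    ests = np.array(ests)
    bias = ests.mean() - 0.0                        # true value is 0 in this toy DGM
    mcse_bias = ests.std(ddof=1) / np.sqrt(n_reps)  # shrinks like sqrt(1/n_reps)
    return bias, mcse_bias, np.mean(covered)

for n_reps in (5, 500, 50_000):
    bias, mcse, cov = performance(n_obs=200, n_reps=n_reps, rng=rng)
    print(f"reps={n_reps:>6}: bias={bias:+.4f} (MCSE {mcse:.4f}), coverage={cov:.3f}")
```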
Ignoring extrapolation for now, the idea of using a model is that we are ignoring some information in the data when we simply restrict to (say) method A at n_obs=200 and estimate coverage. The enthusiasm tends to take forms like:
Looking at the figure above, we should fit a line/curve through the points instead of restricting (which is a bit like categorising); see the sketch after this list.
When methods A and B are applied to the same simulated datasets, their results are correlated. Ignoring this wastes information. Having a model that accounts for it must reduce Monte Carlo error. See this post (face it, we’re going to report bias of method A and bias of method B, not their difference, and it’s on the difference that you gain precision).
We have 2+ performance measures and performance estimates are correlated (e.g. we know that bias ⬆️ → coverage ⬇️), so we can estimate performance on one measure borrowing information from the others via some multivariate model.
We can fit some prediction model to the results, which will identify the DGM variables that matter and drop those that don’t.
Relatedly, we might be confident that higher-order interactions between DGM variables do not exist. Rather than estimating them, we set them to 0, and that’s how we gain precision (this was Skrondal’s approach). This invokes the marginality principle (on which I have… views3).
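To make the first bullet concrete, here is a minimal sketch of ‘restriction’ versus one possible meta-model: a logistic regression of the per-repetition coverage indicator on method and n_obs. It assumes pandas and statsmodels are available, the per-repetition results are entirely made up, and the model form is my own illustrative choice rather than anything from Skrondal’s paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical per-repetition results: one row per (repetition, method, n_obs),
# with an indicator of whether the interval covered the true value.
rows = []
for n_obs in (50, 100, 200, 500):
    for method in ("A", "B"):
        true_cov = 0.95 if method == "A" else 0.90     # made-up 'true' coverages
        for c in rng.binomial(1, true_cov, size=1000):
            rows.append({"method": method, "n_obs": n_obs, "covered": c})
df = pd.DataFrame(rows)

# (a) Restriction: the empirical proportion in the cell of interest.
cell = df.query("method == 'A' and n_obs == 200")
print("restriction:", cell["covered"].mean())

# (b) Meta-model: logistic regression of the coverage indicator on method and n_obs,
#     predicted at method A, n_obs = 200. Only gains precision if the model is right.
meta = smf.logit("covered ~ C(method) + np.log(n_obs)", data=df).fit(disp=0)
new = pd.DataFrame({"method": ["A"], "n_obs": [200]})
print("meta-model :", meta.predict(new).iloc[0])
```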
In all these cases, there is an idea that we are usually ignoring some ancillary information that would help us reduce Monte Carlo error. I don’t think it’s necessarily true for any of the cases.
In the third bullet, for example, that correlation is only going to gain us precision if we make strong modelling assumptions, or if we have missing data in one measure when the other is observed (for coverage & bias, it seems pretty unlikely that we would have confidence bounds but no point estimate, though the reverse is of course possible).
Model misspecification
The catch with all this is that you have to be careful about the specification of your model. If not, your estimates of simulation performance could be biased in a way that has nothing to do with the method’s performance. Clearly you don’t want to risk that. To know that the model is right, you need some theoretical knowledge. If you already have that knowledge, your simulation study is probably a ‘check’, so you wouldn’t really need the model.
There are exceptions. If we’ve varied sample size n_obs, as in my hypothetical example above, we might be happy to model empirical SE as a function of √1/n_obs.
This leads to the question of how we weight each value of n_obs when estimating performance using meta-models (e.g. in the above figure we would have four data-points for empirical SE). Equally? According to Monte Carlo precision (empirical SE has lower Monte Carlo SE as n_obs increases)? Some other rationale?
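For illustration, here is a minimal sketch (made-up numbers, numpy only) of that kind of meta-model: a one-parameter regression of empirical SE on √1/n_obs through the origin, fitted once with equal weights and once with weights based on Monte Carlo precision, leaving the choice between them open.

```python
import numpy as np

# Hypothetical empirical SEs for one method at each n_obs (made-up numbers),
# each estimated from n_reps repetitions.
n_obs  = np.array([50, 100, 200, 500], dtype=float)
emp_se = np.array([0.142, 0.101, 0.070, 0.045])
n_reps = 1000

# Meta-model: empSE = c / sqrt(n_obs), a regression through the origin.
x = 1 / np.sqrt(n_obs)

def fit_c(weights):
    """Weighted least squares for empSE = c * x with no intercept."""
    w = np.asarray(weights, dtype=float)
    return np.sum(w * x * emp_se) / np.sum(w * x**2)

# Option 1: weight every n_obs equally.
c_equal = fit_c(np.ones_like(x))

# Option 2: weight by Monte Carlo precision, using the approximation
# MCSE(empSE) ~ empSE / sqrt(2 * (n_reps - 1)), so precision is 1 / MCSE**2.
mcse_emp_se = emp_se / np.sqrt(2 * (n_reps - 1))
c_precision = fit_c(1 / mcse_emp_se**2)

for label, c in [("equal weights", c_equal), ("precision weights", c_precision)]:
    print(f"{label:>17}: fitted empSE at n_obs=200 is {c / np.sqrt(200):.4f}")
```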
Generalisation/extrapolation?
One of the claims Skrondal made about meta-models is that they allow extrapolation beyond the settings actually explored in the simulation study. Come on now.
So…
Generally I don’t think models are a particularly useful idea for estimating performance in simulation studies. Lots of the reasons we use models for empirical data are to do with having to make do with what we’ve got. It would be impossible (or would take a lot of waiting) to recruit 2,000 people aged 74 years 5 months 2 days, so an analysis cannot estimate something at that age simply by restricting to people of that age. Instead we have to rely on smoothing across the ages of the participants we did recruit. (Note that in this analogy, age stands in for a specific DGM.) This is typically not the case for simulation studies: we can simply increase the number of repetitions.
I’m jumpy about using models for simulation results. You don’t want to claim problems like ‘bias’ of certain methods that actually arise from the way you analysed the results of your simulation study. Meta!
1. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Statistics in Medicine. 2019; 38: 2074–2102. https://doi.org/10.1002/sim.8086
2. Skrondal A. Design and Analysis of Monte Carlo Experiments: Attacking the Conventional Wisdom. Multivariate Behavioral Research. 2000; 35(2): 137–167. https://doi.org/10.1207/S15327906MBR3502_1
3. Morris TP, van Smeden M, Pham TM. The marginality principle revisited: Should “higher-order” terms always be accompanied by “lower-order” terms in regression analyses? Biometrical Journal. 2023; 65: 2300069. https://doi.org/10.1002/bimj.202300069