Hey, it’s been a while since I posted! I actually have 10 posts in draft from the last few weeks but haven’t had the motivation to finish them off. Here’s one that’s finished enough.
I love a good bugbear. A good bugbear is a source of irritation that (sometimes) sufficiently energises me to do something. Bugbears are often about common misunderstandings. For instance, our paper defining what sensitivity analysis means came from annoyance that people interpreted the term to mean ‘literally any other analysis I can think of’. Used that way, the term kept introducing false reassurance or confusion.
I did say good bugbear. Twice. There are bad bugbears. These are the ones that irritate you but either you have no outlet, or putting it right makes no practical difference to anything, so they are just a pointless energy drain.
Why is MICE not a Gibbs sampler?
Here is a bad bugbear I have: people keep describing the multivariate multiple-imputation procedure known as MICE (Multivariate Imputation by Chained Equations) as ‘a Gibbs sampler’. Unless I’ve misunderstood what constitutes a Gibbs sampler – which is possible and I guess we’ll find out in the comments – it’s not.
This classic paper on MICE describes it as follows:
The first variable with missing values, x1 say, is regressed on all other variables x2, …, xk, restricted to individuals with the observed x1. Missing values in x1 are replaced by simulated draws from the corresponding posterior predictive distribution of x1. Then, the next variable with missing values, x2 say, is regressed on all other variables x1, x3, …, xk, restricted to individuals with the observed x2, and using the imputed values of x1. Again, missing values in x2 are replaced by draws from the posterior predictive distribution of x2. The process is repeated for all other variables with missing values in turn: this is called a cycle.
– White, Royston & Wood (2010)
This sounds quite like a Gibbs sampler, doesn’t it? We fix the current imputed values of the other x’s and impute this one, visiting each of x1, …, xk in turn. So?
In the above description, note that the regression model is restricted to individuals with the observed x each time. So it’s not conditioning on the full set of current values the way a Gibbs sampler would! As you’ll see in that paper, ‘proper’ imputation involves first taking a draw of the parameters of the imputation model, and then drawing the missing values in x, given the current parameter draw, from the posterior predictive distribution. In MICE, fitting of the imputation model is only done in individuals with observed x. Previously imputed values of this x are not included when the imputation model is fitted.
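To make that concrete, here’s a minimal sketch of the above in Python – mine, not any package’s actual code – using the standard normal-linear-model posterior for the ‘proper’ parameter draws. Function names are made up and there’s no error handling:

```python
import numpy as np

rng = np.random.default_rng(2024)

def proper_impute_step(X, miss, j):
    """One 'proper' imputation step for column j (illustrative sketch).

    The regression is fitted only on rows where x_j was observed (the
    MICE short-cut): previously imputed values of x_j are discarded,
    not conditioned on. Parameters are then drawn from their posterior
    and the missing x_j from the posterior predictive distribution.
    """
    obs = ~miss[:, j]
    mis = miss[:, j]
    others = np.delete(np.arange(X.shape[1]), j)
    A = np.column_stack([np.ones(obs.sum()), X[np.ix_(obs, others)]])
    y = X[obs, j]

    # Least-squares fit using only the observed-x_j rows.
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta_hat
    df = obs.sum() - A.shape[1]

    # Draw sigma^2 (scaled inverse chi-squared), then beta | sigma^2,
    # then the missing values from the posterior predictive.
    sigma2 = resid @ resid / rng.chisquare(df)
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(A.T @ A))
    A_mis = np.column_stack([np.ones(mis.sum()), X[np.ix_(mis, others)]])
    X[mis, j] = A_mis @ beta + rng.normal(0.0, np.sqrt(sigma2), mis.sum())

def mice(X, n_cycles=5):
    """Chained-equations imputation of X, with np.nan marking missingness."""
    X = X.copy()
    miss = np.isnan(X)
    for j in range(X.shape[1]):       # crude starting values: column means
        X[miss[:, j], j] = np.nanmean(X[:, j])
    for _ in range(n_cycles):         # a handful of cycles, not thousands
        for j in range(X.shape[1]):
            if miss[:, j].any():
                proper_impute_step(X, miss, j)
    return X
```

Note the fit uses only the observed-x_j rows; a true Gibbs update for x_j would instead condition on all current values, including the previously imputed x_j.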
Ultimately I don’t think this point really matters, so it’s probably a bad bugbear. Here’s the small way in which it does matter: with a Gibbs sampler, you’d expect to need lots of iterations of this process. With MICE, you don’t. Software defaults are, from memory, 5 (R package {mice}), 10 (Stata’s {mi impute chained}) or 20 (SAS’s {proc mi}, fcs statement). This number is surprisingly low because MICE is not a Gibbs sampler: it takes a short-cut that reduces what would otherwise be an unhelpful autocorrelation, which is what would make more cycles necessary.
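As an aside, scikit-learn’s MICE-inspired IterativeImputer has the same flavour of default: max_iter=10 cycles. A quick usage sketch (toy data made up; sample_posterior=True makes it draw imputations rather than plug in predicted means):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [np.nan, 8.0, 9.0],
              [10.0, 11.0, np.nan]])

# max_iter is the number of cycles; the default of 10 is in the same
# ballpark as the MICE software defaults above.
imputer = IterativeImputer(estimator=BayesianRidge(), sample_posterior=True,
                           max_iter=10, random_state=0)
X_completed = imputer.fit_transform(X)
```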
A couple of people have blamed van Buuren’s book for describing MICE as a Gibbs sampler, but I just checked and it doesn’t seem to. It does say ‘if the conditionals are compatible’, but I don’t think that’s saying MICE is a Gibbs sampler; it’s saying ‘MICE behaves like a Gibbs sampler under certain conditions’.
By the way, MICE is a type of fully-conditional specification, and it’s not the only one. At one point Jonathan Bartlett mentioned that his smcfcs approach does not do it the way MICE does (i.e. it keeps the currently-imputed values of x when fitting the imputation model for x).
Bad bugbears
OK, here are a few bad bugbears:
Minor typographic or grammatical things. I get slight irritation when people use a hyphen instead of an en-dash in case–control (no, it’s not a compound adjective and could equivalently be named a control–case study), or Box–Cox or Kaplan–Meier (no, those are not double-barrelled surnames). This is classic bugbear material because typographic conventions a) take time to learn and b) convey subtle information. Counterpoint: people who don’t know what subtle information is conveyed aren’t hurt by not knowing (they’re not thinking Kaplan and Meier had a baby and gave it a double-barrelled surname). I’m talking to myself here, not accusing you, and obviously it is more important in some professions than others.
Being picky about a particular referencing style in someone’s draft instead of commenting on substance. ‘This isn’t Vancouver, I want Vancouver.’ Who cares if someone put the publication year after the authors instead of after the journal name or whatever! I think this is worse than typographic bugbears because it’s just making a comment for the sake of it: it took you time to learn particular styles, and once you know, you know, but ultimately getting it perfect doesn’t change the value of the information (in the words of Principal Skinner, ‘prove me wrong, kids’).
Other fields’ terminology or ways of looking at things. Think about the way economists and epidemiologists use the term selection bias differently. Also ‘logistic regression is a classification algorithm’ vs. ‘logistic regression is regression’. Sure, I think of logistic regression as a regression model because I was taught about it as a generalised linear model, but I’m not sure I can bring myself to care if other people regard it as a classification algorithm. Other than not properly crediting statisticians, how much does it actually matter? Statisticians don’t get credited all the time (see this example from Maarten van Smeden)!
Finding new words annoying. Like learnings (though it reminds me of that Calvin & Hobbes strip that ends with ‘verbing weirds language’ – credit to Matt Sydes for pointing that out) and corporate-speak words. You might as well just get used to new terms – especially if you have to use them (‘List four learnings’). Hey, they might go away!
What are yours?
My colleagues and I discussed this issue in these three articles:
Gelman, A. and Raghunathan, T. E. (2001). Using conditional distributions for missing-data imputation. Discussion of "Conditionally specified distributions" by B. Arnold et al. Statistical Science 16, 268–269. http://stat.columbia.edu/~gelman/research/published/arnold2.pdf
Kropko, J., Goodrich, B., Gelman, A., and Hill, J. (2014). Multiple imputation for continuous and categorical data: Comparing joint and conditional approaches. Political Analysis 22, 497–519. http://stat.columbia.edu/~gelman/research/published/MI_manuscript_RR.pdf
Liu, J., Gelman, A., Hill, J., Su, Y.-S., and Kropko, J. (2014). On the stationary distribution of iterative imputations. Biometrika 101, 155–173. http://stat.columbia.edu/~gelman/research/published/mi_theory9.pdf
It's not really clear to me what the MICE procedure gets us, compared to training exclusively on the observed data. I recently proposed a method, UnmaskingTrees, which introduces new missing values (with a masking rate that is itself drawn from Uniform[0, 1]) and then trains XGBoost predictive models to impute them; then it applies these models to the actual missing values. So rather than training on its own model outputs as does MICE, UnmaskingTrees creates new (MCAR) missingness to autoregressively train itself.
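To be concrete, here's a minimal sketch of that masking idea (my illustrative reading, not the actual UnmaskingTrees implementation; function names are made up), relying on XGBoost's native handling of NaN features:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

def fit_column_models(X, n_trees=100):
    """Train per-column imputers on observed values only (sketch)."""
    n, k = X.shape
    # Introduce extra MCAR missingness, with the masking rate itself
    # drawn from Uniform[0, 1], so each model learns to predict from
    # partially missing feature rows (XGBoost accepts NaN features).
    X_masked = X.copy()
    rate = rng.uniform(0.0, 1.0)
    X_masked[rng.random((n, k)) < rate] = np.nan
    models = []
    for j in range(k):
        obs = ~np.isnan(X[:, j])       # rows with ground truth for column j
        features = np.delete(X_masked, j, axis=1)
        model = xgb.XGBRegressor(n_estimators=n_trees)
        model.fit(features[obs], X[obs, j])  # targets are real values, never model outputs
        models.append(model)
    return models

def apply_models(X, models):
    """Fill in the genuinely missing entries using the trained models."""
    X_imp = X.copy()
    for j, model in enumerate(models):
        mis = np.isnan(X[:, j])
        if mis.any():
            X_imp[mis, j] = model.predict(np.delete(X, j, axis=1)[mis])
    return X_imp
```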
As measured on downstream prediction tasks, UnmaskingTrees (https://github.com/calvinmccarter/unmasking-trees) does better (https://arxiv.org/pdf/2407.05593), though it performs worse on intrinsic performance metrics. Anecdotally, MICE adds more noise to the imputed values, probably due to the random initialization of the missing values, while UnmaskingTrees is more "conservative", especially with a high missingness rate: https://gist.github.com/calvinmccarter/41307983085f9a6acf2a38bc769d0b08 .