Hey, it’s been a while since I posted! I actually have 10 posts in draft from the last few weeks but haven’t had the motivation to finish them off. Here’s one that’s finished enough.
I love a good bugbear. A good bugbear is a source of irritation that (sometimes) sufficiently energises me to do something. Bugbears are often about common misunderstandings. For instance, our paper defining what sensitivity analysis means came from annoyance that people interpreted the term to mean ‘literally any other analysis I can think of’. Used that way, the term kept introducing false reassurance or confusion.
I did say good bugbear. Twice. There are bad bugbears. These are the ones that irritate you but either you have no outlet, or putting it right makes no practical difference to anything, so they are just a pointless energy drain.
Why is MICE not a Gibbs sampler?
Here is a bad bugbear I have: people keep describing the multivariate multiple-imputation procedure known as MICE (Multivariate Imputation by Chained Equations) as ‘a Gibbs sampler’. Unless I’ve misunderstood what constitutes a Gibbs sampler – which is possible and I guess we’ll find out in the comments – it’s not.
This classic paper on MICE describes it as follows:
The first variable with missing values, x1 say, is regressed on all other variables x2, …, xk, restricted to individuals with the observed x1. Missing values in x1 are replaced by simulated draws from the corresponding posterior predictive distribution of x1. Then, the next variable with missing values, x2 say, is regressed on all other variables x1, x3, …, xk, restricted to individuals with the observed x2, and using the imputed values of x1. Again, missing values in x2 are replaced by draws from the posterior predictive distribution of x2. The process is repeated for all other variables with missing values in turn: this is called a cycle.
– White, Royston & Wood (2010)
This sounds quite like a Gibbs sampler, doesn’t it? We fix the current imputed values of the other x’s and impute this one, visiting each of x1, …, xk in turn. So?
In the above description, note that the regression model is restricted to individuals with the observed x each time. So it’s not conditioning on the full set of current values the way a Gibbs sampler would! As you’ll see in that paper, ‘proper’ imputation involves first taking a draw of the parameters of the imputation model, and then drawing the missing values in x, given the current parameter draw, from the posterior predictive distribution. In MICE, fitting of the imputation model is only done in individuals with observed x. Previously imputed values of this x are not included when the imputation model is fitted.
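To make that concrete, here’s a minimal sketch of the above in Python – mine, not any package’s actual code – using the standard normal-linear-model posterior for the ‘proper’ parameter draws. Function names are made up and there’s no error handling:

```python
import numpy as np

rng = np.random.default_rng(2024)

def proper_impute_step(X, miss, j):
    """One 'proper' imputation step for column j (illustrative sketch).

    The regression is fitted only on rows where x_j was observed (the
    MICE short-cut): previously imputed values of x_j are discarded,
    not conditioned on. Parameters are then drawn from their posterior
    and the missing x_j from the posterior predictive distribution.
    """
    obs = ~miss[:, j]
    mis = miss[:, j]
    others = np.delete(np.arange(X.shape[1]), j)
    A = np.column_stack([np.ones(obs.sum()), X[np.ix_(obs, others)]])
    y = X[obs, j]

    # Least-squares fit using only the observed-x_j rows.
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta_hat
    df = obs.sum() - A.shape[1]

    # Draw sigma^2 (scaled inverse chi-squared), then beta | sigma^2,
    # then the missing values from the posterior predictive.
    sigma2 = resid @ resid / rng.chisquare(df)
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(A.T @ A))
    A_mis = np.column_stack([np.ones(mis.sum()), X[np.ix_(mis, others)]])
    X[mis, j] = A_mis @ beta + rng.normal(0.0, np.sqrt(sigma2), mis.sum())

def mice(X, n_cycles=5):
    """Chained-equations imputation of X, with np.nan marking missingness."""
    X = X.copy()
    miss = np.isnan(X)
    for j in range(X.shape[1]):       # crude starting values: column means
        X[miss[:, j], j] = np.nanmean(X[:, j])
    for _ in range(n_cycles):         # a handful of cycles, not thousands
        for j in range(X.shape[1]):
            if miss[:, j].any():
                proper_impute_step(X, miss, j)
    return X
```

Note the fit uses only the observed-x_j rows; a true Gibbs update for x_j would instead condition on all current values, including the previously imputed x_j.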
Ultimately I don’t think this point really matters, so it’s probably a bad bugbear. Here’s the small way in which it does matter: with a Gibbs sampler, you’d expect to need lots of iterations of this process. With MICE, you don’t. Software defaults are, from memory, 5 (R package {mice}), 10 (Stata’s {mi impute chained}) or 20 (SAS’s {proc mi}, fcs statement). This number is surprisingly low because MICE is not a Gibbs sampler: it takes a short-cut that reduces what would otherwise be an unhelpful autocorrelation, which is what would make more cycles necessary.
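As an aside, scikit-learn’s MICE-inspired IterativeImputer has the same flavour of default: max_iter=10 cycles. A quick usage sketch (toy data made up; sample_posterior=True makes it draw imputations rather than plug in predicted means):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [np.nan, 8.0, 9.0],
              [10.0, 11.0, np.nan]])

# max_iter is the number of cycles; the default of 10 is in the same
# ballpark as the MICE software defaults above.
imputer = IterativeImputer(estimator=BayesianRidge(), sample_posterior=True,
                           max_iter=10, random_state=0)
X_completed = imputer.fit_transform(X)
```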
A couple of people have blamed van Buuren’s book for describing MICE as a Gibbs sampler, but I just checked and it doesn’t seem to. It does say ‘if the conditionals are compatible’, but I don’t think that’s saying MICE is a Gibbs sampler; it’s saying ‘MICE behaves like a Gibbs sampler under certain conditions’.
By the way, MICE is a type of fully-conditional specification, and it’s not the only one. At one point Jonathan Bartlett mentioned that his smcfcs approach does not do it the way MICE does (i.e. it keeps the currently-imputed values of x when fitting the imputation model for x).
Bad bugbears
OK, here are a few bad bugbears:
Minor typographic or grammatical things. I get slight irritation when people use a hyphen instead of an en-dash in case–control (no, it’s not a compound adjective and could equivalently be named a control–case study), or Box–Cox or Kaplan–Meier (no, those are not double-barrelled surnames). This is classic bugbear material because typographic conventions a) take time to learn and b) convey subtle information. Counterpoint: people who don’t know what subtle information is conveyed aren’t hurt by not knowing (they’re not thinking Kaplan and Meier had a baby and gave it a double-barrelled surname). I’m talking to myself here, not accusing you, and obviously it is more important in some professions than others.
Being picky about a particular referencing style in someone’s draft instead of commenting on substance. ‘This isn’t Vancouver, I want Vancouver.’ Who cares if someone put the publication year after the authors instead of after the journal name or whatever! I think this is worse than typographic bugbears because it’s just making a comment for the sake of it: it took you time to learn particular styles, and once you know, you know, but ultimately getting it perfect doesn’t change the value of the information (in the words of Principal Skinner, ‘prove me wrong, kids’).
Other fields’ terminology or ways of looking at things. Think about the way economists and epidemiologists use the term selection bias differently. Also ‘logistic regression is a classification algorithm’ vs. ‘logistic regression is regression’. Sure, I think of logistic regression as a regression model because I was taught about it as a generalised linear model, but I’m not sure I can bring myself to care if other people regard it as a classification algorithm. Other than not properly crediting statisticians, how much does it actually matter? Statisticians don’t get credited all the time (see this example from Maarten van Smeden)!
Finding new words annoying. Like learnings (though it reminds me of that Calvin & Hobbes strip that ends with ‘verbing weirds language’ – credit to Matt Sydes for pointing that out) and corporate-speak words. You might as well just get used to new terms – especially if you have to use them (‘List four learnings’). Hey, they might go away!
What are yours?
My colleagues and I discussed this issue in these three articles:
Gelman, A. and Raghunathan, T. E. (2001). Using conditional distributions for missing-data imputation. Discussion of "Conditionally specified distributions" by B. Arnold et al. Statistical Science 16, 268–269. http://stat.columbia.edu/~gelman/research/published/arnold2.pdf
Kropko, J., Goodrich, B., Gelman, A., and Hill, J. (2014). Multiple imputation for continuous and categorical data: Comparing joint and conditional approaches. Political Analysis 22, 497–519. http://stat.columbia.edu/~gelman/research/published/MI_manuscript_RR.pdf
Liu, J., Gelman, A., Hill, J., Su, Y.-S., and Kropko, J. (2014). On the stationary distribution of iterative imputations. Biometrika 101, 155–173. http://stat.columbia.edu/~gelman/research/published/mi_theory9.pdf
It's not really clear to me what the MICE procedure gets us, compared to training exclusively on the observed data. I recently proposed a method, UnmaskingTrees, which introduces new missing values (with a masking rate that is itself drawn from Uniform[0, 1]) and then trains XGBoost predictive models to impute them; then it applies these models to the actual missing values. So rather than training on its own model outputs as does MICE, UnmaskingTrees creates new (MCAR) missingness to autoregressively train itself.
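To be concrete, here's a minimal sketch of that masking idea (my illustrative reading, not the actual UnmaskingTrees implementation; function names are made up), relying on XGBoost's native handling of NaN features:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

def fit_column_models(X, n_trees=100):
    """Train per-column imputers on observed values only (sketch)."""
    n, k = X.shape
    # Introduce extra MCAR missingness, with the masking rate itself
    # drawn from Uniform[0, 1], so each model learns to predict from
    # partially missing feature rows (XGBoost accepts NaN features).
    X_masked = X.copy()
    rate = rng.uniform(0.0, 1.0)
    X_masked[rng.random((n, k)) < rate] = np.nan
    models = []
    for j in range(k):
        obs = ~np.isnan(X[:, j])       # rows with ground truth for column j
        features = np.delete(X_masked, j, axis=1)
        model = xgb.XGBRegressor(n_estimators=n_trees)
        model.fit(features[obs], X[obs, j])  # targets are real values, never model outputs
        models.append(model)
    return models

def apply_models(X, models):
    """Fill in the genuinely missing entries using the trained models."""
    X_imp = X.copy()
    for j, model in enumerate(models):
        mis = np.isnan(X[:, j])
        if mis.any():
            X_imp[mis, j] = model.predict(np.delete(X, j, axis=1)[mis])
    return X_imp
```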
As measured on downstream prediction tasks, UnmaskingTrees (https://github.com/calvinmccarter/unmasking-trees) does better (https://arxiv.org/pdf/2407.05593), though it performs worse on intrinsic performance metrics. Anecdotally, MICE adds more noise to the imputed values, probably due to the random initialization of the missing values, while UnmaskingTrees is more "conservative", especially with a high missingness rate: https://gist.github.com/calvinmccarter/41307983085f9a6acf2a38bc769d0b08 .