My colleagues and I discussed this issue in these three articles:
[2001] Using conditional distributions for missing-data imputation. Discussion of "Conditionally specified distributions" by B. Arnold et al. Statistical Science 16, 268-269. (Andrew Gelman and T. E. Raghunathan): http://stat.columbia.edu/~gelman/research/published/arnold2.pdf
[2014] Multiple imputation for continuous and categorical data: Comparing joint and conditional approaches. Political Analysis 22, 497-519. (Jonathan Kropko, Ben Goodrich, Andrew Gelman, and Jennifer Hill): http://stat.columbia.edu/~gelman/research/published/MI_manuscript_RR.pdf
[2014] On the stationary distribution of iterative imputations. Biometrika 101, 155-173. (Jingchen Liu, Andrew Gelman, Jennifer Hill, Yu-Sung Su, and Jonathan Kropko): http://stat.columbia.edu/~gelman/research/published/mi_theory9.pdf
Thanks Andrew! I'd somehow missed the Political Analysis and Stat Sci ones. I nearly mentioned your Biometrika paper and Hughes' paper (https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-28), but thought they might be a bit technical.
It's not really clear to me what the MICE procedure gets us, compared to training exclusively on the observed data. I recently proposed a method, UnmaskingTrees, which introduces new missing values (with a masking rate that is itself drawn from Uniform[0, 1]) and then trains XGBoost predictive models to impute them; then it applies these models to the actual missing values. So rather than training on its own model outputs as does MICE, UnmaskingTrees creates new (MCAR) missingness to autoregressively train itself.
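To make that concrete, here is a rough sketch of the masking idea in Python. This is my own simplification for illustration (a single masking draw and plain per-column XGBoost regression, with made-up function names), not the actual UnmaskingTrees implementation, which trains autoregressively and models conditional distributions rather than point predictions:

```python
import numpy as np
from xgboost import XGBRegressor

def mask_and_impute(X, seed=0):
    """X: (n, d) float array with np.nan marking genuinely missing entries."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    n, d = X.shape
    observed = ~np.isnan(X)

    # Add artificial MCAR missingness among the observed entries; the masking
    # rate is itself drawn from Uniform[0, 1].
    rate = rng.uniform(0.0, 1.0)
    masked = X.copy()
    masked[observed & (rng.uniform(size=(n, d)) < rate)] = np.nan

    models = []
    for j in range(d):
        rows = observed[:, j]                      # targets must be truly observed
        X_in = np.delete(masked[rows], j, axis=1)  # other (partially masked) columns as inputs
        models.append(XGBRegressor(n_estimators=100).fit(X_in, X[rows, j]))

    # Use the fitted models to fill in the genuinely missing entries.
    for j in range(d):
        miss = ~observed[:, j]
        if miss.any():
            X[miss, j] = models[j].predict(np.delete(masked[miss], j, axis=1))
    return X
```

The key point is that the models are only ever fit to genuinely observed targets; the artificial masking merely degrades the inputs, so the method never trains on its own imputations.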
As measured on downstream prediction tasks, UnmaskingTrees (https://github.com/calvinmccarter/unmasking-trees) does better (https://arxiv.org/pdf/2407.05593), though it performs worse on intrinsic performance metrics. Anecdotally, MICE adds more noise to the imputed values, probably due to the random initialization of the missing values, while UnmaskingTrees is more "conservative", especially with a high missingness rate: https://gist.github.com/calvinmccarter/41307983085f9a6acf2a38bc769d0b08 .
Thanks Calvin. To clear up what multiple imputation (more generally than just MICE) gets us, Rubin’s 1996 paper ‘Multiple Imputation After 18+ Years’ is very instructive (https://doi.org/10.2307/2291635).
It points out that the procedure is designed for a well-defined estimand, where we aim to recover, as far as possible, the full-data inference, and wish to accurately account for uncertainty, including that due to missing data. This explains why MI includes what people sometimes describe as unnecessary noise. Rubin wrote:
‘The lesson is simple: Judging the quality of missing data procedures by their ability to recreate the individual missing values (according to hit-rate, mean square error, etc.) does not lead to choosing procedures that result in valid inference, which is our objective.’
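To make "valid inference" concrete: with m imputed datasets, the completed-data estimates and standard errors are pooled with Rubin's rules, so the between-imputation spread is exactly what carries the missing-data uncertainty into the reported standard error. A minimal sketch (the formulas are standard; the code is my own):

```python
import numpy as np

def pool(estimates, variances):
    """estimates, variances: length-m arrays of completed-data point
    estimates and their squared standard errors."""
    m = len(estimates)
    qbar = np.mean(estimates)       # pooled point estimate
    W = np.mean(variances)          # within-imputation variance
    B = np.var(estimates, ddof=1)   # between-imputation variance
    T = W + (1 + 1 / m) * B         # total variance
    return qbar, np.sqrt(T)
```

So imputations that look like "unnecessary noise" at the level of individual values are precisely what makes B, and hence the pooled standard error, reflect the uncertainty due to missingness.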
From your above-linked arXiv paper (I enjoyed the bit about hyperparameter tuning being no fun!), it looks as though you are judging the quality of UnmaskingTrees according to predictive measures, so you are coming at things from a very different position than MI.
By the way, your description of UnmaskingTrees sounds very similar in spirit to Efron's full-mechanism bootstrap (https://doi.org/10.2307/2290846; there is a description and evaluation here: https://doi.org/10.1177/0962280214526216).
Thanks -- this is really helpful context! The imputation benchmark [Jolicoeur-Martineau et al., 2024] measures inferential performance (percent bias and confidence-interval coverage rate) after performing multiple imputation (k=10) with each method, and also the diversity of the multiple imputations (Mean Absolute Deviation around the median/mode). What I find interesting is that the methods that rank better on diversity do not necessarily rank better on inferential performance: https://gist.github.com/calvinmccarter/00f0ee0c1a18f0e1b63e18d0db4e5872 .
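For reference, my rough understanding of how those three quantities are computed is sketched below. This is my own illustration, not the benchmark's actual code: the names are made up, the true parameter is assumed known because the data are simulated, and only the continuous-feature (median) case is shown:

```python
import numpy as np

def metrics(estimates, ci_lowers, ci_uppers, imputations, beta_true):
    """estimates, ci_lowers, ci_uppers: pooled results over simulation runs;
    imputations: (k, n_missing) array of the k imputed values per missing cell."""
    pct_bias = 100 * np.abs(np.mean(estimates) - beta_true) / np.abs(beta_true)
    coverage = np.mean((ci_lowers <= beta_true) & (beta_true <= ci_uppers))
    med = np.median(imputations, axis=0)
    mad = np.mean(np.abs(imputations - med))   # diversity of the k imputations
    return pct_bias, coverage, mad
```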
In particular, MICE-Forest and MissForest are quite similar methods, except that MICE-Forest produces more diverse multiple imputations. And yet MICE-Forest has a worse coverage rate and the same percent bias. This could of course be due to shortcomings in how the benchmark measures diversity and inferential performance, but it still makes me "Bayesian update" towards the belief that chained equations introduce bad diversity that harms inference, rather than good diversity that benefits inference.
Also, thanks for sharing the Efron paper! Indeed, it is quite similar, in that its two-way linear model, like XGBoost in UnmaskingTrees, allows one to avoid discarding samples with missing features, and it also constructs models by applying the missingness process only to the observed data. On the other hand, its two-way linear model (and, more generally, any regression model fitted with MSE loss) will fail to model the conditional predictive distributions (especially multi-modal ones), even with bootstrapping.
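A toy example of that last point (my own, not from either paper): if y given x is bimodal, a least-squares fit predicts the conditional mean, a value the data never actually take, and bootstrapping only adds estimation noise around that mean rather than recovering the two modes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(size=(1000, 1))
y = rng.choice([-1.0, 1.0], size=1000)   # y | x is bimodal with mean ~0

preds = []
for _ in range(200):                     # bootstrap the regression fit
    idx = rng.integers(0, len(y), size=len(y))
    preds.append(LinearRegression().fit(x[idx], y[idx]).predict([[0.5]])[0])

print(np.mean(preds), np.std(preds))     # ~0.0 with tiny spread: neither mode
```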
Update: this paper seems to be the/a culprit
https://doi.org/10.1080/10629360600810434