Discussion about this post

Andrew Gelman:

My colleagues and I discussed this issue in these three articles:

[2001] Using conditional distributions for missing-data imputation. Discussion of "Conditionally specified distributions" by B. Arnold et al. Statistical Science 3, 268-269. (Andrew Gelman and T. E. Raghunathan): http://stat.columbia.edu/~gelman/research/published/arnold2.pdf

[2014] Multiple imputation for continuous and categorical data: Comparing joint and conditional approaches. Political Analysis 22, 497-519. (Jonathan Kropko, Ben Goodrich, Andrew Gelman, and Jennifer Hill): http://stat.columbia.edu/~gelman/research/published/MI_manuscript_RR.pdf

[2014] On the stationary distribution of iterative imputations. Biometrika 1, 155-173. (Jingchen Liu, Andrew Gelman, Jennifer Hill, Yu-Sung Su, and Jonathan Kropko): http://stat.columbia.edu/~gelman/research/published/mi_theory9.pdf

Calvin McCarter:

It's not really clear to me what the MICE procedure gets us, compared to training exclusively on the observed data. I recently proposed a method, UnmaskingTrees, which introduces new missing values (with a masking rate that is itself drawn from Uniform[0, 1]) and then trains XGBoost predictive models to impute them; then it applies these models to the actual missing values. So rather than training on its own model outputs as does MICE, UnmaskingTrees creates new (MCAR) missingness to autoregressively train itself.
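To make the masking idea concrete, here is a minimal Python sketch of "introduce artificial missingness on the observed data, train a model to fill it in, then apply it to the real gaps." This is not the UnmaskingTrees implementation: it uses scikit-learn's HistGradientBoostingRegressor (which accepts NaN features) as a stand-in for XGBoost, imputes each column in a single pass, and omits the autoregressive ordering and other details of the actual method; the function name and parameters are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def impute_by_masking(X, random_state=0):
    """Toy illustration of mask-then-train imputation (not UnmaskingTrees itself).

    For each column, rows where that column is observed become training data.
    Extra artificial missingness is injected into the features, with a masking
    rate drawn from Uniform[0, 1], so the model learns to predict the column
    from partially observed rows. The trained model is then applied to the
    rows where the column is truly missing.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    X_imputed = X.copy()

    for j in range(X.shape[1]):
        observed = ~np.isnan(X[:, j])
        missing = ~observed
        if not missing.any() or not observed.any():
            continue

        # Training rows: those where column j is observed.
        X_train = X[observed].copy()
        y_train = X_train[:, j].copy()
        features = np.delete(X_train, j, axis=1)

        # Introduce new (MCAR) missingness in the features at a rate
        # drawn from Uniform[0, 1]; masked entries stay NaN because the
        # gradient-boosting model handles NaN inputs natively.
        mask_rate = rng.uniform(0.0, 1.0)
        artificial_mask = rng.uniform(size=features.shape) < mask_rate
        features[artificial_mask] = np.nan

        model = HistGradientBoostingRegressor(random_state=random_state)
        model.fit(features, y_train)

        # Apply the trained model to the rows where column j is actually missing.
        X_imputed[missing, j] = model.predict(np.delete(X[missing], j, axis=1))

    return X_imputed
```

One design point the sketch illustrates: because the boosted-tree model accepts NaN inputs, the artificially masked entries can simply be left as NaN rather than being initialized with random fill-ins, which is the contrast with MICE's random initialization mentioned below.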

As measured on downstream prediction tasks, UnmaskingTrees (https://github.com/calvinmccarter/unmasking-trees) does better (https://arxiv.org/pdf/2407.05593), though it performs worse on intrinsic performance metrics. Anecdotally, MICE adds more noise to the imputed values, probably due to the random initialization of the missing values, while UnmaskingTrees is more "conservative", especially with a high missingness rate: https://gist.github.com/calvinmccarter/41307983085f9a6acf2a38bc769d0b08.
