Self-efficiency, information anchoring and frequentist calibration in reference-based imputation
If your estimator is not self-efficient, it needs rescuing!
Some background
Carpenter, Roger and Kenward introduced reference-based multiple imputation for trials in 2013. The key idea is that you can reflect MNAR mechanisms in multiple imputation by referring behaviour on one arm of a trial to behaviour on another. Suppose participants who withdraw from data collection in the ‘research’ arm of the trial also withdraw from (ongoing) treatment. Contrast the following:
1. Reference-based: ‘We assume the distribution of unobserved outcomes was comparable to the distribution of observed outcomes on the control arm.’
2. Delta: ‘We assume that unobserved outcomes would have been delta higher than would be predicted under missing-at-random for the research arm.’ [With delta specified on the imputation-model’s scale, e.g. the difference in conditional log-odds for a binary outcome.]
So the reference-based assumption given in (1) is a nice way to qualitatively describe an assumption about what the distribution of missing data would have looked like. The delta assumption describes this quantitatively. Delta (as described) shifts the distribution by a constant, but reference-based can in principle change any aspect of the distribution. Delta can of course be adapted to describe other aspects of a distribution.
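To make the contrast concrete, here is a minimal sketch in Python for a single continuous outcome. It is illustration only: the arm means mu_a and mu_b, the offset delta and the common standard deviation sigma are made-up numbers, and the imputation-model parameters are treated as known, whereas in proper multiple imputation they would be drawn from a posterior for each imputed dataset.

```python
import numpy as np

rng = np.random.default_rng(2023)

# Made-up values for illustration: 'a' is the research arm, 'b' the reference arm.
mu_a, mu_b, sigma = 2.0, 0.0, 1.0  # assumed arm means and common SD
delta = -0.5                       # assumed MNAR offset for the delta approach
n_mis = 50                         # research-arm participants with missing outcome

# Delta-based imputation: take the (here, simplistic) missing-at-random predictive
# distribution for the research arm and shift its location by delta.
y_imp_delta = rng.normal(mu_a + delta, sigma, size=n_mis)

# Reference-based (jump-to-reference) imputation: impute from the reference arm's
# predictive distribution, so any aspect of the distribution can change, not just its location.
y_imp_j2r = rng.normal(mu_b, sigma, size=n_mis)

print(y_imp_delta.mean(), y_imp_j2r.mean())
```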
Carpenter, Roger and Kenward’s paper received a letter to the editor from Seaman, White and Leacy, which pointed out that Rubin’s variance estimator was upwardly biased for the repeated-sampling variance of the MI point estimator. As part of her PhD work, Suzie Cro developed the concept of information-anchored inference, and showed that Rubin’s variance estimator is approximately information-anchored (Cro, Carpenter and Kenward).
Jonathan Bartlett wrote the provocatively-titled paper ‘Reference-Based Multiple Imputation—What is the Right Variance and How to Estimate It’. Loosely, he argues that reference-based assumptions are strong, and the frequentist repeated-sampling variance represents the strength of the assumption one has made. It follows that you want the following property of your variance estimator:
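In symbols (writing $\hat{\theta}$ for the point estimator and $\widehat{V}$ for its variance estimator; this notation is mine, not Jonathan’s):

$$E(\widehat{V}) = \mathrm{Var}(\hat{\theta})$$

under repeated sampling.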
Frequentist calibration, yeeeeah. I think a key sentence capturing Jonathan’s view is in Section 3 of his paper:
If one wishes to perform information anchored sensitivity analyses, then we believe the correct solution is to construct missing data assumptions which differ to those made by the primary analysis but which genuinely neither add nor remove information, with information being judged in terms of the estimator’s true repeated sampling variance.
This is confusing on a professional level
For a long time I found these arguments awkward, because some of the statisticians I respect and trust most in the world are on different sides. My line manager (James Carpenter) is on one side, along with a long-time collaborator I really admire (Suzie Cro) and my former MSc tutor (Mike Kenward); my PhD supervisor (Ian White) is on the other, along with other long-time collaborators I really admire (Jonathan Bartlett and Shaun Seaman). Anyway, I’ve picked one. Let me tell you why.
Self-efficiency has joined the chat
The contribution I have to this discussion is to use so-called self-efficiency to help us decide which is the right inference. The below is lengthy, so to summarise what I’ll say: when a point estimator is self-inefficient, frequentist calibration is how we put the nail in the coffin, and information-anchoring is how we rescue it.
One of the most important papers in multiple imputation (and indeed missing data) theory is Meng 1994. We think of it as the uncongeniality paper, but Meng defines something important in his setup.
The key assumption here is that the analyst's complete-data estimator must be self-efficient, a condition that will be defined shortly, but basically prevents the analyst from using statistically ill-constructed estimators.
So what is self-efficiency?
The RMI variance combining rule is based on the common assumption/intuition that the efficiency of our estimators decreases when we have less data. However, there are estimation procedures that will do the opposite, that is, they can produce more efficient estimators with less data. Self-efficiency is a theoretical formulation for excluding such procedures. When a user, typically unaware of the hidden self-inefficiency of his choice, adopts a self-inefficient complete-data estimation procedure to conduct an RMI inference, the theoretical validity of his inference becomes a complex issue, as we demonstrate.
So, broadly, they regard lack of self-efficiency as sufficient grounds to exclude a procedure for handling missing data.
Two properties under self-inefficiency: while one lives the other cannot, or something
Existing reference-based procedures are self-inefficient. I say existing because in future we might be able to construct reference-based imputation procedures that are self-efficient.
I think it helps to consider three things. Self-efficiency requires that

$$\mathrm{Var}(\hat{\theta}_{\mathrm{RB}}) \ge \mathrm{Var}(\hat{\theta}_{\mathrm{full}}),$$
where the subscripts RB and full denote reference-based and full-data analysis respectively. As I mentioned above, existing reference-based estimators lack this property.
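To see roughly why, here is a toy calculation of my own (a deliberately simple setting, not a general result): a single follow-up time, a common known variance $\sigma^2$, $n_a$ and $n_b$ participants per arm, a proportion $p$ of arm-a participants with observed outcomes, the treatment effect estimated as the difference in arm means, jump-to-reference imputation with a very large number of imputations, and the full-data analysis taken to be the one in which the outcomes that jump-to-reference would impute are actually observed. Because the imputed values average to $\bar{y}_b$, the reference-arm mean,

$$\hat{\theta}_{\mathrm{RB}} \approx p\,(\bar{y}_{a,\mathrm{obs}} - \bar{y}_b)
\quad\Rightarrow\quad
\mathrm{Var}(\hat{\theta}_{\mathrm{RB}}) \approx \frac{p\,\sigma^2}{n_a} + \frac{p^2\sigma^2}{n_b}
\;<\; \frac{\sigma^2}{n_a} + \frac{\sigma^2}{n_b} \;\le\; \mathrm{Var}(\hat{\theta}_{\mathrm{full}})$$

whenever $p < 1$. Making more of the arm-a outcomes missing makes the point estimator’s repeated-sampling variance smaller, which is exactly the behaviour self-efficiency is designed to rule out.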
Randomisation validity (apparently a term due to Neyman but nicely described in Rubin’s Multiple Imputation After 18+ Years) requires that

$$E(\widehat{V}_{\mathrm{RB}}) = \mathrm{Var}(\hat{\theta}_{\mathrm{RB}}),$$

where $\widehat{V}_{\mathrm{RB}}$ denotes the reference-based variance estimator (Rubin’s rules in this setting).
That is, the variance estimator should be unbiased for the variance of the corresponding point estimator. This is how variance estimators are usually assessed, e.g. in simulation studies, and (loosely) leads to confidence intervals with nominal coverage. Rubin’s 1996 paper also discussed a weaker property termed confidence validity, which is that

$$E(\widehat{V}_{\mathrm{RB}}) \ge \mathrm{Var}(\hat{\theta}_{\mathrm{RB}}).$$

This leads to confidence intervals whose coverage is at or above the nominal level.
Information anchoring requires that:

$$E(\widehat{V}_{\mathrm{RB}}) \ge \mathrm{Var}(\hat{\theta}_{\mathrm{full}})$$
(ok, it’s actually a bit sharper than this as information anchoring also talks about the rate of information loss). Notice how it looks similar to the self-efficiency criterion except that, rather than the variance of the estimator, the LHS is the estimator of the variance.
The way I think of this is: if we accept Meng & Romero’s view that self-inefficiency is a reason to throw out a point estimator, then an information-anchored variance estimator is a life-jacket! Unfortunately there is an issue: if a point estimator is self-inefficient but its variance estimator is information-anchored, this implies that the procedure is not randomisation valid (since randomisation validity requires the LHS of information anchoring and the LHS of self-efficiency to be equal, and I’ve just said they are not). The good news is that you do have confidence validity. So from a frequentist perspective, your inference will be conservative.
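In symbols, chaining the displays above:

$$E(\widehat{V}_{\mathrm{RB}}) \;\ge\; \mathrm{Var}(\hat{\theta}_{\mathrm{full}}) \;>\; \mathrm{Var}(\hat{\theta}_{\mathrm{RB}})
\quad\Rightarrow\quad
E(\widehat{V}_{\mathrm{RB}}) \;>\; \mathrm{Var}(\hat{\theta}_{\mathrm{RB}}),$$

where the first inequality is information anchoring, the second is self-inefficiency of the point estimator, and the conclusion is confidence validity without randomisation validity.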
Brief aside on confidence validity vs. randomisation validity
People often treat any departure from nominal coverage as equally bad, but of course it is not: we should not think symmetrically about under- and over-coverage. This may seem blindingly obvious, but lots of people miss it, so. Suppose we are aiming to construct 95% confidence intervals. We have three procedures: one traps the true value 95% of the time, one traps it 100% of the time, and one traps it 50% of the time. If coverage were the only consideration, which would you prefer? The answer is obviously the one with 100% coverage.
The reason to object to the above question is that there are considerations other than coverage. You’re probably thinking that a procedure that covers 100% of the time does so because it has longer intervals than the 95% and 50% confidence procedures. And if they all used the same point estimator, you would be right. But things are not always this simple: in this letter to the editor (funnily enough with Jonathan Bartlett, Suzie Cro, Ian White and James Carpenter, mentioned above on opposing sides!), I ran a simulation study in which a so-called superefficient multiple imputation procedure had 100% coverage but also shorter intervals than a so-called congenial multiple imputation procedure which had ≈95% coverage. Which would you prefer?
Note that this question should be rhetorical, but unfortunately some people earnestly claim to prefer the congenial MI procedure. They need to go back and think about it harder. (See Meng’s rejoinder, in which he discusses the choice between cheap orange slices that more than cover a plate vs. expensive slices that only just cover it, when they taste the same.)
Back to the program
A nice framing for thinking about information anchoring (which is, I think, how Suzie Cro put it in her PhD work) is that we should compare the properties of the reference-based point and variance estimators with the properties we would have if we actually observed the reference-based behaviour. Suppose your RCT has two arms, where Z=a is assignment to the new intervention and Z=b is assignment to standard-of-care. Some participants assigned to arm a never take/receive their assigned treatment, and we are interested in handling this using a treatment policy strategy. With no missing data, estimation is straightforward. Suppose the outcomes of those assigned to arm a who did not take/receive the intervention have a distribution identical to that of the outcomes of participants assigned to arm b. Then the distribution of outcomes on arm a is just a mixture of the distributions for those who did and did not take/receive the assigned intervention, and some function (usually the mean) of this distribution is compared between the arms. Now suppose we did not observe the outcomes of those who did not take/receive the intervention; then we might use jump-to-reference imputation. Information anchoring just says that we should not be able to estimate higher statistical information in the latter case than in the former.
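If it helps to see this framing in numbers, below is a simulation sketch in Python. It is a deliberately crude caricature, not the Carpenter, Roger and Kenward procedure: a single follow-up time, made-up means and variance, a fixed proportion of arm-a non-takers, and jump-to-reference imputation done by drawing from the reference arm’s posterior predictive distribution. Under those assumptions it puts numbers on the three quantities discussed above: the repeated-sampling variance of the treatment effect estimator when the reference-based behaviour is observed, the repeated-sampling variance of the jump-to-reference MI point estimator (which I expect to come out smaller, illustrating the lack of self-efficiency), and the average Rubin’s rules variance (which is what information anchoring speaks to, and which I expect to come out at least as large as the first quantity).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative, made-up setup: outcome ~ N(mu_z, sigma^2); in arm a a fixed
# proportion never take the intervention, and under the reference-based
# assumption their outcomes behave like arm b (standard of care).
n, mu_a, mu_b, sigma = 100, 1.0, 0.0, 1.0
p_take, M, nsim = 0.7, 25, 1000
n_take = int(n * p_take)   # arm-a participants who take the intervention
n_ref = n - n_take         # arm-a participants who do not

def draw_posterior_normal(y, rng):
    """Draw (mu, sigma2) from the standard noninformative posterior for a normal sample."""
    m, s2 = len(y), y.var(ddof=1)
    sigma2 = (m - 1) * s2 / rng.chisquare(m - 1)
    mu = rng.normal(y.mean(), np.sqrt(sigma2 / m))
    return mu, sigma2

est_full, est_j2r, rubins_var = [], [], []
for _ in range(nsim):
    y_b = rng.normal(mu_b, sigma, n)            # reference arm, fully observed
    y_a_take = rng.normal(mu_a, sigma, n_take)  # arm-a takers, observed
    y_a_ref = rng.normal(mu_b, sigma, n_ref)    # arm-a non-takers, behaving like arm b

    # Scenario 1: the reference-based behaviour is actually observed (full data).
    est_full.append(np.mean(np.r_[y_a_take, y_a_ref]) - y_b.mean())

    # Scenario 2: non-takers' outcomes are missing; impute them from the
    # reference arm's posterior predictive distribution (jump-to-reference).
    ests, wvars = [], []
    for _ in range(M):
        mu_d, sigma2_d = draw_posterior_normal(y_b, rng)
        y_a_comp = np.r_[y_a_take, rng.normal(mu_d, np.sqrt(sigma2_d), n_ref)]
        ests.append(y_a_comp.mean() - y_b.mean())
        wvars.append(y_a_comp.var(ddof=1) / n + y_b.var(ddof=1) / n)
    ests = np.array(ests)
    est_j2r.append(ests.mean())
    rubins_var.append(np.mean(wvars) + (1 + 1 / M) * ests.var(ddof=1))  # Rubin's rules

print("Repeated-sampling variance, reference behaviour observed:", np.var(est_full, ddof=1))
print("Repeated-sampling variance, jump-to-reference MI        :", np.var(est_j2r, ddof=1))
print("Mean Rubin's rules variance, jump-to-reference MI       :", np.mean(rubins_var))
```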
TL;DR
Sheesh, this one got long! To summarise my argument in a sentence: If your point-estimator is not self-efficient, it needs rescuing. Information-anchored inference achieves this. Frequentist calibration does not, and is the nail in the coffin.
P.S.
No doubt I’ll edit this post in future. At the time of writing, Suzie Cro, James Carpenter, James Roger and I have submitted a letter to the editor on the above. In this post, I’ve focused on a frequentist perspective, but James R wrote a bit about the (or rather ‘a’) Bayesian perspective in the letter. I’m looking forward to reading the authors’ response because I find this super interesting and would be open to changing my mind.
Comments
Tim this is super helpful. I was not aware of Wood’s paper or yours. I’ll read both. I hope that Chan & Meng performs better but time will tell. I wish they had answered my email. When authors write a paper and then disappear it’s hard to make their research pay off.
Nice post. I’m interested in reading more about the Bayesian perspective. I think we may need to retire Rubin’s rules and go Bayesian because of inaccuracies in confidence coverage that result from assuming that the overall effect estimate has either a normal distribution or a t distribution with degrees of freedom approximated by another of Rubin’s formulas. Multiple imputation results in heavier-than-normal tails of the sampling distribution of an estimator, even when the model is purely Gaussian. If one conducts a separate Bayesian analysis for each completed dataset and uses posterior stacking to get an overall posterior distribution, this problem largely takes care of itself. But concerning the original goal of your analysis, I wonder if formal Bayesian modelling can deal with the particular type of MNAR you’re addressing.