The marginality principle revisited
It’s not that the principle doesn’t make sense; it’s that the notion of ‘higher-order’ and ‘lower-order’ terms often doesn’t, and then the principle is a bit pointless.
This post is about a paper with Maarten van Smeden and Tra My Pham that has just appeared in Biometrical Journal, ‘The marginality principle revisited: Should “higher-order” terms always be accompanied by “lower-order” terms in regression analyses?’ (it’s open access, obvs).
When are you having fun at work? For me, I enjoy the moments where you feel you’ve understood something in a way that you haven’t before. Better is when you share the moment with someone else and they have the same ‘o-oh’ moment. Best is when you realise that lots of people will have the same reaction, so they might also enjoy the insight. This doesn’t happen a lot to me, obvs, and when it does, you often discover that some people have understood all along and think it’s very obvious.
With this in mind, a couple of years back Darren Dahly (who btw has an excellent blog on here) and I were discussing ratio variables, and I think I shared one of these moments with a couple of people (the co-authors at least). Suppose we measure variables A and B and derive their ratio R=A/B. Body mass index probably springs to mind but there are loads floating around. Total cholesterol to HDL. Heart rate. Miles-per-gallon (sometimes its reciprocal, gallons-per-mile). Darren relayed that a well-known researcher recommends never including R as a covariate in regression models because R is a higher-order term. They instead recommended that we should include the two components, A and B, from which R was derived. I wouldn’t have thought much about it, but for three reasons:
1. We often read ideas about simple analyses and parsimonious models being ‘good’. I sympathise with keeping things simple to make sure you understand them, but don’t think there is anything inherently justifiable about simplicity.
2. I’d done some work on ratios before. Some criticisms seemed reasonable, others didn’t, but this seems to have led to broad advice to never use ratios.
3. Most importantly (or pettily?), I have never once agreed with anything this particular well-known researcher says.
Because of (3), the moment Darren mentioned this advice, I knew it was suspicious.
Write the ratio R=A/B, then we can rearrange this as B=A/R. Hang on… R was the ratio, but now B looks like one… weird. If we write it this way, is the well-known researcher’s advice that we should include the separate components of B (A and R) as the covariates in the model? That’s changed, because previously they were unhappy with R. Finally we can write A=RB. Now A is the interaction of R and B? This made me think of the marginality principle, which tends to be attributed to Nelder’s 1977 RSS read paper A reformulation of linear models. Nelder bemoaned ‘the neglect of marginality’ as (roughly) omitting main effects when testing and interpreting interactions. So if A is an interaction, Nelder would have said don’t include A as a covariate without R and B. You can see that, if we write any of R, A, or B as a function of the other two (so A/B, RB, or A/R), it appears to be ‘higher-order’.
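To make the symmetry concrete, here’s a minimal sketch in Python (toy numbers, nothing from the paper) confirming that each of R, A and B is a simple function of the other two, so any of the three can be made to look like the ‘derived’ term:

```python
import numpy as np

# Toy illustration (made-up numbers): measure A and B, derive R = A / B.
rng = np.random.default_rng(0)
A = rng.uniform(1, 10, size=5)
B = rng.uniform(1, 10, size=5)
R = A / B

# Each variable is a simple function of the other two, so any of the three
# can be cast as the 'derived'/'higher-order'-looking term:
print(np.allclose(R, A / B))  # R from A and B (a ratio)
print(np.allclose(B, A / R))  # B from A and R (also a ratio)
print(np.allclose(A, R * B))  # A from R and B (a product, i.e. interaction-like)
```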
Apart from ratios, I’ve seen the principle used for polynomial models (if you include X², you must include X; if you include X³, you must include both X and X², and so on) and for interactions.
For models with polynomial terms, the principle didn’t really make sense to me because I’ve spent a while with fractional polynomial models. Rather than using powers of X in order (1, before 2, before 3), a first-degree fractional polynomial picks one power from a discrete set, typically p={–2, –1, –0.5, 0, 0.5, 1, 2, 3} (p=0 corresponds to lnX… if you want to know why, I think it’s in Box and Tidwell’s paper). The power that gets picked is the one that maximises the likelihood (or some other criterion). Essentially this approach treats the power p as a parameter and estimates the βs and p jointly, so the linear predictor is:
β₀ + β₁Xᵖ
There is nothing special about p=1 here, just as there is nothing special about β₁=1, but the marginality principle would say there is. This is more explicit in (related) Box–Tidwell models, where p is free, rather than chosen from some pre-defined set (btw I think the discrete set essentially works as a form of regularisation in FP but it does decalibrate the tests used in FP by not spending a whole degree of freedom… I digress).
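For anyone who wants to see the idea rather than take my word for it, here’s a minimal sketch of an FP1-style fit under a normal-errors model: just grid-search the usual power set and keep whichever power fits best. This is an illustration under those assumptions, not the mfp software, and the data and names are made up:

```python
import numpy as np

# FP1 sketch: for each candidate power p, transform X (p = 0 meaning lnX),
# fit beta0 + beta1 * X^p by least squares, and keep the power with the
# smallest residual sum of squares (equivalent to maximising the normal
# likelihood here). So p is effectively estimated alongside the betas.
POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]

def fp1_fit(x, y):
    best = None
    for p in POWERS:
        xp = np.log(x) if p == 0 else x ** p
        X = np.column_stack([np.ones_like(xp), xp])
        beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(rss[0]) if rss.size else float(np.sum((y - X @ beta) ** 2))
        if best is None or rss < best[2]:
            best = (p, beta, rss)
    return best  # (chosen power p, [beta0, beta1], RSS)

# Toy data where the 'true' power is -1; x kept positive so all powers are defined
rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5, size=200)
y = 1 + 2 / x + rng.normal(scale=0.2, size=200)
p, beta, rss = fp1_fit(x, y)
print(p, beta)  # with these toy data, p = -1 should win
```

The point is that p=1 falls out of the fit like any other power; nothing in the machinery treats it as special.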
Finally, let’s mention interactions in factorial designs. Nelder’s original piece was an RSS read paper. Lindley gently commented that if you recode the ‘new’ and ‘control’ treatments then the ‘marginal’ term switches. Nelder was pretty abrupt in his response (he said this does not ‘make any practical sense’). I’m not sure that Nelder appreciated Lindley’s point, which was extremely practical. Perhaps it was Kempthorne’s aggressive commentary that made Nelder uncharitable to other discussants.
Suppose we have a factorial trial with two treatments, denoted Z1 and Z2, each with two levels. Sometimes the definition of ‘control’ and ‘experimental’ treatment is arbitrary. An example is given in Dunn, Copas and Brocklehurst’s paper: the CAP-IT trial assessed the dose and duration of amoxicillin treatment for children with community-acquired pneumonia. The doses under study were 125mg and 250mg. The former was used in clinical practice and the latter specified in the British National Formulary. So suppose our factorial trial had some other treatment Z1, and this dose ‘treatment’ was Z2. We could encode 125mg as Z2=0 and 250mg as Z2=1, or 125mg as Z2=1 and 250mg as Z2=0. Write out the interaction Z1Z2 for each encoding (there’s a small sketch of this below), and you will see that which term is the ‘interaction’ and which a ‘main effect’ depends on the encoding you used.
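Here’s a tiny sketch of that exercise in Python (the 0/1 codings and dose labels are just for orientation):

```python
import numpy as np

# The four cells of a 2x2 factorial; rows are (Z1, Z2) = (0,0), (0,1), (1,0), (1,1)
Z1 = np.array([0, 0, 1, 1])
Z2 = np.array([0, 1, 0, 1])   # encoding 1: 125mg -> 0, 250mg -> 1
Z2_star = 1 - Z2              # encoding 2: 125mg -> 1, 250mg -> 0

print("interaction under encoding 1:", Z1 * Z2)       # [0 0 0 1]
print("interaction under encoding 2:", Z1 * Z2_star)  # [0 0 1 0]

# The 'interaction' column under encoding 1 is a mix of a 'main effect' column
# and the 'interaction' column under encoding 2: Z1*Z2 = Z1 - Z1*Z2_star
print(np.array_equal(Z1 * Z2, Z1 - Z1 * Z2_star))     # True
```

So a term the marginality principle labels ‘higher-order’ under one encoding is a combination of a ‘lower-order’ term and a ‘higher-order’ term under the other.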
It shouldn’t be controversial to say that the form of a model we use should not depend on the arbitrary way we happened to encode the treatments!
In summary, the marginality principle talks about ‘lower-order’ and ‘higher-order’ terms in regression models. Our paper questioned whether it makes sense to designate a nonlinear function of a measured variable as ‘higher-order’. This doesn’t even need us to think about a particular model, just measurements.
Some postscripts:
We used ratios as a way to get the ideas going, and had a subsequent discussion with Peter Tennant, who has thought a lot harder about ratios than most of us. He is generally against using ratios as covariates for causal estimands, and I agree with most of his views. I should be clear that we’re not advocating for everyone to use ratios and ignore the components in their models. Rather we were making the point that it’s not ‘being a ratio’ that tells you whether something can or cannot go in a regression model; it’s an understanding of the problem at hand, and this is what Peter uses in his arguments.
I said above that a moment of understanding is often something some people have understood all along but others haven’t. This seems to be true with the marginality principle. Are you ever the person on the other side who’s thinking ‘Err, yes this is super obvious’? I think people who understand the curse-of-dimensionality would have realised this all along. Someone on twitter (possibly a bot using a statistician’s name) spent a long time replying to our tweets and every other reply/quote to say how much they hated it; not because they disagreed but because they thought everyone already understands all this. Great… or, rather, wrong! If they’d been less of a $*£#& about it I might have pointed them to literally any factorial trial in medical research.
Interestingly, I think the marginality principle has contributed to some statisticians’ suspicions about machine learning methods. E.g. tree-based methods involve splits on subsets of the data, meaning that they are effectively the antithesis of the marginality principle as they cannot respect marginality except by luck.