Why "controlling for a variable" doesn't (usually) work
It's just adding a variable to a regression.
I’ve always seen cathedrals as presenting a kind of implicit argument to atheists. Something like: God must exist, because otherwise it would have been insane for people to build this:
This is how I feel about the idea of “controlling” for stuff in statistics. You read things like this all the time:
X is associated with Y, controlling for Z.
These analyses are everywhere in contexts where what we care about is if X causes Y. Everyone knows correlation doesn’t imply causation. But here we controlled for Z. So maybe it’s OK? Everyone seems to act like it’s OK. If these analyses didn’t work, then everybody would be crazy. So they’ve got to work, right?
What does it mean to “control” for something? Papers love to use scary terms like Cox Proportional-Hazards Model and Linear Structural Causal Equations. These hide the embarrassing reality that all it means to control for something is to add a variable to a regression.
Say you want to know if drinking alcohol makes people fatter. So you go find some people and measure how much they weigh and how much they drink.
(Of course, you can’t conclude anything from just five data points—I’m just trying to keep things simple.)
If you squint, the trend seems to be that heavier drinkers are a bit lighter. You can confirm this by fitting a regression: You look for constants a and b such that
(weight) ≈ a + b × (alcohol).
You want the best approximation, i.e. the constants that minimize the average squared error. If you do this for the above dataset, you get a fit of
(weight) ≈ 89.6 - 4.5 × (alcohol).
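If you want to see the mechanics, here's a minimal sketch in Python. The post's actual five data points aren't reproduced here, so the numbers below are invented; only the shape of the calculation matters.

```python
import numpy as np

# Invented data, standing in for the five measurements above.
alcohol = np.array([0.0, 1.0, 2.0, 3.0, 4.0])        # drinks per day
weight  = np.array([89.0, 80.8, 79.5, 69.0, 67.7])   # kg

# Least-squares fit of weight ≈ a + b × alcohol.
b, a = np.polyfit(alcohol, weight, deg=1)
print(f"weight ≈ {a:.1f} {b:+.1f} × alcohol")
# The slope comes out negative with these numbers, echoing the fit above.
```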
The constant in front of alcohol is negative. In statistics jargon, you can say:
Weight is negatively associated with alcohol.
The naive interpretation of this would be that drinking makes you thinner. Which would be very weird if true.
But then you have an idea: It could be that the same people who drink less also tend to eat more. So you call up everyone and ask how much they eat in a day:
If you fit a regression to this dataset, you’ll get
(weight) ≈ 20.1 + 0.3 × (alcohol) + 24.0 × (calories).
Now the constant in front of alcohol is positive. Or, in statistics jargon:
Weight is positively associated with alcohol, controlling for calories.
That’s it. You “controlled” for calories. In almost all cases, this is what it means to control for something. The details vary a bit depending on the analysis, but it’s ultimately just sticking a new variable into a linear approximation somewhere.
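In code, the difference between the two analyses is exactly one column. A sketch, continuing with the invented numbers from before (calories in thousands of kcal per day):

```python
import numpy as np

alcohol  = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
calories = np.array([3.0, 2.6, 2.5, 2.0, 1.9])
weight   = np.array([89.0, 80.8, 79.5, 69.0, 67.7])

# "Controlling for calories" just means adding a calories column
# to the design matrix before solving the least-squares problem.
X = np.column_stack([np.ones_like(alcohol), alcohol, calories])
(a, b, c), *_ = np.linalg.lstsq(X, weight, rcond=None)
print(f"weight ≈ {a:.1f} {b:+.1f} × alcohol {c:+.1f} × calories")
# With these invented numbers, the alcohol coefficient flips from
# negative (uncontrolled) to positive (controlled), as in the post.
```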
So, does this work? Is this a good way to figure out if alcohol makes you fatter?
Not really, no.
Problem 1: Bro come on
To start, just use your intuition. Does it seem like you can uncover causal relationships by doing little regressions?
Think about the previous dataset. There are many possible relationships between weight, alcohol, and calories. Maybe being heavier makes people drink more. Maybe drinking makes people eat more. Maybe alcohol and food directly do nothing, but they’re correlated with exercise, which is actually what determined how fat people are. Maybe the true relationship is nonlinear, and all the variables influence each other in feedback loops. Any of these relationships could produce the previous data.
People seem to think that there’s some secret math that means you don’t need to worry about this stuff. Otherwise, what is everyone doing with these analyses?
There is no secret math. The biggest mistake people make with statistics is to distrust their intuition. In reality, once you do all the math, the things that seemed like they’d be problems are in fact problems. If anything, the math just turns up more things to worry about.
But how is this possible? After all, these methods have careful formal justifications. With proofs! And Borel spaces! Surely those aren’t wrong?
They are not wrong. But you must understand how these proofs get around the difficult problems: Roughly speaking, they assume the problems don’t exist.
Problem 2: Reverse causality
Take a group of aliens that only vary in how much they drink and how fat they are. Nothing else matters: They don’t eat or exercise, they all have the same genes, etc. Now suppose drinking and weight are positively associated. There are three possible explanations:
First, maybe drinking makes them heavier. (Perhaps because alcohol has calories.) We can draw this as
ALCOHOL → WEIGHT.
Second, maybe being heavier causes them to drink more. (Perhaps because there’s a hormonal change in heavier people that makes alcohol taste better.) We can draw this as
ALCOHOL ← WEIGHT.
Third, maybe drinking and weight are in some kind of complex feedback loop. (Perhaps because both of the above things happen.) We can draw this as
ALCOHOL ↔ WEIGHT.
How do you use observational data to figure out which explanation is right?
The answer is easy: You don’t. Any data that comes from one of these causal models could also come from any of the others. The difference is completely invisible. The only way to tell is through intervention—make people drink more and see what happens.
To estimate causal relationships from observational data, you start by making assumptions about which variables cause changes in which others, and then use data to figure out the strengths of the interactions. You assume which direction the arrows are pointing, and then you figure out how big the arrows are.
I know that seems odd—isn’t the direction of the arrows the central question? You’d think that there’s some caveat here, but there just isn’t. People assume X causes Y and then use that assumption to figure out how strongly X causes Y. If that assumption is wrong, the analysis will happily give you a wrong answer.
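If that seems hard to believe, it's easy to check by simulation. Here's a sketch with invented coefficients: one world where alcohol drives weight, another where weight drives alcohol, tuned so they generate the same joint distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# World 1: ALCOHOL → WEIGHT.
a1 = rng.normal(0, 1, n)
w1 = 0.5 * a1 + rng.normal(0, np.sqrt(0.75), n)

# World 2: WEIGHT → ALCOHOL, with the noise scaled to match World 1.
w2 = rng.normal(0, 1, n)
a2 = 0.5 * w2 + rng.normal(0, np.sqrt(0.75), n)

# Same means, variances, and correlation (up to sampling error), so no
# analysis of observational data can distinguish the two worlds.
print(np.corrcoef(a1, w1)[0, 1])   # ≈ 0.5
print(np.corrcoef(a2, w2)[0, 1])   # ≈ 0.5
```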
Problem 3: Dependent features
If you say that “drinking makes people fatter”, what do you mean by that? Here are two options:
If people drank more but didn’t change how they eat, they’d be fatter.
If people drank more and changed how they eat as a result of drinking more, they’d be fatter.
These are not the same!
Suppose that eating makes you fatter, and eating makes you drink more, but drinking has no direct impact on weight:
ALCOHOL ← FOOD → WEIGHT
If you did a randomized controlled trial and assigned people to drink different amounts, there would be no impact on weight. However, alcohol and weight are associated, because food acts as a confounder. You can fix the problem by controlling for food, which makes alcohol and weight unassociated. Here, controlling works.
But maybe the causal model is different. Maybe alcohol has no direct influence on weight, but alcohol causes you to eat more, and eating more makes you fatter:
ALCOHOL → FOOD → WEIGHT
Now, an RCT would show that drinking does make people fatter. But just like with the previous causal model, alcohol and weight are associated, but unassociated if you control for food.
Get that? Here's a little table:

| Causal model | Effect in an RCT | Association? | Association, controlling for food? |
|---|---|---|---|
| ALCOHOL ← FOOD → WEIGHT | none | yes | no |
| ALCOHOL → FOOD → WEIGHT | yes | yes | no |
In the two cases, an RCT will give you totally different results. But alcohol is always associated with fatness, and always unassociated once you control for food.
Why? Because there is no visible difference between the two causal models in observational data. Any dataset that comes from one causal model could just as easily come from the other one. You simply need to guess which one is right, and you better not be wrong.
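You can verify this numerically. A sketch, with invented coefficients chosen so the two models produce identical observational data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def coef_on_alcohol(weight, alcohol, *controls):
    """Coefficient on alcohol in a least-squares regression of weight."""
    X = np.column_stack([np.ones(len(weight)), alcohol, *controls])
    return np.linalg.lstsq(X, weight, rcond=None)[0][1]

# Model 1: ALCOHOL ← FOOD → WEIGHT. An RCT on alcohol would find nothing.
food1    = rng.normal(0, 1, n)
alcohol1 = 0.8 * food1 + rng.normal(0, 0.6, n)
weight1  = 0.8 * food1 + rng.normal(0, 0.6, n)

# Model 2: ALCOHOL → FOOD → WEIGHT. An RCT on alcohol would find an
# effect of 0.8 × 0.8 = 0.64.
alcohol2 = rng.normal(0, 1, n)
food2    = 0.8 * alcohol2 + rng.normal(0, 0.6, n)
weight2  = 0.8 * food2 + rng.normal(0, 0.6, n)

print(coef_on_alcohol(weight1, alcohol1),
      coef_on_alcohol(weight2, alcohol2))          # both ≈ 0.64: always associated...
print(coef_on_alcohol(weight1, alcohol1, food1),
      coef_on_alcohol(weight2, alcohol2, food2))   # both ≈ 0: ...never, once controlled.
```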
Problem 4: More dependent features
It gets worse. In humans, it’s likely that alcohol and food both influence each other in a feedback loop. And it’s likely that both have direct impacts on weight, leading to this causal model:
ALCOHOL ↔ FOOD
↘ ↙
WEIGHT
In an RCT, drinking will again show an impact on weight. But what about the association between alcohol and weight in observational data? Without controlling for food, the association will be stronger than in an RCT, due to food acting as a confounder. But if you do control for food, the association will be weaker than in an RCT, since the control blocks the indirect impact of alcohol on weight via food.
You’re screwed either way. The only way to get it right would be to somehow guess exactly what fraction of the ALCOHOL ↔ FOOD association is due to causality flowing in either direction. Which is invisible in the data. Good luck with that.
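Here's a sketch of the feedback-loop case, with invented coefficients, simulating the loop by solving the two simultaneous equations for their fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Invented structural equations with a feedback loop:
#   alcohol = e_a + 0.5 × food
#   food    = e_f + 0.5 × alcohol
#   weight  = 0.3 × alcohol + 0.5 × food + e_w
e_a, e_f, e_w = rng.normal(0, 1, (3, n))
alcohol = (e_a + 0.5 * e_f) / (1 - 0.5 * 0.5)   # solved fixed point
food    = (e_f + 0.5 * e_a) / (1 - 0.5 * 0.5)
weight  = 0.3 * alcohol + 0.5 * food + e_w

def coef_on_alcohol(*controls):
    X = np.column_stack([np.ones(n), alcohol, *controls])
    return np.linalg.lstsq(X, weight, rcond=None)[0][1]

# An RCT (set alcohol, let food respond) would find 0.3 + 0.5 × 0.5 = 0.55.
print(coef_on_alcohol())      # ≈ 0.70: not controlling overshoots
print(coef_on_alcohol(food))  # ≈ 0.30: controlling undershoots
```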
But wait, there’s more! Maybe weight also influences what people consume, so everything causes everything else:
ALCOHOL ↔ FOOD
⤡ ⤢
WEIGHT
Abandon all hope ye who encounter this causal model.
Problem 5: Everything else
So far, I’ve focused on the most serious problem with “controlling”: It only works if influence flows in a simple unidirectional way and you already know which direction. But there are many other issues, too:
Missing causes. There might be important variables that are missing from the model. Sometimes these are things that you just failed to measure (like EXERCISE). Sometimes these are ethereal and unmeasurable (like PSYCHOLOGICAL COMMITMENT TO A HEALTHY LIFESTYLE).
Linearity. Fitting linear models assumes interactions are linear. Often there's a feeble attempt to address this by testing what happens if you add interactions or quadratic terms. But there's no guarantee that a quadratic model is sufficient either. And people often do weird things like compute a p-value for the quadratic term, observe that the p-value is large, and conclude that the quadratic term isn't needed. (That isn't how p-values work: a large p-value means you failed to find evidence that the term matters, not that you found evidence it doesn't.)
Coding. Everything depends on how you code variables. Like, when people say they control for "education", what they actually did is choose some finite set of categories, e.g. (high school or less) / (some college) / (4-year degree or more). But you could break these categories up differently, and when you do, the results often change.
Noisy controls. People might say they control for “diet”. But what they actually control for is “how much people claimed to eat on 5 days they were surveyed”. If they don’t remember, or they lie, or those days were unusual, what happens? Well, in the limit of pure noise, the regression will just ignore the control variables, equivalent to not controlling at all. In practice, the effect is usually somewhere in the middle, meaning that things are just partially controlled. It’s amazing how rare it is for papers to show any awareness of this issue.
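A sketch of the noisy-control problem (invented numbers): food genuinely confounds, alcohol has no real effect, and we only get to control for a noisy survey measure of food.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Food confounds alcohol and weight; alcohol itself does nothing.
food    = rng.normal(0, 1, n)
alcohol = 0.8 * food + rng.normal(0, 0.6, n)
weight  = 0.8 * food + rng.normal(0, 0.6, n)

for noise_sd in [0.0, 0.5, 1.0, 2.0]:
    surveyed = food + rng.normal(0, noise_sd, n)   # what we actually measure
    X = np.column_stack([np.ones(n), alcohol, surveyed])
    b = np.linalg.lstsq(X, weight, rcond=None)[0][1]
    print(f"survey noise {noise_sd}: alcohol coefficient {b:.2f}")
# With a perfect survey the coefficient is ≈ 0 (fully controlled); as the
# noise grows it creeps back toward the uncontrolled value of ≈ 0.64.
```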
Association doublethink
Of course, there’s a place for observational studies. For many things we care about, true experiments are impossible—we can’t run RCTs where we assign people to breathe different amounts of nitrogen dioxide or particulate air pollution for their entire lives. So we have to content ourselves with observational estimates, as imperfect as they are.
And of course, all the problems I’ve pointed out are well-known. This is why journals won’t let you do an observational study and then say that you’ve proven anything about causality.
So what happens? Well, many people accept this. They insist on RCTs whenever possible. (Just try getting a drug approved with observational data!) When an RCT can’t be done, they’ll try to use natural experiments or instrumental variables as an approximation. When even that’s impossible, they’ll use observational data, but obsess about the assumptions and treat the conclusions with great humility.
But many people don't seem to accept the harsh limitations of associations and controls. And sometimes these people form entire scientific sub-communities. They train each other, accept each other's papers, and give quotes to journalists when someone's paper is published. Here's what they do:
Take whatever variables they can measure, and throw them into off-the-shelf statistical packages.
Don’t worry about (or, maybe, understand) all the assumptions that are being made. Certainly, don’t discuss them.
When writing papers, never say “cause”. Instead use words like “associated” or “risk factor” exactly as if they meant “cause”.
For example, take the paper Artificial sweeteners and cancer risk: Results from the NutriNet-Santé population-based cohort study. Here are some quotes:
In this large cohort of 102,865 French adults, artificial sweeteners […] were associated with increased overall cancer risk
Our findings do not support the use of artificial sweeteners as safe alternatives for sugar in foods or beverages and provide important and novel information to address the controversies about their potential adverse health effects.
And here’s one from the press release—picked up by all the usual copypasta places and presumably read by the general public.
This large-scale, prospective study suggests that artificial sweeteners, used in many foods and beverages in France and throughout the world, may represent an increased risk factor for cancer.
What game is being played here? Are we seriously supposed to pretend that the goal is anything other than to make people think that artificial sweeteners cause cancer?
I think this is disgraceful. If you think you’ve shown something about causality, then say so explicitly. Have the courage to argue for what you think is true. Drop this gambit of bending words so you can act like you’ve proven causality, but not have to defend your conclusions.
This is bad. But I do have some sympathy for the authors. If everyone you know is distorting words like this, it’s natural to think this is the normal way to use those words. And if everyone you know is doing research like this, it’s natural to think that this is the right way to do research. Because if it wasn’t…
Now, I like cathedrals. Beyond simply being pretty, they are a celebration of human agency, a demonstration of what dedicated people can accomplish over generations. Misleading observational studies have fewer obvious benefits.
I find instrumental variables an ingenious solution to many of the ills mentioned above. Somehow IV is still largely confined to economics, probably because econ is full of simultaneous-equation problems and because genuine instruments are fiendishly difficult to find. Luigi Pascali's The Wind of Change: Maritime Technology, Trade, and Economic Development is a great example of this approach to analyzing observational data. Using the change in effective distance between countries caused by the switch from sail to steam-powered ships as an instrument, Pascali estimates the effect of trade on per capita income. The key point is that the size of the change in shipping time from steamship technology (the effective distance) is a geographical feature, largely determined by each country's location relative to the relevant trade winds. So the change in effective distance cannot itself be an effect of per capita income, nor affect per capita income other than through trade. This lets the author extract the causal arrow from the data, rather than assuming it a priori. Pretty cool.
Like a better-written and less long-winded version of what Judea Pearl was talking about in "The Book of Why".
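For concreteness, here's a minimal sketch of two-stage least squares on simulated data. The model and all its numbers are invented, not Pascali's: an unobserved confounder contaminates the naive regression, but the instrument moves income only through trade.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def ols(y, *xs):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Invented structural model. The instrument plays the role of the change in
# effective distance: it shifts trade but touches income through no other path.
confounder = rng.normal(0, 1, n)   # unobserved
instrument = rng.normal(0, 1, n)
trade  = 0.7 * instrument + 0.8 * confounder + rng.normal(0, 1, n)
income = 0.5 * trade      + 0.8 * confounder + rng.normal(0, 1, n)  # true effect: 0.5

print(ols(income, trade)[1])        # ≈ 0.80: naive OLS is confounded

# Stage 1: predict trade from the instrument alone.
a, b = ols(trade, instrument)
trade_hat = a + b * instrument
# Stage 2: regress income on the predicted trade.
print(ols(income, trade_hat)[1])    # ≈ 0.50: recovers the true causal effect
```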