Agreed.
I'm increasingly of the belief that a significant fraction of academic studies are essentially worthless.
Scientific studies are written to be published. Like many things, there is an element of "teaching to the test": writing what the editors and peer reviewers want to see. The gatekeepers want to see strong and novel results. They don't care so much about the functional implications. Meanwhile, in the real world, all we care about are the functional implications.
Like a better-written and less long-winded version of what Judea Pearl was talking about in "The Book of Why".
Thanks, I read a bunch of stuff from Pearl while researching this post. (I found this paper particularly useful: https://doi.org/10.1214/09-SS057) In an early version I actually started trying to explain the do-calculus and the difference between p(y|x,z) and p(y|x,do(z)) and so on, but eventually decided it wasn't necessary to go that far out into the weeds!
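(If anyone does want a concrete feel for the distinction, here's a tiny simulation with made-up toy numbers, my own sketch rather than anything from Pearl, in which a confounder z drives both x and y:)

```python
# Toy numbers, not from Pearl: a confounder z drives both x and y, so the
# observational slope of y on x is non-zero even though x has no causal
# effect on y at all.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observational world: z -> x and z -> y, no arrow from x to y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 2 * z + rng.normal(size=n)
obs_slope = np.cov(x, y)[0, 1] / np.var(x)          # ~1.0, pure confounding

# Interventional world: do(x) cuts the z -> x arrow, so x is set from outside.
x_do = rng.normal(size=n)
y_do = 2 * z + rng.normal(size=n)                    # y still depends only on z
do_slope = np.cov(x_do, y_do)[0, 1] / np.var(x_do)   # ~0.0

print(f"slope from p(y|x):     {obs_slope:.2f}")
print(f"slope from p(y|do(x)): {do_slope:.2f}")
```

The observational slope comes out near 1 even though x has no effect on y at all; intervening on x, which cuts the z -> x arrow, drops it to roughly zero.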
Think everyone here is grateful for that decision!
You managed to capture everything important about a complex topic while keeping it remarkably simple.
Not to blow smoke up your arse, but this is really good work.
I find instrumental variable approaches an ingenious solution to many of the ills mentioned above. Somehow IV is still fairly confined to the field of economics, probably because econ is full of simultaneous-equation problems and genuine instruments are fiendishly difficult to find. I find "The Wind of Change: Maritime Technology, Trade, and Economic Development" by Luigi Pascali a great example of this approach to analysing observational data. Using the change in effective distance between countries caused by the switch from sail to steam-powered ships as an instrument, Pascali estimates the effect of trade on per capita income. Note that the size of the change in shipping time from steamship technology (effective distance) is a geographical feature, largely determined by a country's location vis-à-vis the relevant trade winds. That means the change in effective distance cannot itself be an effect of per capita income, nor affect per capita income other than through trade levels. This allows the author to extract the causal arrow from the data, rather than assuming it a priori. Pretty cool.
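To sketch the mechanics with made-up numbers (nothing from Pascali's actual data; the variable names are just illustrative), two-stage least squares looks roughly like this:

```python
# Made-up toy version of the IV logic (not Pascali's data or model): the
# instrument w moves trade but only affects income through trade, while an
# unobserved confounder u biases the naive regression of income on trade.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

u = rng.normal(size=n)                         # unobserved confounder
w = rng.normal(size=n)                         # instrument, e.g. change in effective distance
trade = 0.8 * w + u + rng.normal(size=n)
income = 0.5 * trade + u + rng.normal(size=n)  # true causal effect of trade = 0.5

# Naive OLS slope is biased upward by the confounder.
ols = np.cov(trade, income)[0, 1] / np.var(trade)

# Two-stage least squares: predict trade from the instrument, then regress
# income on that prediction.
stage1 = np.cov(w, trade)[0, 1] / np.var(w)
trade_hat = stage1 * w
iv = np.cov(trade_hat, income)[0, 1] / np.var(trade_hat)

print(f"naive OLS estimate: {ols:.2f}")   # well above 0.5
print(f"2SLS estimate:      {iv:.2f}")    # close to the true 0.5
```

The naive regression gets pulled upward by the confounder, while the instrument, which only reaches income through trade, recovers something close to the true effect.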
I also came here to say Instrumental Variables is a goated idea
You may also be interested in this paper elaborating on your point that "When writing papers, never say “cause”. Instead use words like “associated” or “risk factor” exactly as if they meant “cause”." The paper argues that authors should openly acknowledge their causal assumptions and that the taboo around causal statements is bad. Additionally, it argues that with more advanced analytic techniques (I think related to DAGs and methods by Judea Pearl), better assessments of causality are possible: https://journals.sagepub.com/doi/pdf/10.1177/1745691620921521
I don't know much about causal modeling.
On another note, related to "coding", I remember another paper making this point more formally. Their point was that a measurement A (e.g., college years) is correlated with a "true construct" B (e.g., education), and controlling for A only helps insofar as A is correlated with B. Otherwise, B could still confound the relationship between X and Y.
Thanks, I like the paper. I totally agree that explicit discussions of causality would be a step forward. The current weirdness of everyone collectively using "association" but interpreting it as causality just allows people to make causal arguments without having to think deeply about what they are doing!
The coding problem you bring up is definitely a big issue (it's sort of what I called "noisy controls"). Of all the issues, this has got to be the most underrated. It's just crazy how often people use noisy measurements without seeming to realize that this means their control is partial at best.
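To make that concrete, here's a quick simulation with made-up toy numbers (my own sketch, not from the post or the paper), where the true confounder B is only observed through a noisy proxy A:

```python
# Made-up toy numbers: X and Y share a confounder B ("education"), but we only
# observe a noisy proxy A ("college years"). Controlling for A removes only
# part of the confounding; controlling for the true B removes all of it.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

B = rng.normal(size=n)              # true construct (unobserved in practice)
A = B + rng.normal(size=n)          # noisy measurement of B
X = B + rng.normal(size=n)          # "exposure", driven by B
Y = B + rng.normal(size=n)          # outcome, driven by B; X has no real effect

def slope_controlling_for(x, y, control):
    """Slope of y on x after residualizing both on the control variable."""
    bx = np.cov(control, x)[0, 1] / np.var(control)
    by = np.cov(control, y)[0, 1] / np.var(control)
    rx, ry = x - bx * control, y - by * control
    return np.cov(rx, ry)[0, 1] / np.var(rx)

print(f"no control:        {np.cov(X, Y)[0, 1] / np.var(X):.2f}")   # ~0.50
print(f"controlling for A: {slope_controlling_for(X, Y, A):.2f}")   # ~0.33, still confounded
print(f"controlling for B: {slope_controlling_for(X, Y, B):.2f}")   # ~0.00
```

Controlling for the noisy proxy shrinks the spurious slope from about 0.5 to about 0.33, but only controlling for the true construct takes it to zero.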
Quick note, off topic but worth mentioning: exercise is NOT what determines how fat people are. The truth is way more complicated than that and exercise is actually a very inefficient method to lose weight.
Succinct and to the point!
Can you recommend further reads?
So, when using regression analysis one is supposed to check whether there are collinearities and such (there's a small toy example at the end of this comment), not to mention one really does not want to do a regression if the expected relationship is not linear (and I mean one and only one y for any given X). I use regression a lot, and it really adds so much value. You are probably working with smarter people, but I am not. I work with people who think 7th grade algebra is too hard to use in everyday life. I work with people who disagree with concepts like the time value of money, or that linear programming yields correct answers.
Regression analysis is so much better than "Dale thinks we should do more of choice #4 because he likes it." Of course, sophisticated engineers who know what they are doing can use all kinds of more involved or diverse methods, but I worry when people trash simple but good methods because so many people cannot handle anything more. Any criticism ends up being used by charming morons who run so much of the American public sector. These people use statements such as "figures lie and liars figure." These are people in control of billion-dollar-plus budgets.
Regression analysis is not perfect, but it is so much better than its most common alternatives.
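For what it's worth, here is that collinearity toy example (made-up simulated data, nothing from the post): two nearly identical predictors make the individual coefficients swing from sample to sample, even though their sum and the overall fit stay sensible:

```python
# Made-up toy data: x2 is almost a copy of x1, so the individual coefficients
# are unstable across samples, while their sum stays close to the true value.
import numpy as np

def fit(seed, n=200):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)        # nearly collinear with x1
    y = 3 * x1 + rng.normal(size=n)            # true model uses only x1
    X = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

for seed in range(3):
    b0, b1, b2 = fit(seed)
    # b1 and b2 individually bounce around, but b1 + b2 stays near 3.
    print(f"seed {seed}: b1 = {b1:7.1f}, b2 = {b2:7.1f}, b1 + b2 = {b1 + b2:.2f}")
```

The moral isn't that regression is bad, just that with collinear predictors the individual coefficients shouldn't be read too literally.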