LessWrong has a fair bit of advice on how to evaluate the claims made in scientific papers. Most of this advice seems to focus on a single-shot use case - e.g. a paper claims that taking hydroxyhypotheticol reduces the risk of malignant examplitis, and we want to know how much confidence to put in the claim. It’s very black-box-y: there’s a claim that if you put X (hydroxyhypotheticol) into the black box (a human/mouse) then Y (reduced malignant examplitis) will come out. Most of the advice I see on evaluating such claims focuses on statistics, incentives, and replication - good general-purpose epistemic tools which can be applied to black-box questions.
But for me, this black-box-y use case doesn’t really reflect what I’m usually looking for when I read scientific papers.
My goal is usually not to evaluate a single black-box claim in isolation, but rather to build a gears-level model of the system in question. I care about whether hydroxyhypotheticol reduces malignant examplitis only to the extent that it might tell me something about the internal workings of the system. I’m not here to get a quick win by noticing an underutilized dietary supplement; I’m here for the long game, and that means making the investment to understand the system.
With that in mind, this post contains a handful of thoughts on building gears-level models from papers. Of course, general-purpose epistemic tools (statistics, incentives, etc) are still relevant - a study which is simply wrong is unlikely to be much use for anything. So the thoughts and advice below all assume general-purpose epistemic hygiene as a baseline - they are things which seem more/less important when building gears-level models, relative to their importance for black-box claims.
I’m also curious to hear other people’s thoughts/advice on paper reading specifically to build gears-level models.
Ultimately, we want a magic bullet to cure examplitis. But the closer a paper is to that goal, the stronger publication bias and other memetic distortions will be. A flashy, exciting result picked up by journalists will get a lot more eyeballs than a failed replication attempt.
But what about a study examining the details of the interaction between FOXO, SIRT6, and WNT-family signalling molecules? That paper will not ever make the news circuit - laypeople have no idea what those molecules are or why they’re interesting. There isn’t really a “negative result” in that kind of study - there’s just an open question: “do these things interact, and how?”. Any result is interesting and likely to be published, even though you won’t hear about it on CNN.
In general, as we move more toward boring internal gear details that the outside world doesn’t really care about, we don’t need to worry as much about incentives - or at least not the same kinds of incentives.
Few people want to start a fight with others in their field, even when those others are wrong. There is little incentive to falsify the theory of somebody who may review your future papers or show up to your talk at a conference. It’s much easier to say “examplitis is a complex multifactorial disease and all these different lines of research are valuable and important, kumbaya”.
The result is zombie theories: theories which are pretty obviously false if you spend an hour looking at the available evidence, but which are still repeated in background sections and review articles.
One particularly egregious example I’ve seen is the idea that a shift in the collagen:elastin ratio is (at least partially) responsible for the increased stiffness of blood vessels in old age. You can find this theory in review articles and even textbooks. It’s a nice theory: new elastin is not produced in adult vasculature, and collagen is much stiffer, so over time we’d expect the elastin to break down and collagen to bear more stress, increasing overall stiffness. But if we go look for studies which directly measure the collagen:elastin ratio in the blood vessels… we mostly find no significant change with age (rat, human, rat); one study even finds more elastin relative to collagen in older humans.
Scientists say lots of things which are misleading, easily confused, or not actually supported by their experiments. That doesn’t mean the experiment is useless; it just means we should ignore the mouth-motions and look at what the experiment and results actually were. As an added bonus, this also helps prevent misinterpreting what the paper authors meant.
An example: many authors assert that both (1) atherosclerosis is a universal marker of old age among humans and most other mammals, and (2) atherosclerosis is practically absent among most third-world populations. What are we to make of this? Ignore the mouth motions, look for data. In this case, it looks like atherosclerosis does universally grow very rapidly with age in all populations examined, but still has much lower overall levels among third-world populations after controlling for age - e.g. ~⅓ as prevalent in most age brackets in 1950s India compared to Boston.
For replication, you want papers which are as similar as possible, and establishing very high statistical significance matters. For gears-level models, you want papers which do very different things, but impinge on the same gears. You want to test a whole model rather than a particular claim, so finding qualitatively different tests is more important than establishing very high statistical significance. (You still need enough statistics to make sure any particular result isn’t just noise, but high confidence will ultimately be established by incrementally updating on many different kinds of studies.)
For example, suppose I’m interested in the role of thymic involution as a cause of cancer. The thymus is an organ which teaches new adaptive immune cells (T-cells) to distinguish our own bodies from invaders, and it shrinks (“involutes”) as we age.
Rather than just looking for thymus-cancer studies directly, I move away from the goal and look for general information on the gears of thymic involution. Eventually I find that castration of aged mice (18-24 mo) leads to complete restoration of the thymus in about 2 weeks. The entire organ regrows, and the T-cells return to the parameters seen in young mice. (Replicated here.) Obvious next question: does castration reduce cancer? It’s used as a treatment for e.g. prostate cancer, but that’s (supposedly) a different mechanism. Looking for more general results turns up this century-old study, which finds that castration prevents age-related cancer in mice - and quite dramatically so. The rate of resistance to an implanted tumor was ~50% in castrated old mice, vs ~5% for controls. (This study finds a similar result in rabbits.) Even more interesting: castration did not change the rate of tumor resistance in young mice - exactly what the thymus-mediation theory would predict.
This should not, by itself, lead to very high confidence about the castration -> thymus -> T-cell -> cancer model. We need more qualitatively different studies (especially in humans), and we need at least a couple studies looking directly at the thymus -> cancer link. But if we find a bunch of different results, each with about this level of support for the theory, covering interventions on each of the relevant variables, then we should have reasonable confidence in the model. It’s not about finding a single paper which proves the theory for all time; it’s about building up Bayesian evidence from many qualitatively different studies.
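To make the "many modest studies" arithmetic concrete, here is a minimal sketch of updating in log-odds form. Everything numeric in it is hypothetical: the 5% prior, the likelihood ratio of 3 per study, and the count of six studies are illustrative assumptions, not values taken from any paper above. It also assumes the studies are independent given the model, which is exactly why qualitatively different studies matter more than near-duplicates.

```python
import math

def posterior_probability(prior, likelihood_ratios):
    """Update a prior probability with a list of likelihood ratios.

    Each ratio is P(result | model true) / P(result | model false).
    Multiplying ratios (adding log-odds) is only valid if the studies
    are conditionally independent given the model - an assumption here.
    """
    log_odds = math.log(prior / (1 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical numbers, chosen only for illustration.
prior = 0.05                      # initial credence in the whole model
one_study = [3.0]                 # a single modestly supportive result
six_studies = [3.0] * 6           # six dissimilar, modestly supportive results

single = posterior_probability(prior, one_study)
combined = posterior_probability(prior, six_studies)

print(f"after one study:   {single:.3f}")
print(f"after six studies: {combined:.3f}")
```

Under these assumed numbers, one study barely moves the needle, while six independent ones of the same modest strength push the model to high credence. If the studies share a flaw (same lab, same assay, same mouse strain), the independence assumption fails and the combined figure overstates the evidence.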
Everything is correlated with everything else; any intervention changes everything.
That said, very few things are directly connected; the main value is finding variables which mediate causal influence. For instance, maybe hydroxyhypotheticol usually reduces malignant examplitis, but most of the effect goes away if we hold hypometabolicol levels constant. That’s a powerful finding: it establishes that hypometabolicol is one of the internal gears between hydroxyhypotheticol and examplitis.
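A toy simulation can show what "most of the effect goes away if we hold the mediator constant" looks like in data. This is a sketch under stated assumptions: the causal structure (drug -> hypometabolicol -> examplitis, with no direct drug -> examplitis arrow and no confounding), the binary variables, and all the probabilities are invented for illustration, in keeping with the made-up compounds in the text.

```python
import random

def simulate(n=200_000, seed=0):
    """Generate (drug, high_hypometabolicol, sick) rows from an assumed
    chain: drug -> hypometabolicol -> examplitis, no direct effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        drug = rng.random() < 0.5                        # randomized treatment
        high_m = rng.random() < (0.8 if drug else 0.2)   # drug raises mediator
        sick = rng.random() < (0.1 if high_m else 0.4)   # mediator lowers risk
        rows.append((drug, high_m, sick))
    return rows

def risk_difference(rows):
    """P(sick | drug) - P(sick | no drug) within the given rows."""
    treated = [sick for drug, _, sick in rows if drug]
    control = [sick for drug, _, sick in rows if not drug]
    return sum(treated) / len(treated) - sum(control) / len(control)

rows = simulate()
overall = risk_difference(rows)
within_high = risk_difference([r for r in rows if r[1]])
within_low = risk_difference([r for r in rows if not r[1]])

print(f"overall effect of drug:          {overall:+.3f}")
print(f"effect with mediator held high:  {within_high:+.3f}")
print(f"effect with mediator held low:   {within_low:+.3f}")
```

In this assumed setup the drug clearly reduces disease overall, but within each level of the mediator the difference is close to zero. One caveat on the design: in real observational data, conditioning on a mediator can introduce bias if something else affects both the mediator and the outcome, so a vanishing effect is evidence for mediation, not proof of it.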
If I had to pick the single most important guideline for building gears-level models from papers, this would be it: mediation is the main thing we’re looking for.
This post proposes 4 ideas to help build gears-level models from papers that have already passed the standard epistemic checks (statistics, incentives):
(The second section, “Zombie Theories”, sounds more like an epistemic check than gears-level modeling to me.)
I didn’t read this post before today, so it’s hard to judge the influence it will have on me. Still, I can already say that the first idea (move away from the goal) is one I had never encountered, and by itself it probably helps a lot in literature search and paper reading. The other three ideas are more obvious to me, but I’m glad that they’re stated somewhere in detail. The examples drawn from biology also definitely help.
This was super useful for me. Reading this post was causal in starting to figure out how to do statistical analysis on my phenomenology-based psychological models. We'll see where this goes, but it might be enough to convert my qualitative model-building into quantitative science!
I wonder how hard it would be to formalize this claim about mediation in terms of the space of causal DAGs. I haven't done the work to try it so I'm mostly spitballing.
Informally, I associate mediation with the front-door criterion in causality. So the usefulness of mediation should reflect that the front-door criterion is empirically easier to satisfy than the back-door (and other) criteria, maybe because real causal chains tend to be narrow but long? Thinking about it a bit more, it's probably more like the min cuts of real-world causal graphs tend to be relatively small?
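For reference, here is the standard front-door adjustment formula, stated under the usual textbook assumptions: a mediator $M$ intercepts every directed path from $X$ to $Y$, there is no unblocked back-door path from $X$ to $M$, and every back-door path from $M$ to $Y$ is blocked by $X$. The symbols $X$, $M$, $Y$ are placeholders, not variables from the post.

```latex
P(y \mid \mathrm{do}(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The right-hand side uses only observational quantities, and the only requirement on the data is that the mediator is measured. That is roughly why a small set of mediating variables (a small cut between $X$ and $Y$) would make the criterion comparatively easy to apply.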
Curated. Figuring out how to make sense of other people's research seems like one of the most important epistemological questions. I appreciated both the theory here and the concrete examples.
A well-written post on an important facet of intellectual progress.