Everyone here is probably familiar with the reproducibility crisis in psychology and various other fields. I've recently been thinking there's something very odd about all this. Namely, the reproducibility crisis seems to be almost entirely about randomized controlled trials (RCTs). In terms of statistical design, these are the absolute gold standard! Yet my impression is that the majority of results in the social sciences are based on observational studies, not RCTs. While there's usually at least some attempt to control for confounders, I feel like all the problems that have contributed to the reproducibility crisis so far are 10x worse here: there are so many more degrees of freedom in how you could set up the analysis.

Is my perception that the reproducibility crisis hasn't really gotten to observational studies yet correct? If so, why not? And am I right to think that if/when these start getting checked, they are likely to be found even more unreliable?

I find it so puzzling that these seem to have mostly escaped scrutiny so far, and wonder if there's a whole movement somewhere that I just haven't encountered.


1 Answer

One reason is that experiments are designed to be repeatable. So we repeat them or versions of them. Observational studies are really just sophisticated analyses of something that has happened, so one may more successfully appeal to "this time/context is different."

Another reason is that observational studies often rely on proprietary/secret data. Publicly available datasets are less susceptible to this excuse, but sometimes those are paired with proprietary data to develop the research.

I was literally in a seminar this week where Person A was explaining their analysis of proprietary data, which found a substitution effect. Person B said they had their own paper with a different proprietary dataset that found a complementarity effect. Without fraud, with perfectly reproducible code, and with utter robustness to alternative specifications, these effects could still coexist. But good luck testing robustness beyond the robustness tests done by the original author(s), good luck testing if the analysis is reproducible, and good luck testing whether any of the data is fraudulent... without the data.

I'm not even advocating for open data, just explaining that trade secrecy and claims of "uniqueness" make observational studies less conducive to that kind of scrutiny. For what it's worth, reviewers tend to demand, and authors tend to provide, more specification robustness tests for observational than for experimental data. Some of that has to do with the judgment calls inherent in trying to handle endogeneity issues in observational research (experiments can appeal to random assignment), and some with covering bases with respect to those "so many more [researcher] degrees of freedom."

A third reason is simply that the crisis struck effects tested with low-powered studies. When only statistically significant results get published, a low-powered study reaches significance only if its estimate happens to land far from zero, so the published effects were more likely to be overestimates of the true effects. The typically much larger N of observational studies mitigates this source of overestimation but runs the risk of finding practically trivial effects (so sometimes you'll see justifications for how the effect size is actually practically meaningful).
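To make the significance-filter point concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything from a real study: the function name `mean_published_effect`, the true effect of 0.2, the group sizes, a two-group comparison with known unit variance, and a two-sided z-test at the 5% level. Under those assumptions the estimated difference in means is drawn directly from its sampling distribution, and a study counts as "published" only if it is significant.

```python
import math
import random
import statistics


def mean_published_effect(true_effect, n_per_group, n_studies=2000, seed=0):
    """Return (mean estimate among significant studies, share significant).

    Assumes two groups with known unit variance, so the difference in
    means is Normal(true_effect, sqrt(2 / n_per_group)).
    """
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z_crit = 1.96  # two-sided 5% critical value
    published = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)  # one study's estimated effect
        if abs(estimate / se) > z_crit:  # the "significance filter"
            published.append(estimate)
    if not published:
        return float("nan"), 0.0
    return statistics.fmean(published), len(published) / n_studies


# Hypothetical numbers: same true effect, small vs. large sample.
small_mean, small_share = mean_published_effect(true_effect=0.2, n_per_group=20)
large_mean, large_share = mean_published_effect(true_effect=0.2, n_per_group=2000)
print(f"N=20 per group:   published mean {small_mean:.2f}, share significant {small_share:.2f}")
print(f"N=2000 per group: published mean {large_mean:.2f}, share significant {large_share:.2f}")
```

In the small-sample case only estimates far from zero clear the threshold, so the average published effect should come out well above the true 0.2. In the large-sample case nearly every study is significant and the published average should sit close to the true value, which is the sense in which a large N mitigates this particular source of overestimation.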

2 comments

Just as an aside, I would edit the title of this question to match the text, something like: "Why has the replication crisis affected RCT studies but not observational studies?"