Their actual implementation is rather hand-wavy, but I think getting people pointed in the right direction is a lot more important than the particulars. The high visibility of FiveThirtyEight makes me think this is a big win for sanity. I'm curious as to what others think. Which parts were done well? What might have been done differently?

I liked the initial discussion, because broad heuristics are good for quickly evaluating things, but I think the second example really falls down. A poorly designed study shouldn't be able to affect your odds as much as a well-designed study, which is basically what his scoring system implies. He goes from 1/10 odds to 1/160 odds based on a study design which should provide very little evidence. One could argue that a poorly designed study finding a small effect should lower your odds slightly (because of publication bias, for example), or that it should raise your odds slightly because there was at least a small effect, but I find it hard to believe that it could decrease your odds substantially. Suppose it were something you felt was extremely likely (perhaps because of previous medium-quality studies), and you found an extremely poorly designed study that supported the conclusion. His reasoning would suggest that you decrease your odds from, say, 4/1 to 1/4 based on the poorly designed study!

Yeah. This is an example where using the actual formula is helpful rather than just speaking heuristically. It's actually somewhat difficult to translate from the author's hand-wavy model to the real Bayes' Theorem (and it would be totally opaque to someone who hadn't seen Bayes before).

"Study support for headline" is supposed to be the Bayes factor P(study supports headline | headline is true) / P(study supports headline | headline is false). (Well actually, everything is also conditioned on you hearing about the study.) If you actually think about that, it's clear that it should be very rare to find a study that is more likely to support its conclusion if that conclusion is not true.

EDIT: the author is not actually Nate Silver.

If you're just looking at the study, then it's quite difficult for the support ratio to be less than one. However, suppose we assume that on average, for every published study, there are 100 unpublished studies, and the one with the lowest p-value gets published. Then if a study has a p-value of .04, that particular study supports the headline. However, the fact that that study was published contradicts the headline: if the headline were true, we would expect the lowest p-value to be lower than .04.
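As a rough sketch of the model above, the two Bayes factors can be computed directly. Everything beyond "the lowest p-value out of the batch gets published" is an assumption made for illustration: each study is an independent one-sided z-test, and a true headline shifts the z-statistic by a hypothetical `effect` of 1 standard error.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def single_study_bayes_factor(p, effect):
    """Likelihood ratio for seeing p-value p from one study, ignoring selection.

    Assumes a one-sided z-test whose statistic is N(effect, 1) if the
    headline is true and N(0, 1) if it is false.
    """
    z = STD.inv_cdf(1 - p)
    return STD.pdf(z - effect) / STD.pdf(z)


def best_of_n_bayes_factor(p, effect, n):
    """Likelihood ratio for p being the LOWEST p-value among n studies."""
    z = STD.inv_cdf(1 - p)
    density_ratio = STD.pdf(z - effect) / STD.pdf(z)
    # Probability that each of the other n - 1 studies had a larger p-value.
    others_if_true = STD.cdf(z - effect) ** (n - 1)
    others_if_false = (1 - p) ** (n - 1)
    return density_ratio * others_if_true / others_if_false


# 1 published study plus 100 unpublished ones, published p-value of .04
print(single_study_bayes_factor(0.04, effect=1.0))      # greater than 1
print(best_of_n_bayes_factor(0.04, effect=1.0, n=101))  # less than 1
```

Under these assumptions the study taken alone favors the headline, while the same p-value read as "the best of 101 attempts" counts against it. The size of the reversal depends heavily on the assumed effect and on how many studies go unpublished.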

Yes, that's what I meant by "very rare": there are situations where it happens, like the model that you gave, but I don't think the ones that happen in real life are likely to contribute a very large effect. You need really insane publication bias to get a large effect there.

It is not the odds the headline is true, nor the odds the study is correct, but only the odds the study supports the headline. For that, I don't find his rule of thumb inappropriate.

No. The odds that the study supports the headline in the second example are 1/16. The formula he gives is

(final opinion on headline) = (initial gut feeling) * (study support for headline)

where the latter two are odds ratios. From context, "final opinion on headline" is pretty clearly supposed to be "opinion on whether the headline is true."
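For concreteness, here is a minimal sketch of that multiplication using the numbers quoted in this thread (1/10 prior odds, 1/16 study support). The function name `update_odds` is made up for illustration and does not come from the article.

```python
from fractions import Fraction


def update_odds(prior_odds, study_support):
    """Posterior odds = prior odds * Bayes factor (both given as odds)."""
    return prior_odds * study_support


# The second example: 1/10 gut feeling, 1/16 study support.
print(update_odds(Fraction(1, 10), Fraction(1, 16)))  # 1/160

# The hypothetical from upthread: strong prior of 4/1, same 1/16 factor.
print(update_odds(Fraction(4, 1), Fraction(1, 16)))   # 1/4
```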

I criticised it here: "Dr Leek's shortcut gives a decent approximation of the Bayes Factor. The study, though not an RCT, is a large one showing a big positive effect on relevant metrics. We would be unlikely to encounter this evidence if the headline was false.

However, in other scenarios, Leek's shortcut will give approximations that are clearly not right. Suppose that a study meets all of the criteria except that on one criterion it is fatally flawed. Suppose the sample size is far too small (e.g. there is only one participant), or rather than mortality and morbidity rates, the study measures something that doesn't matter at all like physician satisfaction with the surveys. This can cripple the study entirely, and so the Bayes Factor should not just fall from 64 to 32 - it should fall to around 1. A fatally flawed study is no longer useful evidence - it is no likelier to appear in world 1, where the headline is true, than in world 2, where it is false."

The author does a horrible job of distinguishing between odds and probabilities. I would expect "the set of people who are able to clearly recognize that he is not talking probabilities" and "people who did not already know about Bayes' theorem and gained significant knowledge from the article" to have a rather small intersection. Not only does he not explicitly draw attention to the distinction between odds and probabilities, he repeatedly uses a slash rather than a colon, and in one case uses the preposition "in" rather than "to".
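To make the distinction concrete, here is a small sketch of the conversion between the two, applied to odds figures that appear in this thread. The helper names are hypothetical.

```python
from fractions import Fraction


def odds_to_probability(odds):
    """Odds of a:b in favor correspond to probability a / (a + b)."""
    return odds / (1 + odds)


def probability_to_odds(probability):
    """Inverse conversion; undefined when probability equals 1."""
    return probability / (1 - probability)


# Odds of 1:10 are a probability of 1/11, not 1/10.
print(odds_to_probability(Fraction(1, 10)))   # 1/11
# Odds of 1:160 are a probability of 1/161.
print(odds_to_probability(Fraction(1, 160)))  # 1/161
# Odds of 4:1 are a probability of 4/5.
print(odds_to_probability(Fraction(4, 1)))    # 4/5
```

For long odds the two numbers are close, which is part of why the slash notation is easy to misread; for something like 4:1 versus 4/5 the difference is obvious.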

This example was terrible. I practically gagged. The airplane search example was much better. Lots of information that you need to combine in some coherent fashion despite its being extremely inhomogeneous in nature and quality.

(Weird: I happened to check it for the first time in months on launch day, without knowing it was up)