It is fashionable, on LessWrong and also everywhere else, to advocate for a transition away from p-values. p-values have many known issues: p-hacking is possible and difficult to prevent, testing one hypothesis at a time cannot even in principle be correct, et cetera. I should mention here, because I will not mention it again, that these critiques are correct and very important - people are not wrong to notice these problems and I don't intend to dismiss them. Furthermore, it’s true that a perfect reasoner is a Bayesian reasoner, so why would we ever use an evaluative approach in science that can’t be extended into an ideal reasoning pattern?
Consider the following scenario: the Bad Chemicals Company sells a product called Dangerous Pesticide, which contains compounds that have recently been discovered to cause chronic halitosis. Alice and Bob want to know whether BCC knew, in advance of this public revelation, about the dangers their product poses. As a result of a lawsuit, several internal documents from BCC have been made public.
Alice thinks there’s a 30% chance that BCC knew about the halitosis problem in advance, whereas Bob thinks there’s a 90% chance. Both Alice and Bob agree that, if BCC didn’t know, there’s only a 5% chance that they would have produced internal research documents looking into potential causes of chronic halitosis in conjunction with Dangerous Pesticide. Now all Alice and Bob have to do is agree on the probability of such documents existing if BCC did know in advance, and they can do a Bayesian update! They won’t end up with identical posteriors, but if they agree on the relevant likelihoods, the evidence will push their beliefs in the same direction, and in this case it would leave them agreeing far more than they did before.
But they can’t agree on how to update. Alice thinks that, if BCC knew, there’s a 95% chance that they’ll discover related internal documents. Bob, being a devout conspiracy theorist, thinks the chance is only 2% - if they knew about the problem in advance, then of course they would have been tipped off about the investigation in advance, they have spies everywhere and they’re not that sloppy, and why wouldn't the government just classify the smoking gun documents to keep the public in the dark anyway? They're already doing that about aliens, after all!
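For concreteness, here is the update the two of them are trying to perform, as a minimal sketch using the numbers above (the posterior helper is mine, just for illustration):

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for a binary hypothesis: P(H | E) after observing E."""
    joint_true = prior * p_evidence_if_true
    joint_false = (1 - prior) * p_evidence_if_false
    return joint_true / (joint_true + joint_false)

# Both agree that P(documents | BCC didn't know) = 0.05.
# If they also agreed that P(documents | BCC knew) = 0.95, finding the
# documents would pull them together:
print(posterior(0.30, 0.95, 0.05))  # Alice: ~0.89, up from 0.30
print(posterior(0.90, 0.95, 0.05))  # Bob:   ~0.99, up from 0.90

# But Bob actually puts P(documents | BCC knew) at 0.02, so for him the very
# same discovery is evidence *against* BCC having known:
print(posterior(0.90, 0.02, 0.05))  # Bob:   ~0.78, down from 0.90
```

With a shared likelihood they converge; with Bob's conspiratorial likelihood, the same piece of evidence moves them in opposite directions.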
Alice thinks this is a bit ridiculous, but she knows the relevant agreement theorems, and Bob is at least giving probabilities and sticking to them, so she persists and subdivides the hypothesis space. She thinks there’s a 30% chance that BCC knew in advance, but only a 10% chance that they were tipped off. Bob thinks there’s a 90% chance they knew in advance, and an 85% chance they were tipped off. If they knew but were not tipped off, Alice and Bob manage to agree that there’s a 96% chance of discovering the relevant internal documents.
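To see what is at stake in the subdivision, note that each person's P(documents | BCC knew) is a weighted average over the tipped-off and not-tipped-off cases. A minimal sketch, reading the 10% and 85% figures as conditional on BCC having known (the numbers above leave this slightly ambiguous):

```python
def p_docs_given_knew(p_tipped_if_knew, p_docs_if_tipped):
    """Law of total probability over the 'tipped off' sub-hypothesis."""
    p_docs_if_not_tipped = 0.96  # the one number Alice and Bob agree on
    return (p_tipped_if_knew * p_docs_if_tipped
            + (1 - p_tipped_if_knew) * p_docs_if_not_tipped)

# Whatever value the still-disputed term P(documents | knew, tipped off) takes,
# it barely moves Alice's overall likelihood (weight 0.10) but dominates Bob's
# (weight 0.85):
for p_docs_if_tipped in (0.0, 0.5, 1.0):
    print(p_docs_given_knew(0.10, p_docs_if_tipped),   # Alice: 0.864, 0.914, 0.964
          p_docs_given_knew(0.85, p_docs_if_tipped))   # Bob:   0.144, 0.569, 0.994
```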
Now they just have to agree on the probability of discovering the related internal documents if there’s a conspiracy. But again, they fail to agree. You see, Bob explains, it all depends on whether the Rothschilds are involved - the Rothschilds are of course themselves vexed with chronic halitosis, which explains why they were so involved in the invention of the breathmint, and so if there were a secret coverup about the causes of halitosis, then of course the Rothschilds would have caught wind of this through their own secret information networks and intervened, and that’s not even getting into the relevant multi-faction dynamics! At this point Alice leaves and conducts her investigation privately, deciding that reaching agreement with Bob is more trouble than it’s worth.
My point is: we can guarantee reasonable updates when Bayesian reasoners agree on how to update on every hypothesis, but it’s extremely hard to come to such an agreement, and even reasoners who agree about the probability of some hypothesis X can disagree about the probability distribution “underneath” X, such that they disagree wildly about P(E|X). In practice we don’t exhaustively enumerate every sub-hypothesis; instead we make assumptions about causal mechanisms, and those assumptions let us treat the enumeration as unnecessary. If we want to determine the gravitational constant, for example, it’s helpful to assume that the speed at which a marble falls does not meaningfully depend on its color.
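Formally, the “underneath” here is just the law of total probability: if a hypothesis X splits into sub-hypotheses S_1, ..., S_n, then

$$P(E \mid X) = \sum_i P(S_i \mid X)\, P(E \mid X, S_i),$$

so two reasoners can agree on P(X), and even on every P(E|X, S_i), while still assigning wildly different values to P(E|X), simply because they weight the sub-hypotheses differently.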
And yet how can we get away with this when others don’t share our assumptions? In the real world we rarely care about reaching rational agreement with Bob, and indeed we often have good reasons to suspect that this is impossible. But we do care about, for example, reaching rational agreement with those who believe that dark matter is merely a measurement gap, or with those who believe that AI cannot meaningfully progress beyond human intelligence with current paradigms. Disagreement about how to assign the probability mass underneath a hypothesis is the typical case. How could we reasonably come to agreement in a Bayesian framework when we cannot even in principle enumerate the relevant hypotheses, when we suspect that the correct explanation is not known to anybody at all?
Here’s one idea: enumerate, in exhaustive detail, just one hypothesis. Agree about one way the world could be - we don’t need to decide whether the Rothschilds have bad breath, let’s just live for a moment in the simple world where they aren't involved. Agree on the probability of seeing certain types of evidence if the world is exactly that way. If we cannot agree, identify the source of the disagreement and introduce more specificity. Design a repeatable experiment which, if our single hypothesis is wrong, might give results different from what that hypothesis predicts, and repeat that experiment until we get results that could not plausibly be explained by that single hypothesis. With enough repetition, even agents who have wildly different probability distributions on the complement should be able to agree that the one distinguished hypothesis is probably wrong. A one-in-a-hundred coincidence might still be the best explanation for a given result, but a one-in-a-hundred-trillion coincidence basically never is.
Not always, not only, but when you want your results to be legible and relevant to people with wildly different beliefs about the hypothesis space, you should at some point conduct a procedure along these lines.
That is to say, in the typical course of scientific discovery, you should compute a p-value.
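For concreteness, here is a minimal sketch of that last step (the experiment and the numbers are invented for illustration): nail down one fully specified hypothesis under which the interesting outcome should occur in 10% of trials, run many trials, and ask how probable a result at least as extreme as the observed one would be if that hypothesis were true.

```python
from math import comb

def upper_tail_pvalue(successes, trials, p_null):
    """P(at least `successes` successes in `trials` independent trials),
    computed under the single fully specified hypothesis `p_null`."""
    return sum(comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
               for k in range(successes, trials + 1))

# Hypothetical run: the agreed-upon hypothesis says 10% of trials should come
# up positive, but we observe 60 positives in 200 trials.
print(upper_tail_pvalue(60, 200, 0.10))  # vanishingly small: not plausibly a coincidence
```

Nothing in this calculation requires Alice and Bob to agree about what is true instead - only about what the one distinguished hypothesis predicts.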