Bayes' Law is About Multiple Hypothesis Testing

by abramdemski, 4th May 2018

I've called outside view the main debiasing technique, and I somewhat stand by that, not only because base-rate neglect can account for a variety of other biases, but also because outside view is about working on the policy level, which you have to do to implement other debiasing strategies.

Nonetheless, I am here today to tell you why the Method of Multiple Working Hypotheses is a central technique. T. C. Chamberlin wrote about it in 1897. More recently, Heuer discusses a very similar technique in Psychology of Intelligence Analysis, which served for a time as the debiasing handbook for the CIA. Heuer called his version Analysis of Competing Hypotheses.

(So, we could call it Method of Multiple Hypotheses (MMH), Analysis of Competing Hypotheses (ACH), or perhaps Analysis of Alternative Hypotheses (AAH) -- it seems doomed to be abbreviated as some variety of grunt.)

Heuer found that asking people to articulate the assumptions behind their assertions did not work very well -- analysts tend to insist that their conclusions follow directly from looking at the data, with no assumptions in between. (It is difficult to see the lens which you use to see!) However, if you instead ask people to compare their conclusions to other possibilities, they start noticing the assumptions which pointed them in one direction rather than another.

In order to make it stick in people's heads, I want to explain why it is just about inevitable from Bayes' Law.

Bayes' Law compares hypotheses to each other in terms of their likelihood ratios, balanced by the priors. Testing a single hypothesis feels meaningful, perhaps because in logical/deterministic cases we sometimes can prove or disprove something on its own. In the general case, though, we have to compare a hypothesis to alternatives to say anything meaningful. It's much like trying to evaluate a plan in isolation -- you can figure out a probability of success, or an expected value, but this is meaningless in isolation. You need to compare it to alternatives to know anything about whether you want to enact the plan. And, not just any alternatives; the best alternatives you can come up with.

Similarly, it only makes sense to evaluate hypotheses by looking at their relative likelihoods in comparison to a number of other hypotheses, and relative prior probabilities.

This is the thing which null hypothesis testing is sweeping under the rug. Null hypothesis testing attempts to fake testing a single hypothesis in isolation by comparing it to a "null" hypothesis which is taken to be the default thing we would believe. This often makes enough sense to not be glaringly terrible, but misrepresents the epistemics. There should not be special hypotheses which we consider "default".

A common way of writing Bayes' Law makes it look as if you can judge probability in isolation:

$$P(h \mid e) = \frac{P(e \mid h)\,P(h)}{P(e)}$$

Variables 'h' and 'e' here are supposed to remind us of 'hypothesis' and 'evidence'. It looks like we're able to evaluate a hypothesis on its own merits. However, another common statement of the law shows some of the complexity by expanding out the denominator:

$$P(h \mid e) = \frac{P(e \mid h)\,P(h)}{P(e \mid h)\,P(h) + P(e \mid \neg h)\,P(\neg h)}$$

In words: we can judge hypotheses in isolation by multiplying their prior probability $P(h)$ by their likelihood $P(e \mid h)$. We could call $P(e \mid h)\,P(h)$ the "goodness" of $h$. This doesn't give a number which sums to one, though; we have to normalize, by dividing the "goodness" of each hypothesis by the total "goodness" of all hypotheses. The resulting number is between 0 and 1, so it can be a probability; indeed, it is the posterior probability.
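The normalization step can be sketched in a few lines of Python. This is only an illustration: the function name `posterior_binary` and the numbers are made up for the example, and the only alternative considered is the bare negation of the hypothesis.

```python
def posterior_binary(prior_h, lik_h, lik_not_h):
    """Posterior P(h|e) when the only alternative is not-h.

    prior_h   -- P(h)
    lik_h     -- P(e|h)
    lik_not_h -- P(e|not-h)
    """
    goodness_h = lik_h * prior_h                  # "goodness" of h
    goodness_not_h = lik_not_h * (1.0 - prior_h)  # "goodness" of not-h
    # The two goodness values need not sum to one, so normalize.
    return goodness_h / (goodness_h + goodness_not_h)


# Illustrative numbers only (assumptions, not taken from any real case).
p = posterior_binary(prior_h=0.2, lik_h=0.9, lik_not_h=0.3)
print(round(p, 3))  # 0.429
```

Here the two "goodness" values are 0.18 and 0.24, which sum to 0.42 rather than 1; dividing by that total is what turns them into probabilities.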

However, note that the revised formulation represents alternatives to $h$ simply by the negation, $\neg h$. This is still hiding a lot of complexity. How do we compute the "goodness" of not-$h$? In simple situations, this might be clear. But, to my mind, this invites the same sort of mistakes which can be made in null hypothesis testing: testing against a straw "default" hypothesis, rather than against the strongest alternative hypotheses you can think of.

Yet another common form of Bayes' Law unpacks this simplification. We consider a family of hypotheses, $h_i$:

$$P(h_i \mid e) = \frac{P(e \mid h_i)\,P(h_i)}{\sum_j P(e \mid h_j)\,P(h_j)}$$

Now we've got it: we see the need to enumerate every hypothesis we can in order to test even one hypothesis properly. The previous use of $\neg h$ was just hiding "all the other hypotheses", and the original denominator, $P(e)$, hid it further still.
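To see how much the choice of alternatives matters, here is a minimal sketch that normalizes over an explicit family of hypotheses. The hypothesis names (`h`, `straw`, `strong_alt`) and all numbers are hypothetical, and the hypotheses in each family are assumed to be mutually exclusive and exhaustive.

```python
def posteriors(priors, likelihoods):
    """Normalize prior * likelihood over an explicit family of hypotheses.

    priors      -- {name: P(h_i)}, assumed to sum to 1
    likelihoods -- {name: P(e|h_i)}
    """
    goodness = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(goodness.values())  # this total plays the role of P(e)
    return {h: g / total for h, g in goodness.items()}


# Testing h against a weak "default" alternative only.
vs_straw = posteriors(
    priors={"h": 0.5, "straw": 0.5},
    likelihoods={"h": 0.8, "straw": 0.1},
)

# Same evidence, but a strong competitor has been added to the family.
vs_strong = posteriors(
    priors={"h": 0.4, "straw": 0.3, "strong_alt": 0.3},
    likelihoods={"h": 0.8, "straw": 0.1, "strong_alt": 0.9},
)

print(round(vs_straw["h"], 3), round(vs_strong["h"], 3))
```

With these made-up numbers, `h` looks very well supported against the straw alternative (about 0.89) and much less so once a serious competitor is on the list (about 0.52), even though the likelihood of `h` never changed.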

It's like... optimizing is always about evaluating more and more alternatives so that you can find better and better things. Optimizing for accurate beliefs is different only in that you want to weigh your several options together, rather than keeping only the best one at the end. But, still, how can you expect to find good hypotheses if you're not generating as many as you can and allowing them to compete on the data?

Heuer tries to get people to do this by telling them to make a grid, with all the hypotheses written on the top and all the significant pieces of evidence written on the side. Rather than figuring out exact likelihood ratios, you can write "+" or "-" to indicate very roughly how well hypotheses match up to evidence:

In fleshing out this fake example, it occurred to me that I had to also include "no data breach" to be able to examine the evidence in favor of the breach. Really, it should be split into more hypotheses (which might give alternative explanations of why Victor knew too much). As we see, the evidence in favor of the breach is actually not as strong as one might think, given the priors against it and the lack of evidence in favor of any particular type of breach. (However, we can also see how rough and potentially misleading simply writing plusses and minuses can be!)

This seems better than nothing, but I can see several problems with it:

  • It is easy to forget the "prior" -- I had to lump it in with evidence. In fact, I think Heuer doesn't put the prior in at all.
  • The chart format makes you think of "compatibility" between hypothesis and evidence in a fairly symmetric way; it doesn't jump out at you that you're supposed to be writing $P(e \mid h)$ rather than $P(h \mid e)$.
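One way to address both problems is to keep the prior as an explicit input and to read every mark in the grid as a rough likelihood of the evidence given the hypothesis. The sketch below is an assumption-heavy illustration, not Heuer's procedure: the mapping from "+" and "-" to 0.8 and 0.2 is arbitrary, the hypotheses `H1`, `H2` and evidence `E1`, `E2` are placeholders, and the pieces of evidence are treated as independent given each hypothesis.

```python
# Assumed, arbitrary translation of grid marks into rough values of P(e|h).
MARK_TO_LIKELIHOOD = {"+": 0.8, "-": 0.2}


def score_grid(priors, grid):
    """Turn a +/- grid into rough posteriors.

    priors -- {hypothesis: P(h)}
    grid   -- {evidence: {hypothesis: "+" or "-"}}, each mark read as P(e|h)
    """
    goodness = dict(priors)  # start from the prior, so it cannot be forgotten
    for marks in grid.values():
        for h in goodness:
            goodness[h] *= MARK_TO_LIKELIHOOD[marks[h]]
    total = sum(goodness.values())
    return {h: g / total for h, g in goodness.items()}


# Hypothetical grid: H1 collects more plusses but starts with a low prior.
priors = {"H1": 0.1, "H2": 0.9}
grid = {
    "E1": {"H1": "+", "H2": "-"},
    "E2": {"H1": "+", "H2": "+"},
}
post = score_grid(priors, grid)
print({h: round(p, 3) for h, p in post.items()})
```

In this toy grid `H1` has two plusses and `H2` only one, yet `H2` still comes out ahead (roughly 0.69 against 0.31) because of its prior, which is exactly the information a bare grid of plusses and minuses leaves out.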

In any case, I think the cognitive gear which Heuer and Chamberlin are pointing at is very important. It is more precise than the common pattern "try very hard to falsify your hypothesis" (though that mental movement may still prove useful), because it isn't obvious how to try to falsify a hypothesis; coming up with good alternative hypotheses is a necessary step.

When I first read about Heuer's ACH method, I remember having thoughts along the lines of "this debiases in a lot of different ways!" -- but I can't recall the biases I thought it covered, now. Fortunately, cousin_it has recently been thinking about it, and made his own attempt to list implications, which I'll quote in whole:

T.C. Chamberlin's "Method of Multiple Working Hypotheses", as discussed by Abram here, is pretty much a summary of LW epistemic rationality. The idea is that you should look at your data, your hypothesis, and the next best hypothesis that fits the data. Some applications:
  • Wason 2-4-6 task: if you receive information that 1-2-3 is okay and 2-4-6 is okay while 3-2-1 isn't, and your hypothesis is that increasing arithmetic progressions are okay, the next best hypothesis for the same data is that all increasing sequences are okay. That suggests the next experiment to try.
  • Hermione and Harry with the soda: if the soda vanishes when spilled on the robes, and your hypothesis is that the robes are magical, the next best hypothesis is that the soda is magical. That suggests the next experiment to try.
  • Einstein's arrogance: if you have a hypothesis and you've tried many next best hypotheses on the same data, you can be arrogant before seeing new data.
  • Witch trials: if the witch is scared of your questioning, and your hypothesis is that she's scared because she's guilty, the next best hypothesis is that she's scared of being killed. If your data doesn't favor one over the other, you have no business thinking about such things.
  • Mysterious answers: if you don't know anything about science, and your hypothesis is that sugar is sweet because its molecule is triangular, the next best hypothesis is that the molecule is square shaped. If your data doesn't favor one over the other, you have no business thinking about such things.
  • Religion: if you don't see any miracles, and your hypothesis is that God is hiding, the next best hypothesis is that God doesn't exist.
And so on. It's interesting how many ideas this covers.
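The Wason 2-4-6 item in the list above can be made concrete with a small sketch. The two predicates below encode the two hypotheses named there; the function names and the search range of 1 to 4 are choices made for this illustration only.

```python
from itertools import product


def arithmetic_increasing(t):
    """Hypothesis: increasing arithmetic progressions are okay."""
    a, b, c = t
    return a < b < c and b - a == c - b


def any_increasing(t):
    """Next best hypothesis: all increasing sequences are okay."""
    a, b, c = t
    return a < b < c


# The data from the example: 1-2-3 and 2-4-6 are okay, 3-2-1 is not.
observations = {(1, 2, 3): True, (2, 4, 6): True, (3, 2, 1): False}


def fits(hypothesis):
    """True if the hypothesis agrees with every observation so far."""
    return all(hypothesis(t) == ok for t, ok in observations.items())


# Both hypotheses fit the data, so the data cannot separate them.
print(fits(arithmetic_increasing), fits(any_increasing))

# Triples on which the hypotheses disagree are the experiments worth running.
discriminating = [
    t for t in product(range(1, 5), repeat=3)
    if arithmetic_increasing(t) != any_increasing(t)
]
print(discriminating)
```

Any triple in `discriminating`, such as 1-2-4, is a test whose outcome the two hypotheses predict differently, which is what "that suggests the next experiment to try" amounts to.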

The way this has entered into my personal thought patterns is: when I've come to some solid-seeming conclusion (in my own thoughts or in discussion), make it a principle to list alternatives (until the point where it has more cost than expected benefit). I think this has saved me a month or two of wasted effort on one occasion (though it is possible I would have noticed the problem sooner than that by some other means).

Happy debiasing!

5 comments
Now we've got it: we see the need to enumerate every hypothesis we can in order to test even one hypothesis properly.

A cached handle I have for this is "the negation of a hypothesis is not a hypothesis"; said another way, "the negation of a model is not a model." Insofar as a hypothesis / model is a thing that makes predictions, "not (a thing that makes predictions)" isn't a thing that makes predictions. E.g. "person X just didn't understand the concept" is not a hypothesis about what's going on when person X gets a problem wrong on a test.

Typos: "MHH" -> "MMH", "lense" -> "lens"

You seem to be a little confused about hypothesis testing: the null hypothesis is the one that actually makes predictions, whereas the "alternative hypothesis" is the one that just states "the null hypothesis is false" (and so doesn't make predictions). Of course, the null hypothesis is often a strawman, while the actual most plausible hypotheses go unrepresented in the calculation (but are supposedly made more plausible by the failure of the null hypothesis).

I think the chart's ambiguity between $P(e \mid h)$ and $P(h \mid e)$ isn't too serious, since I think $P(e \mid h)$ lines up more with the intuitive concept of "compatibility" than $P(h \mid e)$ does.

Finally, to push back slightly on your main argument, sometimes the most important hypotheses are the ones that you can't state explicitly right now. In which case maybe you need some sort of "default hypothesis" to represent this possibility. Though such calculations are certainly something to be more skeptical of.

Ah... yeah, I forgot that the non-null hypothesis being tested isn't explicitly represented.

Finally, to push back slightly on your main argument, sometimes the most important hypotheses are the ones that you can't state explicitly right now. In which case maybe you need some sort of "default hypothesis" to represent this possibility. Though such calculations are certainly something to be more skeptical of.

I think I've seen a paper put forward that kind of approach (I don't remember enough to find it right now), but yeah, it is hard to see how a "default hypothesis" can be representative enough of all the neglected hypotheses.

Taking a logical-induction approach to the problem, we could say: it is possible to have a principled estimate of the probability which does not add up to the average probability assigned by all the hypotheses we can explicitly write down, because we can learn adjustment heuristics through experience (such as "probabilities estimated from the explicit hypotheses I can think of to write down tend to be overconfident by about x%").

It started to look like AIXI, where you create all possible hypotheses and weigh them against one another. In AIXI, the simplest hypotheses in the Kolmogorov-complexity sense are regarded as the best alternatives.
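As a rough sketch of that kind of weighting: a Solomonoff-style prior gives each hypothesis a weight of 2 to the power of minus its description length. Kolmogorov complexity itself is uncomputable, so the lengths below are hand-assigned placeholders, and the function name `simplicity_prior` is made up for this example.

```python
def simplicity_prior(description_lengths):
    """Prior proportional to 2 ** -length, normalized over the listed hypotheses.

    description_lengths -- {hypothesis: length in bits}; the lengths are an
    assumed stand-in for Kolmogorov complexity, which cannot be computed.
    """
    weights = {h: 2.0 ** -n for h, n in description_lengths.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


prior = simplicity_prior({"short": 3, "medium": 5, "long": 8})
print({h: round(p, 3) for h, p in prior.items()})
```

Each extra bit of description halves the weight, so shorter hypotheses dominate the prior unless the evidence pushes back.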

I tried to implement something similar when I created my roadmaps, where I listed all the possible ways something could happen (mostly different x-risks). Typically, I limited myself to around 100 ideas, as I have an intuition that the first 100 hypotheses are enough. However, this intuition is not yet supported, and I would be interested in ways to estimate how many hypotheses should be listed before the correct one is in the list. If the number is very large, in the 1000s, then listing hypotheses is not productive.

I would be interested in ways to estimate how many hypotheses should be listed before the correct one is in the list. If the number is very large, in the 1000s, then listing hypotheses is not productive.

I think the better question is how many hypotheses should be listed before the value of information is too low to be worth continuing.

If you think (exactly) one of the hypotheses is correct, then the prior probability that you have already included the correct one in your list is exactly the sum of the prior probabilities of all hypotheses so far. The posterior probability that your list contains the correct hypothesis cannot be computed, though, since it requires knowledge of the prior probability of the observed evidence (which requires summing over all the hypotheses, including the ones you didn't list yet).

(If more than one hypothesis can be correct due to several hypotheses being equivalent, the probability is higher.)

If there is a chance reality is not any hypothesis you would ever list, then you could multiply the above calculation by the probability reality is one of the hypotheses you would ever list.
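The calculation above can be sketched as follows. The prior coverage is just the sum of the listed priors; the posterior coverage cannot be pinned down, but if every likelihood is a probability between 0 and 1 it can at least be bounded. The function name `coverage_bounds`, the bounding argument, and the numbers are assumptions added for illustration; they are not part of the argument above.

```python
def coverage_bounds(listed_priors, listed_likelihoods):
    """Prior coverage of a hypothesis list, plus bounds on posterior coverage.

    listed_priors      -- P(h_i) for the hypotheses listed so far
    listed_likelihoods -- P(e|h_i) for the same hypotheses
    Assumes exactly one hypothesis (listed or not) is correct and that
    unlisted likelihoods lie somewhere between 0 and 1.
    """
    prior_coverage = sum(listed_priors)
    listed_goodness = sum(p * l for p, l in zip(listed_priors, listed_likelihoods))
    if listed_goodness <= 0.0:
        raise ValueError("listed hypotheses give the evidence zero probability")
    unlisted_prior = 1.0 - prior_coverage
    # Worst case: every unlisted hypothesis predicted the evidence with certainty.
    lower = listed_goodness / (listed_goodness + unlisted_prior)
    # Best case: every unlisted hypothesis ruled the evidence out.
    upper = 1.0
    return prior_coverage, lower, upper


# Illustrative numbers only.
coverage, lower, upper = coverage_bounds([0.3, 0.2], [0.5, 0.1])
print(round(coverage, 3), round(lower, 3), upper)
```

With these made-up numbers the list covers half of the prior mass, while the posterior coverage could be anywhere from about 0.25 to 1, which shows how little can be said without knowing something about the unlisted hypotheses.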

All this seems rather artificial, since it assumes the probabilities in the prior are meaningful, but it seems to me that if we're asking what the probability that we've already listed the correct hypothesis is, we don't want to trust the prior. But, what else can you do?

However, getting the correct hypothesis in your list is much less important than getting hypotheses which are good enough to help you make accurate decisions later. That's why I said value of information seems like the more relevant measurement. It seems like this can't be estimated without knowing anything about the hypotheses you haven't listed yet, though.