# 86

Previously, I defined evidence as “an event entangled, by links of cause and effect, with whatever you want to know about,” and entangled as “happening differently for different possible states of the target.” So how much entanglement—how much rational evidence—is required to support a belief?

Let’s start with a question simple enough to be mathematical: How hard would you have to entangle yourself with the lottery in order to win? Suppose there are seventy balls, drawn without replacement, and six numbers to match for the win. Then there are 131,115,985 possible winning combinations, hence a randomly selected ticket would have a 1/131,115,985 probability of winning (0.0000007%). To win the lottery, you would need evidence selective enough to visibly favor one combination over 131,115,984 alternatives.
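As a quick check on the figure above, here is a minimal sketch that counts the combinations with Python's standard library. The variable names are illustrative only.

```python
from math import comb

# Number of ways to choose 6 balls out of 70 when order does not matter.
total_combinations = comb(70, 6)

# Chance that one randomly chosen ticket is the winner.
p_win = 1 / total_combinations

print(total_combinations)   # 131115985
print(f"{p_win:.2e}")
```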

Suppose there are some tests you can perform which discriminate, probabilistically, between winning and losing lottery numbers. For example, you can punch a combination into a little black box that always beeps if the combination is the winner, and has only a 1/4 (25%) chance of beeping if the combination is wrong. In Bayesian terms, we would say the likelihood ratio is 4 to 1. This means that the box is 4 times as likely to beep when we punch in a correct combination, compared to how likely it is to beep for an incorrect combination.

There are still a whole lot of possible combinations. If you punch in 20 incorrect combinations, the box will beep on 5 of them by sheer chance (on average). If you punch in all 131,115,985 possible combinations, then while the box is certain to beep for the one winning combination, it will also beep for 32,778,996 losing combinations (on average).

So this box doesn’t let you win the lottery, but it’s better than nothing. If you used the box, your odds of winning would go from 1 in 131,115,985 to 1 in 32,778,997. You’ve made some progress toward finding your target, the truth, within the huge space of possibilities.

Suppose you can use another black box to test combinations twice, independently. Both boxes are certain to beep for the winning ticket. But the chance of a box beeping for a losing combination is 1/4 independently for each box; hence the chance of both boxes beeping for a losing combination is 1/16. We can say that the cumulative evidence, of two independent tests, has a likelihood ratio of 16:1. The number of losing lottery tickets that pass both tests will be (on average) 8,194,749.
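The averages quoted so far can be reproduced with a small sketch. It assumes, as the text does, that each box beeps for a losing combination with probability 1/4, independently of the other boxes; the function name is hypothetical.

```python
LOSING_TICKETS = 131_115_984   # every combination except the winner
FALSE_BEEP_RATE = 0.25         # chance one box beeps for a losing ticket

def expected_false_passes(n_boxes):
    """Average number of losing tickets that make all n_boxes beep."""
    return LOSING_TICKETS * FALSE_BEEP_RATE ** n_boxes

one_box = expected_false_passes(1)    # 32,778,996
two_boxes = expected_false_passes(2)  # 8,194,749
```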

Since there are 131,115,985 possible lottery tickets, you might guess that you need evidence whose strength is around 131,115,985 to 1—an event, or series of events, which is 131,115,985 times more likely to happen for a winning combination than a losing combination. Actually, this amount of evidence would only be enough to give you an even chance of winning the lottery. Why? Because if you apply a filter of that power to 131 million losing tickets, there will be, on average, one losing ticket that passes the filter. The winning ticket will also pass the filter. So you’ll be left with two tickets that passed the filter, only one of them a winner. Fifty percent odds of winning, if you can only buy one ticket.

A better way of viewing the problem: In the beginning, there is 1 winning ticket and 131,115,984 losing tickets, so your odds of winning are 1:131,115,984. If you use a single box, the odds of it beeping are 1 for a winning ticket and 0.25 for a losing ticket. So we multiply 1:131,115,984 by 1:0.25 and get 1:32,778,996. Adding another box of evidence multiplies the odds by 1:0.25 again, so now the odds are 1 winning ticket to 8,194,749 losing tickets.
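The odds bookkeeping in this paragraph can be sketched with exact fractions. This is only an illustration; `update_odds` and `odds_to_probability` are made-up helper names, and odds are stored as the ratio wins/losses.

```python
from fractions import Fraction

def update_odds(prior_odds, likelihood_ratio, n_tests=1):
    """Multiply prior odds by the likelihood ratio once per independent test."""
    return prior_odds * likelihood_ratio ** n_tests

def odds_to_probability(odds):
    """Convert odds of w:l, stored as w/l, into the probability w/(w+l)."""
    return odds / (1 + odds)

prior = Fraction(1, 131_115_984)               # 1 winner : 131,115,984 losers
after_one = update_odds(prior, Fraction(4))    # 1 : 32,778,996
after_two = update_odds(prior, Fraction(4), 2) # 1 : 8,194,749

# Odds of 1:32,778,996 correspond to a probability of 1 in 32,778,997.
p_after_one = odds_to_probability(after_one)
```

Keeping odds and probabilities in separate helpers makes the "1 in 32,778,997" versus "1:32,778,996" distinction explicit.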

It is convenient to measure evidence in bits—not like bits on a hard drive, but mathematician’s bits, which are conceptually different. Mathematician’s bits are the logarithms, base 1/2, of probabilities. For example, if there are four possible outcomes A, B, C, and D, whose probabilities are 50%, 25%, 12.5%, and 12.5%, and I tell you the outcome was “D,” then I have transmitted three bits of information to you, because I informed you of an outcome whose probability was 1/8.
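A minimal sketch of the two quantities in play, with hypothetical function names: the information carried by learning an outcome of probability p, and the evidence carried by a test result, taken here as the base-2 logarithm of the likelihood ratio used elsewhere in this post.

```python
from math import log2

def information_bits(p):
    """Bits conveyed by learning that an outcome of probability p occurred.
    Equivalent to the logarithm, base 1/2, of p."""
    return -log2(p)

def evidence_bits(p_if_true, p_if_false):
    """Bits of evidence from an observation: log2 of the likelihood ratio."""
    return log2(p_if_true / p_if_false)

bits_for_d = information_bits(1 / 8)   # outcome "D": 3 bits
bits_per_box = evidence_bits(1, 0.25)  # one beep from a 4:1 box: 2 bits
```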

It so happens that 131,115,984 is slightly less than 2 to the 27th power. So 14 boxes, or 28 bits of evidence (an event 268,435,456 times more likely to happen if the ticket-hypothesis is true than if it is false), would shift the odds from 1:131,115,984 to 268,435,456:131,115,984, which reduces to roughly 2:1. Odds of 2 to 1 mean two chances to win for each chance to lose, so the probability of winning with 28 bits of evidence is about 2/3. Adding another box, another 2 bits of evidence, would take the odds to 8:1. Adding yet another two boxes would take the odds of winning to 128:1.

So if you want to license a strong belief that you will win the lottery—arbitrarily defined as less than a 1% probability of being wrong—34 bits of evidence about the winning combination should do the trick.
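To see where the 34-bit figure comes from, here is a rough sketch that converts a number of bits of evidence into a posterior probability, starting from prior odds of 1:131,115,984. The helper name is an assumption made for illustration.

```python
from fractions import Fraction

PRIOR_ODDS = Fraction(1, 131_115_984)

def posterior_probability(bits):
    """Probability of holding the winner after `bits` bits of evidence."""
    odds = PRIOR_ODDS * 2 ** bits   # each bit doubles the odds
    return odds / (1 + odds)

for bits in (28, 30, 32, 34):
    p = posterior_probability(bits)
    print(bits, f"{float(p):.4f}")
```

With boxes worth 2 bits each, 32 bits still leaves more than a 1% chance of being wrong, and 34 bits is the first step that falls below it.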

In general, the rules for weighing “how much evidence it takes” follow a similar pattern: The larger the space of possibilities in which the hypothesis lies, or the more unlikely the hypothesis seems a priori compared to its neighbors, or the more confident you wish to be, the more evidence you need.

You cannot defy the rules; you cannot form accurate beliefs based on inadequate evidence. Let’s say you’ve got 10 boxes lined up in a row, and you start punching combinations into the boxes. You cannot stop on the first combination that gets beeps from all 10 boxes, saying, “But the odds of that happening for a losing combination are a million to one! I’ll just ignore those ivory-tower Bayesian rules and stop here.” On average, 131 losing tickets will pass such a test for every winner. Considering the space of possibilities and the prior improbability, you jumped to a too-strong conclusion based on insufficient evidence. That’s not a pointless bureaucratic regulation; it’s math.
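The ten-box figure can be checked directly. In this sketch the exact likelihood ratio of ten boxes is 4^10 = 1,048,576, which gives about 125 losing tickets per winner; rounding the ratio to "a million to one" gives the 131 quoted above.

```python
LOSING_TICKETS = 131_115_984

exact_ratio = 4 ** 10       # ten independent 4:1 boxes
rounded_ratio = 1_000_000   # the "million to one" approximation

false_passes_exact = LOSING_TICKETS / exact_ratio
false_passes_rounded = LOSING_TICKETS / rounded_ratio

print(round(false_passes_exact), round(false_passes_rounded))  # 125 131
```

Either way, a combination that passes all ten boxes is still far more likely to be a loser than the winner.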

Of course, you can still believe based on inadequate evidence, if that is your whim; but you will not be able to believe accurately. It is like trying to drive your car without any fuel, because you don’t believe in the fuddy-duddy concept that it ought to take fuel to go places. Wouldn’t it be so much more fun, and so much less expensive, if we just decided to repeal the law that cars need fuel?

Well, you can try. You can even shut your eyes and pretend the car is moving. But really arriving at accurate beliefs requires evidence-fuel, and the further you want to go, the more fuel you need.


I'd be happy to buy lots of lottery tickets that had a 1/132 chance of winning, given the typical payoff structure of lotteries of the kind you describe.

To act rationally, it isn't enough to arrive at the correct (probabilities of) beliefs; to act on a belief, the degree of belief you need in it might not be very great.

Given the strong tendency to collapse all degrees of belief into a two-point scale (yea or nay), I suspect that our intuitions about how much one has to believe in something in order to act accordingly are often too stringent, since the actual strengths of our beliefs are so often much too large.

(Note: "often" doesn't mean "always" or even "usually".)

Of course acting on beliefs is a decision theory matter. You don't have terribly much to lose by buying a losing lottery ticket, but you have a very large amount to gain if it wins, so yes 1/132 chance of winning sounds well worth \$20 or so.

This also shows why independently replicated scientific experiments (more independent boxes) are more important than experiments with high p-values (boxes with better likelihood ratios).

But the p-values go exponentially close to one with the size of the study. If you had three studies that used 11 boxes, vs. one with 33, you'd get exactly the same posterior probability for the ticket being a winner.

In other words, more experiments are exponentially more valuable than higher p-values, but higher p-values are exponentially cheaper.

Anders, I'm not sure I'd agree with that, because of publication bias. I'd feel much better about a single experiment that reported p < 0.001 than three experiments that reported p < 0.05.

Yes, publication bias matters. But it also applies to the p<0.001 experiment - if we have just a single publication, should we believe that the effect is true and just one group has done the experiment, or that the effect is false and publication bias has prevented the publication of the negative results? If we had a few experiments (even with different results) it would be easier to estimate this than in the one published experiment case.

Let's do a check. Assume a worst case scenario where nobody publishes false results at all.

To get three p < 0.05 studies if the hypothesis is false requires on average 60 experiments. This is a lot but is within the realms of possibility if the issue is one which many people are interested in, so there are still grounds for scepticism of this result.

To get one p < 0.001 study if the hypothesis is false requires on average 1000 experiments. This is pretty implausible, so I would be much happier to treat this result as an indisputable fact, even in a field with many vested interests (assuming everything else about the experiment is sound).
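The two averages in this check follow from a simple rule, sketched here under the assumption that each experiment on a false hypothesis independently passes the threshold with probability equal to that threshold. The function name is hypothetical.

```python
def expected_experiments(required_hits, alpha):
    """Mean number of independent null experiments needed to collect
    `required_hits` results that each pass the threshold `alpha`."""
    return required_hits / alpha

three_at_05 = expected_experiments(3, 0.05)    # about 60
one_at_001 = expected_experiments(1, 0.001)    # about 1000

print(round(three_at_05), round(one_at_001))
```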

To get one p < 0.0001 study if the hypothesis is false requires on average 1000 experiments

One too many zeros in the p value there. The 1,000 figure matches p<0.001, which is also what Anders mentioned. (So your point is fine.)

This is assuming proper methodology and statistics so that the p-value actually matches the chance of the result arising by chance. In practice, since even your best judgment of the methodology is not going to account for certainty in the soundness of the experiment, I would say that a p-value of 0.001 constitutes considerably less than 10 bits of evidence, because the odds that something was wrong with the experiment are better than the odds that the results were coincidental. Multiple experiments with lower cumulative p-value can still be stronger evidence if they all make adjustments to account for possible sources of error.

Running "1000 experiments" if you don't have to publish negative results, can mean just slicing data until you find something. Someone with a large data set can just do this 100% of the time.

A replication is more informative, because it's not subject to nearly as much "find something new and publish it" bias.

Erratum: In the beginning, there is 1 winning ticket and 131,115,984 losing tickets, so your odds of winning are 1:131,115,984.

Correct 1:131,115,985. (five as the last digit)

Sorry, ignore my erratum above, I was wrong. I mixed up odds and probability, they are different things.

Byrnema hosted an IRC Meeting about this post and I uploaded a transcript of the conversation on the wiki. If this was the wrong place to put the transcript let me know and I will move it.

The conversation went pretty well, in my opinion, and we plan on having a similar one next week.

The lottery is a good example, but the large numbers make it hard to follow the math without a calculator. Is there a simpler example you could add with lower numbers that we can hold in our heads?

It is convenient to measure evidence in bits - not like bits on a hard drive, but mathematician's bits, which are conceptually different. Mathematician's bits are the logarithms, base 1/2, of probabilities. For example, if there are four possible outcomes A, B, C, and D, whose probabilities are 50%, 25%, 12.5%, and 12.5%, and I tell you the outcome was "D", then I have transmitted three bits of information to you, because I informed you of an outcome whose probability was 1/8.

Here you say that bits = log(P(E|H)/P(E)). Everywhere else, you used bits = log(P(E|H)/P(E|!H)). They're very different.

Compare to this complaint heard in a fictitious physics classroom: "Now you say joules = 1/2 m v^2. But earlier you said joules = G m1 m2 / r and next you are going to say joules = m c^2. They are very different."

In the example I cited, P(I tell you outcome is D | outcome is D) = 1 and P(I tell you outcome is D | outcome is not D) = 0 (roughly). Thus log(P(E|H)/P(E)) = 3 and log(P(E|H)/P(E|!H)) = infinity. Log is base 1/2. Probability-bits and Odds-ratio-bits really are very different units, and Eliezer confusingly described them as the same thing. They are not interchangeable like 1/2 m v^2, G m1 m2 / r, and m c^2.

I may be missing something here (and the karma voting patterns suggest that I am). But I will repeat my claim - perhaps with more clarity:

Bits are bits, just as joules are joules. But just as you can use joules as a unit to quantify different kinds of energy (kinetic, potential, relativistic), you can use bits as a unit to quantify different kinds of information (log odds-ratio, log likelihood ratio, channel capacity (in some fixed amount of time), entropy of a message source). Each of these kinds of information is measured in the same unit - bits.

You can measure evidence in bits, and you can measure the information content of the answer to a question in bits. The two are calculated using different formulas, because they are different things. Just as potential and kinetic energy are different things.

You are correct that bits can be used to measure different things. The problem here is that probabilities and odds ratios describe the exact same thing in different ways. A joule of potential energy is not the same thing as a joule of kinetic energy, but they can be converted to each other at a 1:1 ratio. A probability-bit measures the same thing as an odds-ratio-bit, but is a different quantity (a probability-bit is always greater than 1 odds-ratio-bit, and can be up to infinity odds-ratio-bits). A "bit of evidence" does not unambiguously tell someone whether you mean probability-bit or odds-ratio-bit, and Eliezer does not distinguish between them properly.

1 probability bit in favor of a hypothesis gives you a posterior probability of 1/2^(n-1) from a prior of 1/2^n. n probability bits gives you a posterior of 1 from the same prior.

1 odds ratio bit in favor of a hypothesis gives you a posterior odds ratio of 1:2^(n-1) from a prior of 1:2^n. n odds ratio bits gives you a posterior odds ratio of 1:1 (probability 1/2) from the same prior. It takes infinity odds ratio bits to give you a posterior probability of 1.

As the prior probability approaches 0, the types of bits become interchangeable.
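A small sketch of the distinction, using made-up helper names. Both functions start from the same prior probability: one treats each bit as doubling the probability, the other treats each bit as doubling the odds. The first only makes sense while the result stays at or below 1.

```python
def posterior_from_probability_bits(prior_p, bits):
    """Each bit doubles the probability (valid while the result is <= 1)."""
    return prior_p * 2 ** bits

def posterior_from_odds_bits(prior_p, bits):
    """Each bit doubles the odds; the result is converted back to a probability."""
    odds = prior_p / (1 - prior_p) * 2 ** bits
    return odds / (1 + odds)

# Large prior: the two readings disagree sharply.
print(posterior_from_probability_bits(0.25, 2))  # 1.0
print(posterior_from_odds_bits(0.25, 2))         # about 0.571

# Tiny prior: the two readings nearly coincide.
tiny = 2 ** -20
print(posterior_from_probability_bits(tiny, 3))
print(posterior_from_odds_bits(tiny, 3))
```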

Clearly you understand me now, and I think that I understand you.

A "bit of evidence" does not unambiguously tell someone whether you mean probability-bit or odds-ratio-bit, and Eliezer does not distinguish between them properly.

OK, if what is at issue here is whether Eliezer was sufficiently clear, then I'll bow out. Obviously, he was not sufficiently clear from your viewpoint. I will say, though, that your comment is the first time I have seen the word "evidence" used by a Bayesian for anything other than a log odds ratio.

Log odds evidence has the virtue that it is additive (when independent). On the other hand, your idea of a log probability meaning of 'evidence' has the virtue that a question can be decided by a finite amount of evidence.

I will say, though, that your comment is the first time I have seen the word "evidence" used by a Bayesian for anything other than a log odds ratio.

Eliezer used it to mean log probability in the section that I quoted. That was what I was complaining about.

Ok, I think you are misinterpreting, but I see what you mean. When EY writes:

...I have transmitted three bits of information to you, because I informed you of an outcome whose probability was 1/8.

I take this as illustrating the definition of bits in general, rather than bits of 'evidence'. But, yes, I agree with you now that placing that explanation in a paragraph with that lead sentence promising a definition of 'evidence' - well it definitely could have been written more clearly.

Maybe I'm confused, but isn't log_2(131,115,984) about 26.9, and not greater than 27?

[This comment is no longer endorsed by its author]

Ok I see, so do you always just add one bit?

I thought the same. Did anyone ever reply?

You need *at least* 26.9 bits. Since the boxes he talked about provide 2 bits each, you need 14 boxes to get *at least* 26.9 bits (13 boxes would only be 26 bits, not enough). 14 boxes happens to be 28 bits.
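The rounding-up step can be written out as a short sketch, assuming each box contributes exactly 2 bits as described in the post.

```python
from math import ceil, log2

BITS_PER_BOX = 2   # a 4:1 likelihood ratio is log2(4) = 2 bits

bits_needed = log2(131_115_984)             # just under 27
boxes_needed = ceil(bits_needed / BITS_PER_BOX)
bits_delivered = boxes_needed * BITS_PER_BOX

print(f"{bits_needed:.2f}", boxes_needed, bits_delivered)
```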

Just to be clear for my sake, the log_2 of the likelihood ratio is how many bits that piece of evidence is worth?

edit: should I take no one correcting me as no one knowing, or being right?

131,115,985 to 1 [...] this amount of evidence would only be enough to give you an even chance of winning the lottery.

The number of false bleeps is distributed almost exactly Poisson with $\lambda=1$. The important figure is not the expected number of bleeps ($\mathbf{E}[x]+1$, which is indeed 2). It's the expected probability that a random bleep is the true one, $\mathbf{E}\left[\tfrac{1}{x+1}\right]$. At the moment I can't find an analytic solution (and a short search suggests none is known), but a computation shows the result is around 63.2%, much better than 50%. Similarly, with 14 boxes (arguably "28 bits of evidence"), the chance of winning is about 79.1% on average, much better than $\tfrac23$.
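The two percentages can be reproduced with a short sketch. It assumes the Poisson approximation used in this comment; the function names are hypothetical. Under that assumption the expectation works out to $(1-e^{-\lambda})/\lambda$, which follows from summing the Poisson series term by term, and the code checks this against a direct numerical sum.

```python
from math import exp

def chance_pick_is_winner(lam):
    """E[1/(X+1)] for X ~ Poisson(lam), via the closed form (1 - e^-lam)/lam."""
    return (1 - exp(-lam)) / lam

def chance_by_series(lam, terms=100):
    """Same expectation, summed directly over the Poisson probabilities."""
    total = 0.0
    pmf = exp(-lam)              # P(X = 0)
    for k in range(terms):
        total += pmf / (k + 1)
        pmf *= lam / (k + 1)     # advance to P(X = k + 1)
    return total

lam_even = 1.0                       # evidence of about 131,115,985 to 1
lam_14_boxes = 131_115_984 / 2 ** 28 # expected false bleeps with 14 boxes

print(f"{chance_pick_is_winner(lam_even):.3f}")      # 0.632
print(f"{chance_pick_is_winner(lam_14_boxes):.3f}")  # 0.791
```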

Let's say you've got 10 boxes lined up in a row, and you start punching combinations into the boxes. You cannot stop on the first combination that gets beeps from all 10 boxes, saying, "But the odds of that happening for a losing combination are a million to one! I'll just ignore those ivory-tower Bayesian rules and stop here." On average, 131 losing tickets will pass such a test for every winner

Huh?