Kicks open the door
Alright, here's the current state of affairs:
Or in other words, we suck. Lest anyone think I'm merely throwing stones, I screwed up Bayes the first time I tried to use it in public. I would not bet a lot on me getting any particular problem right. I suck too.
This version though? This I think most people could remember. I can do this version in my head. I've read a half-dozen explainers for Bayes, some with very nice pictures. This beats all of them, and it's in less than two hundred words! Maybe this is a case of Writing A Thousand Roads To Rome where this version happened to click with me but it's fundamentally just as good as many other versions. I suspect this is a simpler formulation.
Either someone needs to point out where this math is wrong, or I'm just going to use this version for myself and for explaining it to others. A much simpler version of the only non-commentary part of rationality seems a worthy use of Best of LessWrong space to me.
This version though? This I think most people could remember.
By most people you mean most people hanging around the LessWrong community because they know programming? I agree, an explanation that uses language the average programmer can understand seems like a good strategy for explaining Bayes' rule given the rationality community's demographics (above-average programmers).
Maybe this is a case of Writing A Thousand Roads To Rome where this version happened to click with me but it's fundamentally just as good as many other versions. I suspect this is a simpler formulation.
Was it the code or the example that helped? The code is mostly fine. I don't think it is any simpler than the explanations here, the notation just looks scarier.
Either someone needs to point out where this math is wrong, or I'm just going to use this version for myself and for explaining it to others
This version is correct for naive Bayes, but naive Bayes is in fact naive and can lead you arbitrarily astray. If you wanted a non-naive version you would write something like this in pseudopython:
for i, E in enumerate(EVIDENCE):
    YEP *= CHANCE OF E IF all(YEP, EVIDENCE[:i])
    NOPE *= CHANCE OF E IF all(NOPE, EVIDENCE[:i])
I see the case for starting with the naive version though, so this is more of a minor thing.
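As a rough runnable sketch of that non-naive loop: each likelihood is conditioned on the hypothesis and on the evidence already seen. The `likelihood` callback is hypothetical, the 10%/1% and 20%/50% figures are borrowed from the bear example, and the rule that a flipped tent is always scratched is an assumption made purely for illustration.

```python
from fractions import Fraction

def chained_update(prior, evidence, likelihood):
    """Multiply prior odds (yep, nope) by likelihoods conditioned on earlier evidence.

    `likelihood(e, hypothesis, seen)` is a hypothetical callback returning
    P(e | hypothesis and everything in `seen`).
    """
    yep, nope = prior
    for i, e in enumerate(evidence):
        seen = tuple(evidence[:i])
        yep *= likelihood(e, True, seen)
        nope *= likelihood(e, False, seen)
    return yep, nope

def likelihood(e, is_bear, seen):
    if e == "flipped":
        return Fraction(10, 100) if is_bear else Fraction(1, 100)
    if e == "scratched":
        if "flipped" in seen:
            # Assumption for illustration: a flipped tent is always scratched,
            # so the scratch carries no further information.
            return Fraction(1)
        return Fraction(20, 100) if is_bear else Fraction(50, 100)
    raise ValueError(f"unknown evidence: {e}")

yep, nope = chained_update(
    (Fraction(1), Fraction(100)), ["flipped", "scratched"], likelihood
)
print(yep / nope)  # 1/10, i.e. odds of 1:10
```

Under these made-up numbers the chained version lands on 1:10, whereas treating the two observations as independent would give 1:25.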
I don't see a lot more going for the bear example except for it being about something dramatic, so more memorable. Feels like you should be able to do strictly better examples. See Zane's objections in the other comment.
It's not the programming notation that makes it work for me (though that helps a little.) It's not the particular example either, though I do think it's a bit better than the abstract mammogram example. There's just way fewer numbers.
It's because the notation on each line contains two numbers, both of which are. . . primitives? atomic pieces? I can do them in one step. (My inner monologue goes something like "3:2 means the first thing happens three times for every two times the second thing happens, over the long run anyway. There's five balls in the bag, three of the first colour and two of the second. Now more balls than that, but keep the ratio.")
And then if I want to do an update, I just need four numbers, each of which makes sense on its own, each of which is used in one place. 1:100, 20:50, multiply the left by the left (20) and the right by the right (5000) and now I have two numbers again. (20:5000) I can usually simplify that in my head (2:500, okay now 1:250.) The line "The colon is a thick & rubbery barrier. Yep with yep and nope with nope" helps a lot, I'm reminded to keep all the yeps on the left and the nopes on the right. Because multiplication is associative and commutative, I can just keep doing that at each new update, never dealing with more than four numbers. If I'd rather (or if I'm using pen and paper) I can just write out a dozen updates and get the products after.
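A minimal sketch of that one-step update, using whole numbers only. The `update` name is made up for illustration. It multiplies left by left and right by right, then reduces the ratio.

```python
from math import gcd

def update(odds, likelihoods):
    """Multiply yep by yep and nope by nope, then reduce to lowest terms."""
    yep = odds[0] * likelihoods[0]
    nope = odds[1] * likelihoods[1]
    g = gcd(yep, nope)
    return yep // g, nope // g

print(update((1, 100), (20, 50)))  # (1, 250)

# The order of updates does not change the result.
a = update(update((1, 100), (20, 50)), (10, 1))
b = update(update((1, 100), (10, 1)), (20, 50))
print(a == b)  # True
```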
Compare this sucker:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}$$
Four numbers used in six places. I'll be the village idiot and admit I cannot reliably keep a phone number in my head without mental tricks. I have lost count of the number of times I have swapped P(A|B) and P(B|A) accidentally. The numbers aren't arranged on the page in a way that helps my intuition, like yeps being on top and nopes on the bottom or something.
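For anyone who wants to check that the two forms agree, here is a small sketch that runs one update both ways. It assumes the usual expanded statement of Bayes' theorem for the probability form, and the function names are invented for this example.

```python
from fractions import Fraction

def posterior_probability_form(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) from the expanded probability form of Bayes' theorem."""
    numerator = p_b_given_a * p_a
    return numerator / (numerator + p_b_given_not_a * (1 - p_a))

def posterior_odds_form(prior_odds, likelihoods):
    """The same posterior via odds: yep with yep, nope with nope."""
    yep = prior_odds[0] * likelihoods[0]
    nope = prior_odds[1] * likelihoods[1]
    return Fraction(yep, yep + nope)

# Prior odds 1:100 are the same thing as a prior probability of 1/101.
via_probability = posterior_probability_form(
    Fraction(1, 101), Fraction(20, 100), Fraction(50, 100)
)
via_odds = posterior_odds_form((1, 100), (20, 50))
print(via_probability, via_odds)  # 1/251 1/251
```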
Or compare the explanation at the first link you shared.
Bayes' rule in the odds form says that for every pair of hypotheses, their relative prior odds, times the relative likelihood of the evidence, equals the relative posterior odds.
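Let $\mathbf H$ be a vector of hypotheses $H_1, H_2, \ldots$ Because Bayes' rule holds between every pair of hypotheses in $\mathbf H$, we can simply multiply an odds vector by a likelihood vector in order to get the correct posterior vector:
$$\mathbb O(\mathbf H) \times \mathcal L_e(\mathbf H) = \mathbb O(\mathbf H \mid e)$$
where $\mathbb O(\mathbf H)$ is the vector of relative prior odds between all the $H_i$, $\mathcal L_e(\mathbf H)$ is the vector of relative likelihoods with which each $H_i$ predicted $e$, and $\mathbb O(\mathbf H \mid e)$ is the relative posterior odds between all the $H_i$.
In fact, we can keep multiplying by likelihood vectors to perform multiple updates at once:
$$\mathbb O(\mathbf H) \times \mathcal L_{e_1}(\mathbf H) \times \mathcal L_{e_2}(\mathbf H \wedge e_1) \times \mathcal L_{e_3}(\mathbf H \wedge e_1 \wedge e_2) = \mathbb O(\mathbf H \mid e_1 \wedge e_2 \wedge e_3)$$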
I am trying to express that I find that more complicated. I don't know what means. It took me a bit to remember what stands for. If you are ever trying to explain something to the general population and you need LaTeX to do it, stop what you are doing and come up with a new plan. Seven paragraphs into that page we get the odds form with the colon, and it's for three different hypotheses; I'm aware you can write odds like 3:2:4 but that's less common. Drunk people who flunked high school routinely calculate 3:2 in pubs! Start with the two-hypothesis version, then maybe mention that you can do three hypotheses at once. "Shortest Goddamn Bayes Guide Ever" uses strictly symbols on a standard keyboard and math which is within the limits of an on-track fourth grader. It's less than two hundred words! The thing would fit in three tweets!
I think that is a masterwork of pedagogy and editing, worthy of praise and prominent place.
If there's a way to make this version work for non-naive updates that seems good, and my understanding is it's mostly about saying for each new line "given that the above has happened, what are the odds of this observation?" instead of "what are the odds of this observation assuming I haven't seen the above"? It's not like the P(A|B) formulation prevents people from making that exact mistake. (Citation, I have made that exact mistake.)
Interesting! Makes sense.
If there's a way to make this version work for non-naive updates that seems good, and my understanding is it's mostly about saying for each new line "given that the above has happened, what are the odds of this observation?"
Yes that's it. Yeah I am not trying to defend the probability version of bayes rule. When I was trying to explain bayes rule to my wordcel gf, I was also using the odds ratio.
Yes, odds notation is the only sane way to do Bayes. Who cares about Bayes theorem written out in math. Just think about hypotheses and likelihoods. If you need to rederive the math notation start from thinking in odds and rederive what you would do to get a probability out of odds.
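Getting a probability back out of odds is one division: yep over yep plus nope. A small sketch with invented helper names:

```python
from fractions import Fraction

def odds_to_probability(yep, nope):
    """Odds of yep:nope mean yep out of every (yep + nope) cases."""
    return Fraction(yep, yep + nope)

def probability_to_odds(p):
    """Inverse direction: p = a/b becomes a : (b - a)."""
    p = Fraction(p)
    return p.numerator, p.denominator - p.numerator

print(odds_to_probability(3, 2))             # 3/5
print(probability_to_odds(Fraction(3, 5)))   # (3, 2)
print(odds_to_probability(190, 5))           # 38/39
```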
I do sure feel confused why so many people mess up Bayes. The core of bayesian reasoning is literally just asking the question of "what is the probability that I would see this evidence given each one of my hypotheses", or in the case of a reasonable null hypothesis and a hypothesized conjecture the question of "would I be seeing anything different if I was wrong/if this was just noise?".
To be clear, this is also what is in all the standard Bayes guides. Eliezer's Bayes guide both on Yudkowsky.net and Arbital.com is centrally about the odds notation.
"20% a bear would scratch my tent : 50% a notbear would"
I think the chance that your tent gets scratched should be strictly higher if there's a bear around?
It doesn't matter how often the possum would have scratched it. If your tent would be scratched 50% of the time in the absence of a bear, and a bear would scratch it 20% of the time, then the chance it gets scratched if there is a bear is 1-(1-50%)(1-20%), or 60%. Unless you're postulating that bears always scare off anything else that might scratch the tent.
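That 60% comes from treating the bear and the usual non-bear causes as independent chances to scratch the tent. A minimal sketch of the arithmetic, with independence as the stated assumption and `chance_any` as a made-up name:

```python
from fractions import Fraction

def chance_any(*chances):
    """P(at least one of several independent causes happens)."""
    none_happen = Fraction(1)
    for chance in chances:
        none_happen *= 1 - Fraction(chance)
    return 1 - none_happen

# Background scratching at 50%, plus a bear that scratches 20% of the time.
scratched_if_bear = chance_any("0.5", "0.2")
print(scratched_if_bear)  # 3/5
```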
Also, what about how some of these probabilities are entangled with each other? Your tent being flipped over will almost always involve your tent being scratched, so once we condition on the tent being flipped over, that screens off the evidence from the tent being scratched.
Also, only 95% chance a bear would look like a bear? And only 0.01% chance it would eat you?
Realistically, once we've seen a bear-shaped object scratch your tent, flip it over, and start eating you, you should be way more confident than 38 to 1 that you're being eaten.
I was thinking the bear would scare other stuff off yeah. But now I think I'm doing this wrong and the code is broken. Can you fix my code?
You can just try to estimate the base rate of a bear attacking your tent and eating you, then estimate the base rate of a thing that looks identical to a bear attacking your tent and eating you, and compare them. Maybe one in a thousand tents get attacked by a bear, and 1% of those tent attacks end with the bear eating the person inside. The second probability is a lot harder to estimate, since it mostly involves off-model surprises like "Bigfoot is real" and "there is a serial killer in these woods wearing a bear suit," but I'd have trouble seeing how it could be above one in a billion. (Unless we're including possibilities like "this whole thing is just a dream" - which actually should be your main hypothesis.)
In general, when you're dealing with very low or very high probabilities, I'd recommend you just try to use your intuition instead of trying to calculate everything out explicitly.* The main reason is this: if you estimate a probability as being 30% instead of 50%, it won't usually affect the result of the calculation that much. On the other hand, if you estimate a probability as being 1/10^5 instead of 1/10^6, it can have an enormous impact on the end result. However, humans are a lot better at intuitively telling apart 30% from 50% than they are at telling apart 1/10^5 from 1/10^6.
If you try to do explicit calculations about probabilities that are pretty close to 1:1, you'll probably get a pretty accurate result; if you try to do explicit calculations about probabilities that are several orders of magnitude away from each other, you'll probably be off by at least one order of magnitude. In this case, you calculated that even if a person on a camping trip is being eaten by something that looks identical to a bear, there's still about a 2.6% chance that it's not a bear. When you get a result that ridiculous, it doesn't mean there's a nonbear eating you, it means you're doing the math wrong.
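To see how much more a tail misestimate hurts, here is a sketch that reruns the original bear numbers twice: once misjudging the scratch chance as 30% instead of 20%, and once misjudging the eaten-by-a-bear chance as 0.1% instead of 0.01%. Both alternative figures are invented for the comparison.

```python
from fractions import Fraction

def odds_ratio(prior, likelihood_pairs):
    """Final yep/nope ratio after multiplying in each (yep, nope) pair."""
    yep, nope = Fraction(prior[0]), Fraction(prior[1])
    for l_yep, l_nope in likelihood_pairs:
        yep *= Fraction(l_yep)
        nope *= Fraction(l_nope)
    return yep / nope

# Percentages from the bear example, kept as strings so they stay exact.
BEAR = [("20", "50"), ("10", "1"), ("95", "1"), ("0.01", "0.001")]

baseline = odds_ratio((1, 100), BEAR)
mid_range_miss = odds_ratio((1, 100), [("30", "50")] + BEAR[1:])
tail_miss = odds_ratio((1, 100), BEAR[:3] + [("0.1", "0.001")])

print(baseline, mid_range_miss, tail_miss)  # 38 57 380
```

A ten-point slip in the middle of the range moves 38:1 to 57:1, while a one-order-of-magnitude slip in the tail moves it to 380:1.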
*The situations in which you can get useful information from an explicit calculation on low probabilities are situations where you're fine with being off by substantial multiplicative factors. Like, if you're making a business decision where you're only willing to accept a <5% chance of something happening, and you calculate that there's only a one in a trillion chance, then it doesn't actually matter whether you were off by a factor of a million to one. (Of course, you still do need to check that there's no way you could be off by an even larger factor than that.)
I'm not sure I'm following your actual objection. Is your point that this algorithm is wrong and won't update towards the right probabilities even if you keep feeding it new pieces of evidence, that the explanations and numbers for these pieces of evidence don't make sense for the implied story, that you shouldn't try to do explicit probability calculations this way, or some fourth thing?
If this algorithm isn't actually equivalent to Bayes in some way, that would be really useful for someone to point out. At first glance it seems like a simpler (to me anyway) way to express how making updates works, not just on an intuitive "I guess the numbers move that direction?" way but in a way that might not get fooled by e.g. the mammogram example.
If these explanations and numbers don't make exact sense for the implied story, that seems fine? "A train is moving from east to west at a uniform speed of 12 m/s, ten kilometers west a second train is moving west to east at a uniform speed of 15 m/s, how far will the first train have traveled when they meet?" is a fine word problem even if that's oversimplified for how trains work.
If you don't think it's worth doing explicit probability calculations this way, even to practice and try and get better or as a way to train the habit of how the numbers should move, that seems like a different objection and one you would have with any guide to Bayes. That's not to say you shouldn't raise the objection, but that doesn't seem like an objection that someone did the math wrong!
And of course maybe I'm completely missing your point.
Multiple points, really. I believe that this calculation is flawed in specific ways, but I also think that most calculations that attempt to estimate the relative odds of two events that were both very unlikely a priori will end up being off by a large amount. These two points are not entirely unrelated.
The specific problems that I noticed were:
And then the meta-problem: when you're multiplying together more than two or three probabilities that you estimated, particularly small ones, errors in your ability to estimate them start to add up. Which is why I don't think it's usually worthwhile to try and estimate probabilities like this.
But you have a fair point about it being a good idea to practice explicit calculations, even if they're too complicated to reliably get right in real life. So here's how I might calculate it:
P(bear encounters you): 1%.
P(tent scratched | bear): 60%, for the reasons I said above... unless we take into account it scaring away other tent-scratching animals, in which case maybe 40%.
P(tent flipped over | bear & tent scratched): 20%, maybe? I think if the bear has already taken an interest in your tent, it's more likely than usual to flip it over.
P(you see a bear-shaped object | bear & tent scratched & tent flipped over): Bears always look like bears. This is so close to 100% I wouldn't even normally include it in the calculation, but let's call it 99.99%.
P(you get eaten | bear & tent scratched & tent flipped over & you see a bear-shaped object): It's already been pretty aggressive so far, so I'd say perhaps 5%.
On the other side, there are almost no objects for which the probability of it looking exactly like a bear isn't infinitesimal; let's only consider Bigfoot and serial-killer-who's-a-furry for simplicity, then add them up.
P(Bigfoot exists): ...hmm. I am not an expert on the matter, but let's say 1%.
P(Bigfoot encounters you | Bigfoot exists): There can't be that many Bigfoots (Bigfeet?) out there, or else people would have caught one. 0.01%.
P(tent scratched | Bigfoot): Bigfeet are probably more aggressive than bears, so 70%.
P(tent flipped over | Bigfoot): Again, Bigfeet are supposed to be pretty aggressive, so 50%.
P(you see a bear-shaped object | Bigfoot & tent scratched & tent flipped over): Bigfoot looks similar enough to a bear that you'll almost certainly think he's a bear. 99%.
P(you get eaten | Bigfoot & tent scratched & tent flipped over & you see a bear-shaped object): Again, Bigfeet aggressive, 30%.
Then for the furry cannibal one:
P(furry cannibal stalking this forest): 0.000001% (that's one in a hundred million, if I got my zeroes right). I welcome you to prove me wrong on the matter by manually increasing the number of furry cannibals in a given forest.
P(furry cannibal encounters you | furry cannibal exists): How large of a forest is this? Well, he probably has his methods of locating prey, so let's say 10%. Wait, why did I assume he's a "he"? What gender is the typical furry cannibal? Probably a trans woman? Let's name this furry cannibal Susan.
P(tent scratched | Susan): Probably not that high; she doesn't want to wake you up too soon. 30%.
P(tent flipped over | Susan & tent scratched): She might just sneak in, but let's say 90%.
P(you see a bear-shaped object | Susan & tent scratched & tent flipped over): She's wearing a bear costume, as hypothesized; 99.99%.
P(you get eaten | Susan & tent scratched & tent flipped over & you see a bear-shaped object): Yes, of course this happens; this was her whole kink in the first place! 99%.
So for "bear," we have 1%*40%*20%*99.99%*5% = 0.004%. For "Bigfoot," we have 1%*0.01%*70%*50%*99%*30% = 0.00001%. For "Susan," we have 0.000001%*10%*30%*90%*99.99%*99% = .000000027%. Looks like Bigfoot was so much more likely than Susan that we can pretty much just forget the Susan possibility altogether. It's 0.004 to 0.00001, so 400 to 1 chance that you're being eaten by a bear.
(Although I actually think you should be even more confident than 400 to 1 that it's a bear rather than Bigfoot, and that I just was off by an order of magnitude for one reason or another, as happens when you're doing these sorts of calculations. And if you ever actually observe all of these things, the most likely hypothesis is that you're dreaming.)
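The three products above can be rechecked with exact arithmetic. This is only a sketch: `joint` is a made-up helper that multiplies a chain of percentages, and the inputs are the estimates listed in this comment.

```python
from fractions import Fraction
from math import prod

def joint(*percents):
    """Multiply a chain of percentages into one joint probability."""
    return prod(Fraction(p) / 100 for p in percents)

bear = joint("1", "40", "20", "99.99", "5")
bigfoot = joint("1", "0.01", "70", "50", "99", "30")
susan = joint("0.000001", "10", "30", "90", "99.99", "99")

# Relative odds of bear versus Bigfoot, before rounding to "400 to 1".
print(round(float(bear / bigfoot)))  # 385

# Probability of "bear" among the three hypotheses considered.
p_bear = bear / (bear + bigfoot + susan)
print(float(p_bear) > 0.997)  # True
```

Unrounded, the ratio comes out nearer 385 to 1, and the comment's 400 to 1 follows from rounding each product first.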
// ODDS = YEP:NOPE
YEP, NOPE = MAKE UP SOME INITIAL ODDS WHO CARES
FOR EACH E IN EVIDENCE
    YEP *= CHANCE OF E IF YEP
    NOPE *= CHANCE OF E IF NOPE
The thing to remember is that yeps and nopes never cross. The colon is a thick & rubbery barrier. Yep with yep and nope with nope.
bear : notbear =
1:100 odds to encounter a bear on a camping trip around here in general
* 20% a bear would scratch my tent : 50% a notbear would
* 10% a bear would flip my tent over : 1% a notbear would
* 95% a bear would look exactly like a fucking bear inside my tent : 1% a notbear would
* 0.01% chance a bear would eat me alive : 0.001% chance a notbear would
As you die you conclude 1*20*10*95*.01 : 100*50*1*1*.001 = 190 : 5 odds that a bear is eating you.
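For anyone who wants to run it, here is a minimal Python sketch of the same loop on the bear numbers. The name `bayes_odds` is invented here, and percentages are passed as strings so the arithmetic stays exact.

```python
from fractions import Fraction

def bayes_odds(prior, evidence):
    """Multiply yep by each chance-if-yep and nope by each chance-if-nope."""
    yep, nope = (Fraction(x) for x in prior)
    for chance_if_yep, chance_if_nope in evidence:
        yep *= Fraction(chance_if_yep)
        nope *= Fraction(chance_if_nope)
    return yep, nope

# (percent if bear, percent if notbear) for each observation, in order.
evidence = [
    ("20", "50"),       # tent scratched
    ("10", "1"),        # tent flipped over
    ("95", "1"),        # looks exactly like a bear
    ("0.01", "0.001"),  # eats me alive
]

yep, nope = bayes_odds((1, 100), evidence)
print(f"{yep} : {nope}")  # 190 : 5
```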