Kicks open the door
Alright, here's the current state of affairs:
Or in other words, we suck. Lest anyone think I'm merely throwing stones, I screwed up Bayes the first time I tried to use it in public. I would not bet a lot on me getting any particular problem right. I suck too.
This version though? This I think most people could remember. I can do this version in my head. I've read a half-dozen explainers for Bayes, some with very nice pictures. This beats all of them, and it's in less than two hundred words! Maybe this is a case of Writing A Thousand Roads To Rome where this version happened to click with me but it's fundamentally just as good as many other versions. I suspect this is a simpler formulation.
Either someone needs to point out where this math is wrong, or I'm just going to use this version for myself and for explaining it to others. A much simpler version of the only non-commentary part of rationality seems a worthy use of Best of LessWrong space to me.
This version though? This I think most people could remember.
By "most people" do you mean most people hanging around the LessWrong community, because they know programming? I agree, an explanation that uses language the average programmer can understand seems like a good strategy for explaining Bayes' rule, given the rationality community's demographics (above-average programmers).
Maybe this is a case of Writing A Thousand Roads To Rome where this version happened to click with me but it's fundamentally just as good as many other versions. I suspect this is a simpler formulation.
Was it the code or the example that helped? The code is mostly fine. I don't think it is any simpler than the explanations here; the notation just looks scarier.
Either someone needs to point out where this math is wrong, or I'm just going to use this version for myself and for explaining it to others
This version is correct for naive Bayes, but naive Bayes is in fact naive and can lead you arbitrarily astray. If you wanted a non-naive version you would write something like this in pseudopython:
for i, E in enumerate(EVIDENCE):
    YEP *= CHANCE OF E IF all(YEP, EVIDENCE[:i])
    NOPE *= CHANCE OF E IF all(NOPE, EVIDENCE[:i])
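For what it's worth, here's one way that loop could look as actual runnable Python, assuming you can hand it each conditional likelihood P(E | hypothesis, earlier evidence) directly as a number (the function and variable names here are mine, not from the post):

def update_odds(yep, nope, likelihoods):
    """likelihoods: (p_e_given_yep, p_e_given_nope) pairs, each already
    conditioned on all the evidence that came before it."""
    for p_yep, p_nope in likelihoods:
        yep *= p_yep    # yeps stay on the left
        nope *= p_nope  # nopes stay on the right
    return yep, nope

# 1:100 prior, then two observations
print(update_odds(1, 100, [(0.20, 0.50), (0.10, 0.01)]))  # (0.02, 0.5), i.e. 1:25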
I see the case for starting with the naive version though, so this is more of a minor thing.
I don't see a lot more going for the bear example except that it's about something dramatic, and so more memorable. Feels like you should be able to do strictly better examples. See Zane's objections in the other comment.
It's not the programming notation that makes it work for me (though that helps a little.) It's not the particular example either, though I do think it's a bit better than the abstract mammogram example. There's just way fewer numbers.
It's because the notation on each line contains two numbers, both of which are... primitives? atomic pieces? I can do them in one step. (My inner monologue goes something like "3:2 means the first thing happens three times for every two times the second thing happens, over the long run anyway. There's five balls in the bag, three of the first colour and two of the second. Maybe more balls than that, but keep the ratio.")
And then if I want to do an update, I just need four numbers, each of which makes sense on their own, each of which is used in one place. 1:100, 20:50, multiply the left by the left (20) and the right by the right (5000) and now I have two numbers again (20:5000). I can usually simplify that in my head (2:500, okay now 1:250). The line "The colon is a thick & rubbery barrier. Yep with yep and nope with nope" helps a lot, I'm reminded to keep all the yeps on the left and the nopes on the right. Because multiplication is associative, I can just keep doing that at each new update, never dealing with more than four numbers. If I'd rather (or if I'm using pen and paper) I can just write out a dozen updates and get the products after.
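To spell those mental steps out in runnable Python (a trivial sketch, using the numbers from the paragraph above):

from math import gcd

yep, nope = 1, 100                 # prior odds, 1:100
yep, nope = yep * 20, nope * 50    # one update on 20:50 evidence -> 20:5000

g = gcd(yep, nope)                 # simplify; same ratio either way
print(f"{yep // g}:{nope // g}")   # 1:250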
Compare this sucker:

P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|~A)P(~A))
Four numbers used in six places. I'll be the village idiot and admit I cannot reliably keep a phone number in my head without mental tricks. I have lost count of the number of times I have swapped P(A|B) and P(B|A) accidentally. The numbers aren't arranged on the page in a way that helps my intuition, like yeps being on top and nopes on the bottom or something.
Or compare the explanation at the first link you shared.
Bayes' rule in the odds form says that for every pair of hypotheses, their relative prior odds, times the relative likelihood of the evidence, equals the relative posterior odds.
Let H = (H1, H2, ..., Hn) be a vector of hypotheses. Because Bayes' rule holds between every pair of hypotheses in H, we can simply multiply an odds vector by a likelihood vector in order to get the correct posterior vector:

O(H) * L(e|H) = O(H|e)

where O(H) is the vector of relative prior odds between all the Hi, L(e|H) is the vector of relative likelihoods with which each Hi predicted e, and O(H|e) is the relative posterior odds between all the Hi.

In fact, we can keep multiplying by likelihood vectors to perform multiple updates at once:

O(H) * L(e1|H) * L(e2|H) = O(H|e1, e2)
I am trying to express that I find that more complicated. I don't know what half the symbols mean, and it took me a bit to remember what the others stand for. If you are ever trying to explain something to the general population and you need LaTeX to do it, stop what you are doing and come up with a new plan. Seven paragraphs into that page we get the odds form with the colon, and it's for three different hypotheses; I'm aware you can write odds like 3:2:4 but that's less common. Drunk people who flunked high school routinely calculate 3:2 in pubs! Start with the two-hypothesis version, then maybe mention that you can do three hypotheses at once. "Shortest Goddamn Bayes Guide Ever" uses strictly symbols on a standard keyboard and math which is within the limits of an on-track fourth grader. It's less than two hundred words! The thing would fit in three tweets!
I think that is a masterwork of pedagogy and editing, worthy of praise and prominent place.
If there's a way to make this version work for non-naive updates that seems good, and my understanding is it's mostly about saying for each new line "given that the above has happened, what are the odds of this observation?" instead of "what are the odds of this observation assuming I haven't seen the above"? It's not like the P(A|B) formulation prevents people from making that exact mistake. (Citation, I have made that exact mistake.)
Interesting! Makes sense.
If there's a way to make this version work for non-naive updates that seems good, and my understanding is it's mostly about saying for each new line "given that the above has happened, what are the odds of this observation?"
Yes, that's it. And yeah, I am not trying to defend the probability version of Bayes' rule. When I was trying to explain Bayes' rule to my wordcel gf, I was also using the odds ratio.
Yes, odds notation is the only sane way to do Bayes. Who cares about Bayes' theorem written out in math. Just think about hypotheses and likelihoods. If you need to rederive the math notation, start from thinking in odds and rederive what you would do to get a probability out of odds.
I sure do feel confused about why so many people mess up Bayes. The core of Bayesian reasoning is literally just asking the question "what is the probability that I would see this evidence given each one of my hypotheses", or, in the case of a reasonable null hypothesis and a hypothesized conjecture, the question "would I be seeing anything different if I was wrong / if this was just noise?".
To be clear, this is also what is in all the standard Bayes guides. Eliezer's Bayes guides, both on Yudkowsky.net and Arbital.com, are centrally about the odds notation.
I'm confused and alarmed that there is apparently some very large group of people who consider themselves rationalists but do not understand Bayes' theorem. (Bayesian statistics as a whole is a lot more complicated, of course, but Bayes' theorem is not the hard part.) It's not a particularly complicated piece of math! The core idea can definitely trip you up if you've never heard of it, but it's also not that deep and shouldn't be hard to explain. And even if you don't remember the exact formula, it should be very easy to rederive within a few minutes from first principles once you understand the core idea.
How did we end up in a world where a community that attracts people of above-average intelligence and education, that places a large emphasis on math and STEM, that worships a particular theorem for some reason and has produced dozen(s?) of explainers for that theorem, still ends up with the median member not understanding the theorem? I think the missing thing here is not the one true intuitive Bayes guide that will once and for all explain things the right way.
If I'm not alarmed, I at least consider it a problem, but I haven't felt confusion here for at least a year. I have a pretty good model of how it happens. Someone's doing some searching on the internet, and gets recommended a LessWrong article on Boston rents, or an AI paper, or hikers going missing. Maybe a friend recommended them a fun essay on miracles or a goofy Harry Potter fanfic. They hang around, read a few more things, comment a bit. Then they see a meetup announcement, and show up, and enjoy the conversation. (Very very roughly a third of LessWrong/ACX meetups are socials, with no or minimal readings or workshops.) They go to more meetups, they make more comments on the internet, maybe they make some posts of their own and their posts get upvoted. Maybe they step up and run the meetup when the previous organizer is sick or busy or moves.
At no point did someone give them a math test. I'm basically describing my arc above, and nobody asked me to solve a mammogram problem in that process.
That's how we end up in this world.
As for what the missing thing is: my theory is that to change this state of affairs, we'd need two things. We'd need to start actually, regularly asking folks questions where they'd need to use it, and we'd need an explanation fast and simple enough that it can survive being taught by non-specialists who are also juggling putting snacks out and getting the door for people. I love this version not for its intuitiveness, but for rearranging the numbers into a shape people can work with more easily.
I'd give much higher odds on members of the community being able to gesture at the key ideas of base rates and priors in English sentences! (Not as high as I'd like, but higher, anyway.) But that's not the same as being able to do the calculations. And there's something slippery about describing a piece of math in intuitive sentences then trying to use it as a heuristic without quite being able to actually run the numbers, which is why I'd like to change that.
Again, I suck too! I'm running around doing a dozen things in my day to day life, none of which is remedial math practice. This kind of thing happens a lot actually. Once upon a time I did some basic interviews for some software developers, and watched comp sci grads fail to fizzbuzz correctly.
My hope is that if I can somehow get a tweet or two worth of text that teaches the numbers in a way that fits within math people already do in their daily lives (multiplication of two to four numbers) and add a small battery of exercises that use it, I might be able to package that in a way local organizers not only could use but would spread. Like you say, maybe hoping for just one more Bayes explanation is not the path. To me, this one was a meaningful step simpler and easier.
I guess I'll note as well that I want to raise the sanity waterline. To do that, I can't work with a version that wants above average intelligence. I do genuinely want to figure out how to teach Bayes to fourth graders and then go out and teach some fourth graders. C'mon, don't you want to see what people turn out like if they have access to a better mental toolkit from a young age?
Also,
it's not a particularly complicated piece of math ... even if you don't remember the exact formula, it should be very easy to rederive within a few minutes from first principles once you understand the core idea.
I think you might be having an xkcd feldspar moment.
I'm 13 and I consider myself to have a sufficient understanding of Bayes, or at least I'd be able to write it out and use it in basic situations. This community seems to be filled with pretty smart people, though I haven't been to enough events to make a sustained argument on this. I find this guide to be even more confusing than learning the formula itself, but maybe that's just my perception.
Some cruxes for me:
I like this post a lot. But I wanted to second Morpheus' point that it's not actually quite correct.
Instead of this:
// ODDS = YEP:NOPE
YEP, NOPE = MAKE UP SOME INITIAL ODDS WHO CARES
FOR EACH E IN EVIDENCE
    YEP *= CHANCE OF E IF YEP
    NOPE *= CHANCE OF E IF NOPE

It should be:
// ODDS = YEP:NOPE
YEP, NOPE = MAKE UP SOME INITIAL ODDS WHO CARES
FOR EACH E IN EVIDENCE
    YEP *= CHANCE OF E IF {YEP & ALL PREVIOUSLY SEEN E}
    NOPE *= CHANCE OF E IF {NOPE & ALL PREVIOUSLY SEEN E}

This is true because
P(YEP | E1, E2, ..., EN) / P(NOPE | E1, E2, ..., EN)
= P(YEP, E1, E2, ..., EN) / P(NOPE, E1, E2, ..., EN)
= P(YEP) / P(NOPE)
  × P(E1 | YEP) / P(E1 | NOPE)
  × P(E2 | YEP, E1) / P(E2 | NOPE, E1)
  × ...
  × P(EN | YEP, E1, ..., E{N-1}) / P(EN | NOPE, E1, ..., E{N-1})

The recipe in the post is only true assuming that all the evidence is independent given YEP or NOPE. But that's rarely true, and assuming it is (in my view) the most common mistake that people actually make when trying to apply Bayesian reasoning in practice, leading to the kinds of crazy over-confident posteriors we see in things like the Rootclaim Covid-19 debate.
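To make the difference concrete, here's a toy numeric check in Python (every number here is invented for illustration): under NOPE the second observation is strongly correlated with the first, so the naive update overcounts it, while the chain-rule version doesn't.

# Toy model. Under YEP the two observations are independent;
# under NOPE the second is strongly correlated with the first.
p_h = {"YEP": 0.5, "NOPE": 0.5}        # prior
p_e1 = {"YEP": 0.8, "NOPE": 0.5}       # P(E1 | hypothesis)

def p_e2(h, e1):                        # P(E2 | hypothesis, E1)
    if h == "YEP":
        return 0.8                      # independent of E1 under YEP
    return 0.9 if e1 else 0.1           # correlated with E1 under NOPE

def joint(h, e1, e2):                   # P(h, E1=e1, E2=e2), via the chain rule
    p = p_h[h] * (p_e1[h] if e1 else 1 - p_e1[h])
    q = p_e2(h, e1)
    return p * (q if e2 else 1 - q)

# Exact posterior odds of YEP:NOPE after observing E1 and E2
exact = joint("YEP", True, True) / joint("NOPE", True, True)

# Naive update: pretend P(E2 | h) doesn't change once you've seen E1
p_e2_marg = {h: sum(joint(h, e1, True) for e1 in (True, False)) / p_h[h]
             for h in p_h}
naive = (p_e1["YEP"] * p_e2_marg["YEP"]) / (p_e1["NOPE"] * p_e2_marg["NOPE"])

print(f"exact {exact:.2f}:1, naive {naive:.2f}:1")  # exact 1.42:1, naive 2.56:1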
I reiterate that I like this post a lot! The point of the post isn't to make some novel mathematical contribution but to explain things in a vivid way. I think it succeeds there and with higher "production values" I think this post might have the potential to be wildly influential.
I guess that depends on what "ASSUME E" means. It's correct if interpreted the right way, but I think it's pretty ambiguous compared to the rest of the code, so personally I don't love it. I also think it's a bit confusing to use "ASSUME E" and also use "IF YEP" when those are both conditioning operations. Maybe it would be slightly better if you wrote something like
// ODDS = YEP:NOPE
YEP, NOPE = MAKE UP SOME INITIAL ODDS WHO CARES
FOR EACH E IN EVIDENCE
    YEP *= CHANCE OF E IF YEP AND ASSUMPTIONS
    NOPE *= CHANCE OF E IF NOPE AND ASSUMPTIONS
    ASSUME E // DO NOT DOUBLE COUNT

Or if you're OK using a little set notation:
// ODDS = YEP:NOPE
YEP, NOPE = MAKE UP SOME INITIAL ODDS WHO CARES
SEEN = {}
FOR EACH E IN EVIDENCE
    YEP *= CHANCE OF E IF {YEP} ∪ SEEN
    NOPE *= CHANCE OF E IF {NOPE} ∪ SEEN
    SEEN = SEEN ∪ {E} // DO NOT DOUBLE COUNT

Like my original proposal of YEP *= CHANCE OF E IF {YEP & ALL PREVIOUSLY SEEN E}, these are both (to me) unambiguous. But they're kind of clunky. I'm not sure if there is a good solution!
Please no set notation! The arrow brackets are on thin ice I think.
I meant what I said above, I think there's something really good about having a Bayes explanation that requires no symbols not on a standard keyboard and no math an on-track fourth grader wouldn't know.
(And also, thank you both for improving this! I recognize you two are the ones in the arena at the moment and I wish I was able to help refine this more.)
Another proposal! (I think I'm too close to this to really judge what's best.)
// ODDS = YEP:NOPE
YEP = MAKE SOMETHING UP WHATEVER LOL
ASSUME YEP
FOR EACH E IN EVIDENCE
    YEP *= HOW LIKELY IS E?
    ASSUME E
THROW ASSUMPTIONS IN TRASH
NOPE = MAKE SOMETHING UP WHATEVER LOL
ASSUME NOPE
FOR EACH E IN EVIDENCE
    NOPE *= HOW LIKELY IS E?
    ASSUME E

That is a clever way to succinctly say it. However, I worry that I only understood it because I was already aware of the concept. Perhaps I should show this to some smart friends with basic math chops who don't already know about the whole naive Bayes thing.
"20% a bear would scratch my tent : 50% a notbear would"
I think the chance that your tent gets scratched should be strictly higher if there's a bear around?
It doesn't matter how often the possum would have scratched it. If your tent would be scratched 50% of the time in the absence of a bear, and a bear would scratch it 20% of the time, then the chance it gets scratched if there is a bear is 1-(1-50%)(1-20%), or 60%. Unless you're postulating that bears always scare off anything else that might scratch the tent.
Also, what about how some of these probabilities are entangled with each other? Your tent being flipped over will almost always involve your tent being scratched, so once we condition on the tent being flipped over, that screens off the evidence from the tent being scratched.
Also, only 95% chance a bear would look like a bear? And only 0.01% chance it would eat you?
Realistically, once we've seen a bear-shaped object scratch your tent, flip it over, and start eating you, you should be way more confident than 38 to 1 that you're being eaten.
I was thinking the bear would scare other stuff off yeah. But now I think I'm doing this wrong and the code is broken. Can you fix my code?
You can just try to estimate the base rate of a bear attacking your tent and eating you, then estimate the base rate of a thing that looks identical to a bear attacking your tent and eating you, and compare them. Maybe one in a thousand tents get attacked by a bear, and 1% of those tent attacks end with the bear eating the person inside. The second probability is a lot harder to estimate, since it mostly involves off-model surprises like "Bigfoot is real" and "there is a serial killer in these woods wearing a bear suit," but I'd have trouble seeing how it could be above one in a billion. (Unless we're including possibilities like "this whole thing is just a dream" - which actually should be your main hypothesis.)
In general, when you're dealing with very low or very high probabilities, I'd recommend you just try to use your intuition instead of trying to calculate everything out explicitly.* The main reason is this: if you estimate a probability as being 30% instead of 50%, it won't usually affect the result of the calculation that much. On the other hand, if you estimate a probability as being 1/10^5 instead of 1/10^6, it can have an enormous impact on the end result. However, humans are a lot better at intuitively telling apart 30% from 50% than they are at telling apart 1/10^5 from 1/10^6.
If you try to do explicit calculations about probabilities that are pretty close to 1:1, you'll probably get a pretty accurate result; if you try to do explicit calculations about probabilities that are several orders of magnitude away from each other, you'll probably be off by at least one order of magnitude. In this case, you calculated that even if a person on a camping trip is being eaten by something that looks identical to a bear, there's still about a 2.6% chance that it's not a bear. When you get a result that ridiculous, it doesn't mean there's a nonbear eating you, it means you're doing the math wrong.
*The situations in which you can get useful information from an explicit calculation on low probabilities are situations where you're fine with being off by substantial multiplicative factors. Like, if you're making a business decision where you're only willing to accept a <5% chance of something happening, and you calculate that there's only a one in a trillion chance, then it doesn't actually matter whether you were off by a factor of a million to one. (Of course, you still do need to check that there's no way you could be off by an even larger factor than that.)
I'm not sure I'm following your actual objection. Is your point that this algorithm is wrong and won't update towards the right probabilities even if you keep feeding it new pieces of evidence, that the explanations and numbers for these pieces of evidence don't make sense for the implied story, that you shouldn't try to do explicit probability calculations this way, or some fourth thing?
If this algorithm isn't actually equivalent to Bayes in some way, that would be really useful for someone to point out. At first glance it seems like a simpler (to me, anyway) way to express how making updates works, not just in an intuitive "I guess the numbers move that direction?" way but in a way that might not get fooled by e.g. the mammogram example.
If these explanations and numbers don't make exact sense for the implied story, that seems fine? "A train is moving from east to west at a uniform speed of 12 m/s, ten kilometers west a second train is moving west to east at a uniform speed of 15 m/s, how far will the first train have traveled when they meet?" is a fine word problem even if that's oversimplified for how trains work.
If you don't think it's worth doing explicit probability calculations this way, even to practice and try and get better or as a way to train the habit of how the numbers should move, that seems like a different objection and one you would have with any guide to Bayes. That's not to say you shouldn't raise the objection, but that doesn't seem like an objection that someone did the math wrong!
And of course maybe I'm completely missing your point.
Multiple points, really. I believe that this calculation is flawed in specific ways, but I also think that most calculations that attempt to estimate the relative odds of two events that were both very unlikely a priori will end up being off by a large amount. These two points are not entirely unrelated.
The specific problems that I noticed were the ones from my earlier comment: the scratching odds pointing the wrong way, the pieces of evidence being entangled with each other (a flipped tent almost always means a scratched tent), and the too-low probabilities for a bear looking like a bear and then eating you.
And then the meta-problem: when you're multiplying together more than two or three probabilities that you estimated, particularly small ones, errors in your ability to estimate them start to add up. Which is why I don't think it's usually worthwhile to try and estimate probabilities like this.
But you have a fair point about it being a good idea to practice explicit calculations, even if they're too complicated to reliably get right in real life. So here's how I might calculate it:
P(bear encounters you): 1%.
P(tent scratched | bear): 60%, for the reasons I said above... unless we take into account it scaring away other tent-scratching animals, in which case maybe 40%.
P(tent flipped over | bear & tent scratched): 20%, maybe? I think if the bear has already taken an interest in your tent, it's more likely than usual to flip it over.
P(you see a bear-shaped object | bear & tent scratched & tent flipped over): Bears always look like bears. This is so close to 100% I wouldn't even normally include it in the calculation, but let's call it 99.99%.
P(you get eaten | bear & tent scratched & tent flipped over & you see a bear-shaped object): It's already been pretty aggressive so far, so I'd say perhaps 5%.
On the other side, there are almost no objects for which the probability of it looking exactly like a bear isn't infinitesimal; let's only consider Bigfoot and serial-killer-who's-a-furry for simplicity, then add them up.
P(Bigfoot exists): ...hmm. I am not an expert on the matter, but let's say 1%.
P(Bigfoot encounters you | Bigfoot exists): There can't be that many Bigfoots (Bigfeet?) out there, or else people would have caught one. 0.01%.
P(tent scratched | Bigfoot): Bigfeet are probably more aggressive than bears, so 70%.
P(tent flipped over | Bigfoot): Again, Bigfeet are supposed to be pretty aggressive, so 50%.
P(you see a bear-shaped object | Bigfoot & tent scratched & tent flipped over): Bigfoot looks similar enough to a bear that you'll almost certainly think he's a bear. 99%.
P(you get eaten | Bigfoot & tent scratched & tent flipped over & you see a bear-shaped object): Again, Bigfeet aggressive, 30%.
Then for the furry cannibal one:
P(furry cannibal stalking this forest): 0.000001% (that's one in a hundred million, if I got my zeroes right). I welcome you to prove me wrong on the matter by manually increasing the number of furry cannibals in a given forest.
P(furry cannibal encounters you | furry cannibal exists): How large of a forest is this? Well, he probably has his methods of locating prey, so let's say 10%. Wait, why did I assume he's a "he"? What gender is the typical furry cannibal? Probably a trans woman? Let's name this furry cannibal Susan.
P(tent scratched | Susan): Probably not that high; she doesn't want to wake you up too soon. 30%.
P(tent flipped over | Susan & tent scratched): She might just sneak in, but let's say 90%.
P(you see a bear-shaped object | Susan & tent scratched & tent flipped over): She's wearing a bear costume, as hypothesized; 99.99%.
P(you get eaten | Susan & tent scratched & tent flipped over & you see a bear-shaped object): Yes, of course this happens; this was her whole kink in the first place! 99%.
So for "bear," we have 1%*40%*20%*99.99%*5% = 0.004%. For "Bigfoot," we have 1%*0.01%*70%*50%*99%*30% = 0.00001%. For "Susan," we have 0.000001%*10%*30%*90%*99.99%*99% = .000000027%. Looks like Bigfoot was so much more likely than Susan that we can pretty much just forget the Susan possibility altogether. It's 0.004 to 0.00001, so 400 to 1 chance that you're being eaten by a bear.
(Although I actually think you should be even more confident than 400 to 1 that it's a bear rather than Bigfoot, and that I just was off by an order of magnitude for one reason or another, as happens when you're doing these sorts of calculations. And if you ever actually observe all of these things, the most likely hypothesis is that you're dreaming.)
Here's a visual description: imagine all worlds, before you see the evidence, cut into two groups: YEP and NOPE. The ratio of how many worlds are in each (aka probability mass, or size) represents the prior odds. Now you see some evidence E (e.g. a metal detector beeping), and we want to know the ratio after seeing it.
Each part of the prior cut produces worlds with E (e.g. worlds with beeps). The YEP side produces them in proportion to (chance of E if YEP), while the NOPE side produces them in proportion to (chance of E if NOPE).
And thus the new ratio is the product: YEP * (chance of E if YEP) : NOPE * (chance of E if NOPE).
In case you don't know what odds are, they express a ratio using a pair of numbers where the overall scale is irrelevant, e.g. 1:2 and 2:4 represent the same ratio. Probabilities are the values when you scale so that the sum over all outcomes is 1, so in this case 1:2 = 1/3 : 2/3 so the probabilities are 1/3, 2/3.
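In code, that rescaling is one line (a trivial sketch):

def odds_to_probabilities(odds):
    """Rescale a list of odds so they sum to 1."""
    total = sum(odds)
    return [x / total for x in odds]

print(odds_to_probabilities([1, 2]))  # [0.333..., 0.666...]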
In my opinion, the odds form is the superior form, because it's very easy to use and remember and "philosophically speaking" relative probabilityness is possibly more fundamental. Even at higher levels it's often more practical. I see it as a pedagogical mistake that Bayes' theorem is usually first explained in probability form - even on this site! Basic things should be deeply understood by ~everyone.
This seems like a low quality comment. First because there's no sentence or reasoning. Second because you say 0/8, and the voting is out of 9, which leaves me a little confused and wonder whether you mean something else entirely.
The thing to remember is that yeps and nopes never cross. The colon is a thick & rubbery barrier. Yep with yep and nope with nope.
bear : notbear =
1:100 odds to encounter a bear on a camping trip around here in general
* 20% a bear would then scratch my tent : 50% a notbear would
* 10% a bear would then flip my tent over : 1% a notbear would
* 95% a bear would then look exactly like a fucking bear inside my tent : 1% a notbear would
* 0.01% chance a bear would then eat me alive : 0.001% chance a notbear would
As you die you conclude 1*20*10*95*.01 : 100*50*1*1*.001 = 190 : 5 odds that a bear is eating you.
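For anyone who'd rather run it than read it, here's the bear example transcribed into Python (my transcription, with the percentages written as decimals, not the original author's code):

# Prior: 1:100 odds of encountering a bear on a camping trip around here
yep, nope = 1, 100

evidence = [           # (chance if bear, chance if notbear)
    (0.20, 0.50),      # scratches my tent
    (0.10, 0.01),      # flips my tent over
    (0.95, 0.01),      # looks exactly like a fucking bear inside my tent
    (0.0001, 0.00001), # eats me alive
]

for p_yep, p_nope in evidence:  # yep with yep, nope with nope, never cross
    yep *= p_yep
    nope *= p_nope

print(yep / nope)  # ~38.0, same ratio as 190:5 = 38:1 odds it's a bear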