Followup to: Illusion of Transparency: Why No One Understands You, Expecting Short Inferential Distances
A few years ago, an eminent scientist told me how he'd written an explanation of his field aimed at a much lower technical level than usual. He had thought it would be useful to academics outside the field, or even reporters. This ended up being one of his most popular papers within his field, cited more often than anything else he'd written.
The lesson was not that his fellow scientists were stupid, but that we tend to enormously underestimate the effort required to properly explain things.
He told me this, because I'd just told him about my experience publishing "An Intuitive Explanation of Bayesian Reasoning". This is still one of my most popular, most blogged, and most appreciated works today. I regularly get fan mail from formerly confused undergraduates taking statistics classes, and journalists, and professors from outside fields. In short, I successfully hit the audience the eminent scientist had thought he was aiming for.
I'd thought I was aiming for elementary school.
Today, when I look back at the Intuitive Explanation, it seems pretty silly as an attempt on grade school:
- It's assumed that the reader knows what a "probability" is.
- No single idea requires more than a single example.
- No homework problems! I've gotten several complaints about this.
(Then again, I get roughly as many complaints that the Intuitive Explanation is too long and drawn-out as complaints that it is too short. The current version does seem to be "just right" for a fair number of people.)
Explainers shoot way, way higher than they think they're aiming, thanks to the illusion of transparency and self-anchoring. We miss the mark by several major grades of expertise. Aiming for outside academics gets you an article that will be popular among specialists in your field. Aiming at grade school (admittedly, naively so) will hit undergraduates. This is not because your audience is more stupid than you think, but because your words are far less helpful than you think. You're way way overshooting the target. Aim several major gradations lower, and you may hit your mark.
PS: I know and do confess that I need to work on taking my own advice.
Addendum: With his gracious permission: The eminent scientist was Ralph Merkle.
This seems like a nice explanation of Bayes rule and of some parts of statistics that are often not well-presented and lead to confusion. But Bayesian inference is more than just Bayes rule. I don't see any discussion of the idea of probability as subjective belief in there, which is the core of a Bayesian view of probability. Bayes rule is just the machinery for updating beliefs, and as important as that is, frequentists use it too. Bayes rule only gains its huge significance when you have committed to the idea of probability as expressing degree of belief, which frees you to put probability distributions over parameters.
Of course the best way to align your aim and intended aim is feedback. If you have an audience at your target level you can talk to, you can find out via practice what works with them. This is an advantage teachers often have.
Very true, Robin.
Anonreader: Agreed, if I was rewriting it today I'd call it "An Intuitive Explanation of Bayes's Rule". Again, I overshot the mark - I thought that from understanding a single Bayesian update, the Bayesian Way would automatically follow.
when you have committed to the idea of probability as expressing degree of belief
Do you folks think that probability is properly understood only as an expression of degree of belief (and that, say, the other ideas about it are confused, circular, or whatever), or do you recognize the others while primarily being interested in degree of belief?
For my part, while I recognize the usefulness of employing the probability calculus to deal with degree of belief, I don't think it's the only use of the probability calculus. For example, I don't think we gain much by insisting on an interpretation of Buffon's needle in terms of degree of belief. I think it's comprehensible in the context of something like Buffon's needle to derive probability from geometric symmetries without routing through the notion of probability as degree of belief.
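As a rough illustration of deriving a probability from geometric symmetries, here is a minimal simulation sketch of Buffon's needle. The function name `buffon_estimate` and its parameters (`needle_len`, `line_gap`, `seed`) are made up for this example, and it assumes the needle is no longer than the gap between lines. The only probabilistic inputs are the two symmetries: the needle's centre is uniform across the gap, and its angle is uniform.

```python
import math
import random

def buffon_estimate(num_throws, needle_len=1.0, line_gap=1.0, seed=0):
    """Estimate the chance a needle crosses a line (assumes needle_len <= line_gap)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    crossings = 0
    for _ in range(num_throws):
        # Translation symmetry: distance from the needle's centre to the
        # nearest line is uniform on [0, line_gap / 2].
        centre = rng.uniform(0.0, line_gap / 2)
        # Rotation symmetry: acute angle between needle and lines is
        # uniform on [0, pi / 2].
        angle = rng.uniform(0.0, math.pi / 2)
        # The needle crosses when its half-length, projected across the
        # lines, reaches the nearest line.
        if centre <= (needle_len / 2) * math.sin(angle):
            crossings += 1
    return crossings / num_throws

# Closed-form answer from the same symmetries: 2 * needle_len / (pi * line_gap)
exact = 2 * 1.0 / (math.pi * 1.0)
estimate = buffon_estimate(100_000)
print(f"simulated: {estimate:.4f}, geometric: {exact:.4f}")
```

The simulated frequency should land close to the geometric value, with no step in the code that refers to anyone's degree of belief.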
In teaching or lecturing, I noticed my own tendency, which I have since tried to correct, to avoid going over anything I had known for a long time and only mention the latest tidbit I'd learned. I attributed that to my own desire to avoid boredom and seek out important new insights. Also, through the well-known cognitive bias of projecting one's own mental state on others, I was subconsciously trying not to bore the students.
Oops! (that last post was not intended to test anyone's psychic ability) The problem of Bayesian reasoning is in the setting of prior probability. There is some self correction built in, so it is a better system than most (or any other if you prefer), but a particular problem raises its ugly head that is relevant to overcoming bias. Suppose I want to discuss a particular phenomenon or idea with a Bayesian. Suppose this Bayesian has set the prior probability of this phenomenon or idea at zero. What would be the proper gradient to approach the subject in such a case?
that last post
I found it to be a model of brevity.
Douglas, if your interlocutor is a really consistent Bayesian and has a probability estimate of exactly zero for whatever-it-is then I advise you to talk to someone else instead. If (as is more commonly the case) their prior is merely very small, then what you need to do is to present them with evidence that brings their posterior probability high enough for them to think it worthy of further discussion.
This is in fact exactly the problem you face when discussing anything with anyone (Bayesian or not) who finds your position or some part of it wildly improbable. At least a Bayesian is (in principle) committed to taking appropriate note of evidence.
Constant, I think even the strictest subjective Bayesian should be happy to agree that in the presence of suitable symmetries the only sensible prior may be determined by those symmetries, and that in such cases you can save some mental effort by just talking about the symmetries. Just as you can talk about the axioms of number theory or logic or whatever even if you think they're really empirical generalizations rather than descriptions of Platonic Mathematical Reality, and usually when doing mathematics it's appropriate to do so.
the only sensible prior may be determined by those symmetries, and that in such cases you can save some mental effort by just talking about the symmetries
Okay, but that's really just treating other views of probability as convenient fictions, which isn't quite what I was hoping for.
I do not believe that the subjective interpretation of probability is a good match for the application of probability to the analysis of events that have already occurred. Conduct an experiment, observe (say) a normal distribution in some variable. That we subjectively expected a normal distribution prior to the run of the experiment, or that we now expect it in future runs, is all well and good, but it is not our expectation that accounts for the actual appearance of the normal distribution itself. If anything accounts for the actual appearance of a normal distribution, it is some property of the experimental setup, rather than an expectation of ours.
I am not denying that you can go ahead and talk about our degree of belief and the unique rational way to update that degree of belief. I think that's a perfectly legitimate topic. What I find unconvincing is the idea that that's all that we can profitably apply the probability calculus to. I did google "bayesian quantum" and didn't find anything that answered my concern, though I did find assertions of the very position that I have a problem with. I don't deny that you can approach quantum mechanics from a Bayesian perspective - obviously your beliefs can be informed by quantum mechanics and you can have beliefs about the results of experiments and so on. I simply think it leaves something out. At the end we have not only our own belief about the result of the next quantum experiment, arrived at rationally from our priors in combination with the experiments we have observed, but also the actual observed frequencies of past quantum experiments. These are out there, they are not subjective, and the mathematical theory of probability strongly recommends itself as a tool in the analysis of the observed frequencies.
You thought elementary school students wouldn't be completely overwhelmed by a question like this?:
"1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
What do you think the answer is? If you haven't encountered this kind of problem before, please take a moment to come up with your own answer before continuing."
Seriously, even down to the use of big words like "mammographies".
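For readers who want to check their answer to the quoted problem, here is a minimal sketch of the arithmetic using only the three figures given in the quote. The function name `posterior` and its parameter names are invented for this example.

```python
def posterior(prior, p_pos_given_cancer, p_pos_given_healthy):
    """Bayes's rule for a yes/no hypothesis and a single positive test."""
    # Fraction of all women who have cancer AND test positive
    true_pos = prior * p_pos_given_cancer
    # Fraction of all women who do not have cancer AND test positive
    false_pos = (1 - prior) * p_pos_given_healthy
    # Among all positive tests, the share that are true positives
    return true_pos / (true_pos + false_pos)

# Figures from the quoted problem: 1% prevalence, 80% hit rate,
# 9.6% false-positive rate.
p = posterior(0.01, 0.80, 0.096)
print(f"{p:.1%}")  # prints 7.8%
```

Most of the positive results come from the much larger group of women without cancer, which is why the answer is so far below 80%.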
Seriously, even down to the use of big words like "mammographies".
"Boob x-ray" would get the attention of half the class. Come to think of it, maybe the whole class, each half for a different reason.
What kind of elementary school did you go to?
Here is an excellent example in linguistics - my fiancee just sent this to me:
This link now goes to the main page of Birmingham university. Do you still have the thing you were trying to link to?
The Internet Archive knows all: http://web.archive.org/web/20081021081136/http://www.english.bham.ac.uk/who/myversion.htm
Douglas writes: Suppose I want to discuss a particular phenomenon or idea with a Bayesian. Suppose this Bayesian has set the prior probability of this phenomenon or idea at zero. What would be the proper gradient to approach the subject in such a case?
I would ask them for their records or proof. If one is a consistent Bayesian who expects to model reality with any accuracy, the only probabilities it makes sense to set as zero or one are empirical facts specified at a particular point in space-time (such as: "I made X observation of Y on Z equipment at W time") or statements within a formal logical system (which are dependent on assumptions and can be proved from those assumptions).
Even those kinds of statements are probably not legitimate candidates for zero/one probability, since there is always some probability, however minuscule, that we have misremembered, misconstrued the evidence, or missed a flaw in our proof. But I believe these are the only kinds of statements which can, even in principle, have probabilities of zero or one.
All other statements run up against possibilities for error that seem (at least to my understanding) to be embedded in the very nature of reality.
I took a glance at http://www.english.bham.ac.uk/who/myversion.htm ...
Wow, that's dense writing. I could follow it, but it took effort and I didn't care to put the kind of energy into studying it that one does with college textbooks.
The English link didn't seem particularly bad to me - its flaw was that it was deliberately too difficult for its subject matter, not that it was difficult in an absolute sense.
Josh, I don't think I'd have had any trouble following that problem at age 7, which is when I was taught to solve systems of equations.
Self-anchoring. I adjusted because I knew other grade-schoolers weren't as good at math, but, of course, I vastly underadjusted. I really had absolutely no concept of what it meant to not be good at math until a couple of years ago, and I probably still have no concept of it.
I can't speak for other Bayesians, but I prefer to use the idea that Bayesian probabilities encode "states of information" as opposed to "degrees of belief". To me, overcoming bias means making sure your beliefs reflect the actual information available to you, so I prefer to use a phrase which directs attention to that information immediately. This isn't my idea; it's one of the key ideas put forward by E. T. Jaynes. To get a sense of how this works in a geometric problem similar to the Buffon needle problem, I recommend Jaynes's paper The Well-Posed Problem.
Constant - thank you. Michael - I was indoctrinated into Bayes many years ago. I agree that the probability 0 is not a rational one. (Who would have guessed in 1900 that the things that seemed most certain were wrong?) Or perhaps I should say that the probability 0 (or 1) is not a scientific attitude - science is based on looking to know - the assumption of probability 0 is the excuse to not look. I'm thinking that is a difference between religion and science - science has to be wrong (so that it can advance) whereas religion has to be right (to be worthy of total faith). Hmmm, I like that.
You are aware that you are exceptional, aren't you? If I remember correctly, that question is more difficult than anything on the math SAT on which very few high schoolers get perfect scores. I agree with you about not understanding what it means to be "bad" at math. This makes it difficult to assess when one simple problem is more difficult than another. I may be wrong, but that question certainly feels like it would be difficult for most high schoolers let alone elementary schoolers.
Not only is that question definitely harder than anything on the SAT, but being asked to solve a problem before being taught how to do it would be nearly unheard of in any of my math classes. (I also notice "the usual precedence rules" are mentioned further down the page - I don't recall seeing those in school until 6th grade, in an advanced class.)
An important transformation of probabilities is the log odds, the logarithm of the odds of the probability, log(p/(1-p)). This has the advantage that you can simply add the log-likelihood ratio of the evidence to your prior log odds to get the revised log odds. Bayes becomes addition.
In the log odds, 1 comes out as positive infinity and 0 comes out as negative infinity.
Negative and positive infinity are not real numbers, and 0 and 1 are not probabilities.
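To make "Bayes becomes addition" concrete, here is a small sketch that redoes the mammography problem quoted earlier in the thread in log-odds form. The helper names `log_odds` and `from_log_odds` are invented for this example.

```python
import math

def log_odds(p):
    """Map a probability strictly between 0 and 1 to log(p / (1 - p))."""
    return math.log(p / (1 - p))

def from_log_odds(x):
    """Inverse of log_odds: map any finite real number back to a probability."""
    return 1 / (1 + math.exp(-x))

# Prior: 1% of women screened have breast cancer.
prior = log_odds(0.01)
# Evidence: a positive test is 0.80 / 0.096 times likelier with cancer
# than without it; take the log of that likelihood ratio.
evidence = math.log(0.80 / 0.096)
# The update itself is a single addition.
posterior = from_log_odds(prior + evidence)
print(f"{posterior:.1%}")  # prints 7.8%
```

Calling `log_odds(0.0)` or `log_odds(1.0)` fails outright (a math domain error and a division by zero, respectively), which is the code-level version of the point above: no finite amount of evidence added to a finite prior ever reaches either end.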
Eliezer, you are right. One word for probability one is "certainty" and a word for probability zero is "impossible". Could we say then that with the exception of a situation of perfect, complete knowledge of conditions (a situation that may or may not actually exist in reality) that the Bayesian worldview would not include those words?
"One word for probability one is "certainty" and a word for probability zero is "impossible"." I think you should be cautious about using these words like this, at least if you might be talking about uncountable probability spaces. Using your definitions it is certain that (say) a normally distributed variable takes a value in the real numbers but impossible for it to take any such value.
I hope this isn't unfairly pedantic to point out. I can see that one could argue that for decision making in the real world you only assign probabilities to finitely many outcomes.
Matt- I'm seldom careful. The advantages of being carefree are too numerous to list, but one of the disadvantages is that I have to admit mistakes. You are not being overly pedantic. I'd probably make a lousy Bayesian. (I wonder what the prior probability of that is?) By the way, what do you think of a decision making protocol that assumes that the data gathering is random?
Which paper was Merkle talking about, if I may ask?
I'm curious too. Anyone know which paper this is?
Given the information that it was Ralph Merkle, that it was about his field (=cryptography), that it was intended to be a general overview (so about cryptography broadly construed and not a single result or breakthrough), aimed at outsiders (so little math), it was highly cited (more than most of his papers), and no co-author is mentioned (and you wouldn't expect one for an 'explainer' like that), you should be able to make a good guess with a few seconds in Google Scholar after sorting his papers by citation-count and skimming the titles & summaries of each paper: https://scholar.google.com/scholar?start=10&q=author:%22ralph+merkle%22&hl=en&as_sdt=0,21
My guess would be that it's hit #8 on page 1, "Secure communications over insecure channels", Merkle 1978, written in 1975 simultaneously with his breakthrough work on public-key crypto, which indeed would need to be explained to a lot of people at the time. It's written in a very conversational tone, with historical background and discussion of practical issues and solutions like sending secret keys in the mail, extremely few equations or math (but imperative program pseudocode instead), published in a more general interest publication than the usual cryptography journals, and despite not presenting any new results & being ancient, is still apparently his 8th most cited paper ever out of 162 hits for him as author.
This explains why so many textbooks are so badly written. The authors were aiming too high.
Oh bugger. I just got linked here while preparing a talk for next Tuesday.
I found a comment on HN adding more evidence to this phenomenon:
"Ten Lessons I wish I Had Been Taught", Gian-Carlo Rota 1997:
"Ten Lessons for the Survival of a Mathematics Department":
Correct link: https://www.yudkowsky.net/rational/bayes