by neq1

# 49

(Bayes' theorem is something Bayesians need to use more often than Frequentists do, but Bayes' theorem itself isn't Bayesian. This post is meant to be a light introduction to the difference between Bayes' theorem and Bayesian data analysis.)

Bayes' Theorem

Bayes' theorem is just a way to get (e.g.) p(B|A) from p(A|B) and p(B).  The classic example of Bayes' theorem is diagnostic testing.  Suppose someone either has the disease (D+) or does not have the disease (D-) and either tests positive (T+) or tests negative (T-).  If we knew the sensitivity P(T+|D+), specificity P(T-|D-) and disease prevalence P(D+), then we could get the positive predictive value P(D+|T+) using Bayes' theorem:

P(D+|T+) = P(T+|D+) P(D+) / [ P(T+|D+) P(D+) + (1 - P(T-|D-)) (1 - P(D+)) ]

For example, suppose we know the sensitivity=0.9, specificity=0.8 and disease prevalence is 0.01.  Then,

P(D+|T+) = (0.9 × 0.01) / (0.9 × 0.01 + 0.2 × 0.99) = 0.009 / 0.207 ≈ 0.043

This answer is not Bayesian or frequentist; it's just correct.
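The arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name `ppv` and its argument names are illustrative assumptions, not something from the post.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value P(D+|T+) from Bayes' theorem."""
    true_pos = sensitivity * prevalence                    # P(T+|D+) P(D+)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(T+|D-) P(D-)
    return true_pos / (true_pos + false_pos)

# Values from the example: sensitivity 0.9, specificity 0.8, prevalence 0.01
result = ppv(0.9, 0.8, 0.01)
print(round(result, 3))  # 0.043
```

Even with a fairly accurate test, the low prevalence keeps the positive predictive value small, because false positives from the large disease-free group outnumber true positives.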

Diagnostic testing study

Typically we will not know P(T+|D+) or P(T-|D-).  We would consider these unknown parameters.  Let's denote them by Θsens and Θspec.  For simplicity, let's assume we know the disease prevalence P(D+) (we often have a lot of data on this).

Suppose 1000 subjects with the disease were tested, and 900 of them tested positive.  Suppose 1000 disease-free subjects were tested and 200 of them tested positive.  Finally, suppose 1% of the population has the disease.

Frequentist approach

Estimate the 2 parameters (sensitivity and specificity) using their sample values (sample proportions) and plug them into Bayes' formula above.  This results in a point estimate for P(D+|T+) of 0.043.  A standard error or confidence interval could be obtained using the delta method or bootstrapping.

Even though Bayes' theorem was used, this is not a Bayesian approach.
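One way the plug-in estimate and a bootstrap interval might be computed is sketched below. It assumes a simple parametric bootstrap with a percentile interval (one of several reasonable choices), and the function names, number of resamples and seed are illustrative assumptions rather than anything specified in the post.

```python
import random

def ppv(sens, spec, prev):
    """Plug-in positive predictive value from Bayes' theorem."""
    tp = sens * prev
    fp = (1.0 - spec) * (1.0 - prev)
    return tp / (tp + fp)

def resample_proportion(p_hat, n, rng):
    # Draw n Bernoulli(p_hat) outcomes and return the resampled proportion.
    return sum(rng.random() < p_hat for _ in range(n)) / n

def bootstrap_ppv_interval(n_boot=1000, seed=0):
    """Percentile interval from a parametric bootstrap (an assumed choice)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        sens = resample_proportion(900 / 1000, 1000, rng)  # 900 of 1000 diseased tested positive
        spec = resample_proportion(800 / 1000, 1000, rng)  # 800 of 1000 disease-free tested negative
        draws.append(ppv(sens, spec, 0.01))
    draws.sort()
    return draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot) - 1]

point_estimate = ppv(900 / 1000, 800 / 1000, 0.01)
lower, upper = bootstrap_ppv_interval()
print(round(point_estimate, 3), round(lower, 3), round(upper, 3))
```

No prior appears anywhere in this calculation: the uncertainty comes entirely from resampling the data.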

Bayesian approach

The Bayesian approach is to specify prior distributions for all unknowns.  For example, we might specify independent uniform(0,1) priors for Θsens and  Θspec.  However, we should expect the test to do at least as good as guessing (guessing would mean randomly selecting 1% of people and calling them T+). In addition, we expect Θsens>1-Θspec. So, I might go with a Beta(4,2.5) distribution for Θsens and Beta(2.5,4) for Θspec:

Using these priors + the data yields a posterior distribution for P(D+|T+) with posterior median 0.043 and 95% credible interval (0.038, 0.049).  In this case, the Bayesian and frequentist approaches have the same results (not surprising since the priors are relatively flat and there are a lot of data).  However, the methodology is quite different.
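The posterior summary can be approximated by simulation. The sketch below assumes the priors exactly as stated (Beta(4,2.5) on Θsens, Beta(2.5,4) on Θspec), uses the fact that a Beta prior with binomial data gives a Beta posterior, and draws Monte Carlo samples; the variable names, seed and number of draws are illustrative assumptions.

```python
import random
import statistics

rng = random.Random(1)
PREVALENCE = 0.01  # treated as known, as in the post

def posterior_ppv_draws(n_draws=20000):
    draws = []
    for _ in range(n_draws):
        # Conjugate update: Beta(a, b) prior + s successes, f failures -> Beta(a + s, b + f)
        sens = rng.betavariate(4 + 900, 2.5 + 100)   # 900 of 1000 diseased tested positive
        spec = rng.betavariate(2.5 + 800, 4 + 200)   # 800 of 1000 disease-free tested negative
        tp = sens * PREVALENCE
        fp = (1.0 - spec) * (1.0 - PREVALENCE)
        draws.append(tp / (tp + fp))                 # Bayes' theorem applied to each draw
    return sorted(draws)

draws = posterior_ppv_draws()
median = statistics.median(draws)
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(round(median, 3), round(lower, 3), round(upper, 3))
```

With 1000 observations per group, the data dominate these priors, which is why the result should land close to the frequentist estimate.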

Example that illustrates benefit of Bayesian data analysis

(example edited to focus on credible/confidence intervals)

Suppose someone shows you what looks like a fair coin (you confirm heads on one side, tails on the other) and makes the claim:  "This coin will land with heads up 90% of the time"

Suppose the coin is flipped 5 times and lands with heads up 4 times.

Frequentist approach

"A 95% confidence interval for the Binomial parameter is (.38, .99) using the Agresti-Coull method."  Because 0.9 is within the confidence limits, the usual conclusion would be that we do not have enough evidence to rule it out.

Bayesian approach

"I don't believe you.  Based on experience and what I know about the laws of physics, I think it's very unlikely that your claim is accurate. I feel very confident that the probability is close to 0.5.  However, I don't want to rule out something a little bit unusual (like a probability of 0.4).  Thus, my prior for the probability of heads is a Beta(30,30) distribution."

After seeing the data, we update our belief about the binomial parameter.  The 95% credible interval for it is (0.40, 0.64).  Thus, a value of 0.9 is still considered extremely unlikely.
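The update is a conjugate one: a Beta(30,30) prior combined with 4 heads and 1 tail gives a Beta(34,31) posterior. The sketch below approximates the 95% credible interval by Monte Carlo; the variable names, seed and number of draws are illustrative assumptions.

```python
import random

rng = random.Random(2)

prior_a, prior_b = 30, 30        # Beta(30, 30) prior on the probability of heads
heads, tails = 4, 1              # observed flips
post_a, post_b = prior_a + heads, prior_b + tails   # Beta(34, 31) posterior

# Approximate the equal-tailed 95% credible interval from posterior draws
draws = sorted(rng.betavariate(post_a, post_b) for _ in range(50000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]

# Share of posterior draws at or above the claimed value of 0.9
prob_at_least_09 = sum(d >= 0.9 for d in draws) / len(draws)
print(round(lower, 2), round(upper, 2), prob_at_least_09)
```

Five flips barely move a prior worth roughly 60 flips of evidence, so the posterior stays concentrated near 0.5.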

This illustrates the idea that, from a Bayesian perspective, implausible claims require more evidence than plausible claims.  Frequentists have no formal way of including that type of prior information.


I see so much on the site about Bayesian probability. Much of my current work uses Dempster-Shafer theory, which I haven't seen mentioned here.

DST is a generalization of Bayesian probability, and both fuzzy logic and Bayesian inference can be perfectly derived from DST. The most obvious difference is that DST parameterizes confidence, so that a 0.5 prior with no support is treated differently than a 0.5 prior with good support. For my work, the more important aspect is that DST is more forgiving when my sensors lie to me; it handles conflicting evidence more gracefully, as long as its results are correctly interpreted (in my opinion they are less intuitive than strict probabilities).

Joseph Felsenstein is a pioneer in the use of maximum likelihood methods in evolutionary biology. In his book, "Inferring Phylogenies", he has a chapter on Bayesian methods, and a section in that chapter on controversies over Bayesian inference. He discusses a toy example of a space probe to Mars which looks for little green men and doesn't find them. He wonders whether a scientist whose prior for little green men involved odds of 1/4, and who, based on the evidence of the space probe, now assigns odds of 1/12, should publish those revised odds. He writes:

It might be argued that the correct thing to do in such a case is to publish the likelihood ratio 1/3 and let the reader provide their own prior. This is the likelihoodist position. A Bayesian is defined, not by using a prior, but by being willing to use a controversial prior.

Felsenstein apparently defines himself as a "likelihoodist" rather than a "frequentist" or "Bayesian".

Cyan190

"Likelihoodist" is so clunky and dull. I prefer "likelihoodlum" -- it's just as clunky, but at least it's somewhat inflammatory.

7TobyBartels
There are two slightly different meanings of what it is to be a "Bayesian": philosophically, there is a Bayesian interpretation of probability theory, and practically, there are Bayesian methods in statistics. I see Felsenstein as saying that, even if one is a Bayesian philosophically, one ought to practise as a likelihoodist. In original research, I agree; there is not much point in reporting posteriors. Certainly there's no point in reporting them without also reporting the original priors, but better just to report the likelihoods and let readers supply their own priors. On the other hand, in summaries for a broad readership, the posteriors are the most important result to report. Now most readers don't have the expertise to bring their own priors, so you have to give them yours. And then do the calculation for them.

On the other hand, in summaries for a broad readership, the posteriors are the most important result to report. Now most readers don't have the expertise to bring their own priors, so you have to give them yours. And then do the calculation for them.

Good point. It would be irresponsible to publish a news item that "the Prime Minister's support for this bill is three times more likely if he is, in fact, a lizard alien than if he is a human" without noting that the prior probability for him being a lizard alien is pretty low.

0pozorvlak
And yet they do this all the frigging time in medical stories, as documented extensively on, for instance, Bad Science.
4Perplexed
I'm curious, though, as to what all you are giving up by not talking about priors. In Felsenstein's field - roughly, constructing best estimates of the "tree of life" - you very frequently have prior information which you want to bring to the problem, but of course you don't want to bring in any prior information which is not neutral on the controversial issue that your study is supposed to shed light on. One of the advantages of a Bayesian methodology is supposed to be the ability to combine information from sources with different qualities and coverages. To what extent are you prevented from doing that if you insist on doing all of your likelihood ratio work behind a "veil of ignorance"?
1TobyBartels
5Eliezer Yudkowsky
"Maximum likelihood" totally != "report likelihood ratios".
6Perplexed
Yes, I know, as I'm sure does Felsenstein. The book covered much more than maximum likelihood. The recommendation to report likelihood ratios came in the first of two chapters on Bayesian methods. The second involved hidden Markov models. The book begins (as does the field) with a tree-building method called 'maximum parsimony'. Maximum likelihood is a step up in sophistication from that, and Felsenstein is largely responsible for that step forward. I'm not really sure why he is not an enthusiastic Bayesian. My guess would be that it is because he is a professional statistician and the whole discipline of statistics traditionally consists of ways of drawing totally objective conclusions from data.
5JGWeissman
This position has also been expressed here. (Should this article be migrated from OB to LW?)

Good post. It might be worth mentioning, after noting the benefit of the Bayesian approach, the deep problem with it: the subjectivity of the prior. This can be illustrated by looking at one of the above paragraphs:

Based on experience and what I know about the laws of physics, I think it's very unlikely that your claim is accurate. I feel very confident that the probability is close to 0.5. However, I don't want to rule out something a little bit unusual (like a probability of 0.4).

It's fine for Bayesian A to choose a prior based on his experience, feelings, and intuitive reasoning. But it's equally reasonable for Bayesian B to assert a completely different prior for her own reasons. Then they will come to different conclusions based on identical evidence.

Except that they're actually using different evidence, because the intuitive feelings that go into a prior are, in fact, evidence. More specifically, the prior is the sum of the outputs of all the heuristics that aren't quite strong or reliable enough to reach conscious awareness, but which nevertheless do, and ought to, influence the probability judgment.

I like this post; there's still a lot of confusion around Bayesian methods.

Two things that would have helped me while I was learning Bayesianism were that:

The frequency with which a coin comes up heads isn't a probability, no matter how much it looks like one.*

and

Bayes' theorem doesn't do parameter estimation.

I might write these into a post sometime.

*This is what's going wrong in the heads of people who say things like "The probability is either 1 or 0, but I don't know which."

3TobyBartels
Pedantry alert: This is not technically true, although it's still a very important point. Every frequency is the probability of something; in this case, the frequency with which the coin comes up heads is the probability, given that you pick one of the times that the coin is flipped, that the coin comes up heads that time. But this is not the same thing as the probability that the coin comes up heads the next time that you flip it, which is what you are more likely to be interested in (and which people are liable to uselessly claim is "either 1 or 0, but I don't know which").
3wedrifid
Even more pedantic: It still isn't a probability. There is a probability p(heads | you picked one of those coins) and it can be found by simply taking the frequency. But the frequency still doesn't mean the probability. In much the same way 5 / 100 balls in the jar being red isn't a probability. It is a curious fact about the colors of balls in the jar. p(ball is red | I take a ball from that jar) is a probability.
6TobyBartels
I think that our pedantries are clashing on the word "is". I'm thinking of both frequencies and probabilities as numbers, and using "is" between them if they are equal numbers. You are (I guess) thinking of frequencies and probabilities as things of different types, which are not numbers even though they may be measured by numbers. Come to think of it, your interpretation is more pedantic than mine, so I concede.
0Oscar_Cunningham
Thinking about it further, there is no probability which is even numerically equal to the frequency. Probabilities are subjective, you know them or can work them out in your head. But you don't know the frequency, so it can't be equal to any of the probabilities in your head (except by coincidence).
2TobyBartels
I think that it's a mistake to reserve the term ‘probability’ for beliefs held by actual people (or other beings with beliefs). In fact, since actual people are subject to such pervasive epistemic biases (such as we try to overcome here), I doubt that anybody (even readers of Less Wrong) holds actual beliefs that obey the mathematical laws of probability. I prefer to think of probability as the belief of an ideal rational being with given information / evidence / observations. (This makes me what they call an ‘objective Bayesian’, although really it just pushes the subjectivity back to the level of information.) So even if nobody knows the frequency with which a given coin comes up heads (which is certainly true if the coin is still around and may be flipped in the future), I can imagine a rational being who knows that frequency. But in a post that was supposed to be pedantic, I was remiss in not specifying exactly what information the probability depends on!
1Oscar_Cunningham
Thanks, this clears some things up for me.
0TobyBartels
You're welcome!
1cousin_it
I don't understand how you can hold a position like that and still enjoy the post. How do you parse the phrase "my prior for the probability of heads" in the second example?
4neq1
In the second example the person was speaking informally, but there is nothing wrong with specifying a probability distribution for an unknown parameter (and that parameter could be a probability for heads)
2Oscar_Cunningham
I hadn't seen that, but you're right that that sentence is wrong. "Probability" should have been replaced with "frequency" or something. A prior on a probability would be a set of probabilities of probabilities, and would soon lead to infinite regress.
2neq1
only if you keep specifying hyper-priors, which there is no reason to do
0Oscar_Cunningham
Exactly. There's no point in the first meta-prior either.

If the null hypothesis was true, the probability that we would get 3 heads or less is 0.08

Is the idea that the coin will land heads 90% of the time really something that can be called the "null hypothesis"?

Statistically, there's nothing wrong with the null hypothesis being p=0.9. It's probably not a test you would see very often in practice because usually there is nothing interesting about p=0.9. But if you wanted to test whether or not p=0.9 for some reason - any reason, setting the null hypothesis as p=0.9 is a perfectly valid (frequentist) way of doing it.
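The quoted tail probability can be reproduced directly from the binomial distribution with n = 5 flips and a null value of p = 0.9. This is a minimal sketch; the helper name `binom_cdf` is an illustrative assumption.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Probability of 3 or fewer heads in 5 flips if the coin really lands heads 90% of the time
p_three_or_fewer = binom_cdf(3, 5, 0.9)
print(round(p_three_or_fewer, 2))  # 0.08
```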

0neq1
Hm, good point. Since the usual thing is .5, the claim should be the alternative. I was thinking in terms of trying to reject their claim (which it wouldn't take much data to do), but I do think my setup was non-standard. I'll fix it later today
[anonymous]50

I don't quite get how you got your distributions in the first example. Since getting the prior distributions seems to be where the magic happens, further explanation would be helpful.

THANK YOU! That's the best explanation I've ever seen of the difference. I don't know if it's right; but at least it's making a coherent claim.

Can you spell out how the computation is done with the priors in the Bayesian case?

Quibble:

However, we should expect the test to do at least as good as guessing (guessing would mean randomly selecting 1% of people and calling them T+).

Guessing that everyone is T- would have a lower error rate.

Guessing that everyone is T- results in a 100% false negative rate, which although not much better than a 99% false negative rate, might more than make up for a 1% decrease in the false positive rate. If this is a real cancer test, and the researcher is optimizing a balance between false positives and false negatives, where would you prefer that he or she place that balance? A lot of medical tests have intentionally very low false negative rates even if that means they have proportionally much higher false positive rates (than they would if they were optimizing for a different balance).

May be relevant here: my post and Wei Dai's post.

This illustrates the idea that, from a Bayesian perspective, implausible claims require more evidence than plausible claims. Frequentists have no formal way of including that type of prior information.

Yeah, the best they can do is require a stricter significance level. (IIRC, the justification for the standard "statistical significance" test involves something like taking the position of maximum ignorance by assuming that the null hypothesis is 50% likely to be true and then applying Bayes's Theorem to get a posterior probability...)

The key feature of Bayesian probability is the concept that probabilities are degrees of belief. The reason that Bayes' Theorem is associated with the Bayesian approach is that once you are willing, unlike frequentists, to have a probability that a coin comes up heads in 90% of flips, you can use Bayes' Theorem to update that probability on observing the evidence of 4 flips resulting in heads and 1 flip resulting in tails.

neq1,

Thanks for the great learning. One apparent inconsistency puzzled me a bit:

You go out of your way to say that you expect Θsens>1-Θspec (which looks sensible to me), then follow up by choosing a pair of priors which, assuming independence, appear to violate this expectation a significant fraction of the time - perhaps 20-25%, just eyeballing the distributions.

Can you comment on the apparent inconsistency, or straighten out my misconceptions?

Great post, thanks for the learning. I am a bit puzzled however as to why you go out of your way to say "In addition, we expect Θsens>1-Θspec," (which expectation BTW makes sense to me), only to follow this up by assuming priors for Θsens and Θspec which (assuming independence) violate this expectation with significant probability (perhaps 25% of the time, based on a crude eyeballing of the curves).

Can you help me better understand your thinking here (or clear up any misconception on my part)?

Wait a minute - when the Bayesian says "I think the coin probably has a chance near 50% of being heads", she's using data from prior observations of coin flips to say that. Which means that the frequentist might get the same answer if he added those prior observations to his dataset.

0neq1
Yes, that's a good point. That would be considered using a data augmentation prior (Sander Greenland has advocated such an approach).
0orthonormal
You can dismiss this objection by replacing the coin with a novel experimental test with an easily computed expected probability of success: say, the very first test of spin-up vs. spin-down for silver atoms. Frequentists can't claim relevant data sets for every experiment that has an obvious prior, without engaging in their own form of reference class tennis.
0PhilGoetz
How can they have an obvious prior without an obvious relevant data set?

There's really no contradiction between the classical and Bayesian results:

• Classical: we do not have enough evidence to rule out the claim about the coin's bias.
• Bayesian: the claim about the coin's bias is still considered extremely unlikely.

They are saying different things.

2PhilGoetz
If the 95% credible intervals are different, I think they are in contradiction.
3Douglas_Knight
The frequentist does not supply a credible interval, but a confidence interval. The credible interval has to do with the posterior P(H|D) (and thus the prior P(H)), which the frequentist refuses to talk about. The confidence interval has to do with P(D|H) (with p-values and maybe likelihood ratios). Scientists often treat confidence intervals as credible intervals, and thus are wrong, but frequentist statisticians are more sophisticated. I'm not clear on what they say, though. One advantage of the confidence interval is that it makes it clear that the test wasn't that powerful and that the bayesian is relying on the prior. The bayesian is not going to change the mind of the owner of the coin, who clearly has a different prior. Incidentally, if someone claimed a coin produced 90% heads, I'd update away from such a sharply peaked distribution.

Wow, this really brings home the arbitrary nature of the Bayesian approach. If we're trying to get an AI to determine what to do, it can't guess meaningful priors (and neither can we, come to that). I presume that when it is applied there are a load of theoretical approaches to prior model estimation, or is a uniform prior just used as the default? In which case, are there other occasions when a frequentist's and a Bayesian's probability estimates differ?

4DSimon
Sure, if the priors are arbitrary, the Bayesian approach's output is arbitrary. But if the priors make sense, the Bayesian approach works. Or in other words: just like any other algorithm good or bad, GIGO.
6wedrifid
Some would suggest that taking arbitrary priors isn't the Bayesian approach.
-1JohnDavidBustard
I suppose the question is how to calculate the priors so that they do make sense. In particular, how can an AI estimate priors? I'm sure there is a lot of existing work on this. The problem with making statements about priors that don't have a formal process for their calculation is that there is no basis for comparing two predictions. In the worst case, by adjusting the prior the resulting probabilities can be adjusted to any value. That makes the approach a formal technique that potentially just hides the unknowns in the priors, and in effect it is no more reasonable, because the priors are a guess.
3satt
There is. For example, one can use the Jeffreys prior, which has the desirable property of being invariant under different parametrization choices, or one can pick a prior according to the maximum entropy principle, which says to pick the prior with the greatest entropy that satisfies the model constraints. I don't know if anyone's come up with a meta-rationale that justifies one of these approaches over all others (or explains when to use different approaches), though.
1JohnDavidBustard
Thank you, this is very interesting. I'm not sure of the etiquette, but I'm reposting a question from an old article, that I would really appreciate your thoughts on. Is it correct, to say that the entropy prior is a consequence of creating an internally consistent formalisation of the aesthetic heuristic of preferring simpler structures to complex ones? If so I was wondering if it could be extended to reflect other aesthetics. For example, if an experiment produces a single result that is inconsistent with an existing simple physics theory, it may be that the simplest theory that explains this data is to treat this result as an isolated exception, however, aesthetically we find it more plausible that this exception is evidence of a larger theory that the sample is one part of. In contrast when attempting to understand the rules of a human system (e.g. a bureaucracy) constructing a theory that lacked exceptions seems unlikely ("that's a little too neat"). Indeed when stated informally the phrase might go "in my experience, that's a little too neat" implying that we formulate priors based on learned patterns from experience. In the case of the bureaucracy, this may stem from a probabilistic understanding of the types of system that result from a particular 'maker' (i.e. politics). However, this moves the problem to one of classifying contexts and determining which contexts are relevant, if this process is considered part of the theory, then it may considerably increase its complexity always preferring theories which ignore context. Unless of course the theory is complete (incorporating all contexts) in which case the simplest theory may share these contextual models and thus become the universal simplest model. It would therefore not be rational to apply Kolmogorov complexity to a problem in isolation. I.e. probability and reductionism are not compatible.
3satt
With the disclaimer that I'm no expert and quite possibly wrong about some of this, here goes. No. Or, at least, that's not the conscious motivation for the maximum entropy principle (MAXENT). As I see it, the justification for MAXENT is that entropy measures the "uncertainty" the prior represents, and we should choose the prior that represents greatest uncertainty, because that means assuming the least possible additional information about the problem. Now, it does sometimes happen that MAXENT tells you to pick a prior with what I'd guess you think of as "simpler structure". Suppose you're hiding in your fist a 6-sided die I know nothing about, and you ask me to give you my probability distribution for which side'll come up when you roll it. As I know nothing about the die, I have no basis for imposing additional constraints on the problem, so the only operative constraint is that P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1; given just that constraint, MAXENT says I should assign probability 1/6 to each side. In that particular case, MAXENT gives a nice, smooth, intuitively pleasing result. But if we impose a new constraint, e.g. that the expected value of the die roll is 4.5 (instead of the 3.5 implied by the uniform distribution), MAXENT says the appropriate probability distribution is {0.054, 0.079, 0.114, 0.165, 0.240, 0.348} for sides 1 to 6 respectively (from here), which doesn't look especially simple to me. So for all but the most basic problems, I expect MAXENT doesn't conform to the "simpler structures" heuristic. There is probably some definition of "simple" or "complex" that would make your heuristic equivalent to MAXENT, but I doubt it'd correspond to how we normally think of simplicity/complexity.
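The constrained die example can be reproduced numerically. Under a mean constraint, the maximum-entropy distribution on the faces 1 to 6 has the exponential form p_k ∝ exp(λk), so it is enough to find the λ that gives the required mean. The sketch below does this by bisection; the function names, bracket and iteration count are illustrative assumptions.

```python
from math import exp

FACES = range(1, 7)

def maxent_distribution(lam):
    """Distribution p_k proportional to exp(lam * k) over the six faces."""
    weights = [exp(lam * k) for k in FACES]
    total = sum(weights)
    return [w / total for w in weights]

def mean(probs):
    return sum(k * p for k, p in zip(FACES, probs))

def solve_for_mean(target, lo=-5.0, hi=5.0, iterations=100):
    """Bisect on lam; the mean increases monotonically with lam."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean(maxent_distribution(mid)) < target:
            lo = mid
        else:
            hi = mid
    return maxent_distribution((lo + hi) / 2)

probs = solve_for_mean(4.5)
print([round(p, 3) for p in probs])
```

Setting the target to 3.5 gives λ = 0 and recovers the uniform 1/6 assignment, while a target of 4.5 should agree with the quoted values to within rounding.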
1JohnDavidBustard
Thank you, that's very interesting, and comforting.
2jsalvatier
In statistics, I think 'weakly informative priors' are becoming more popular. Weakly informative priors are distributions like a t distribution (or normal) with a really wide standard deviation and low degrees of freedom. This allows us to avoid spending all our data on merely narrowing down the correct order of magnitude, which can be a problem in many problems using non-informative priors. It's almost never the case that we literally know nothing prior to the data.
1Matt_Simpson
Using a normal with a massive variance is also a standard hack for getting a proper "uninformative" prior on the real line.
3TobyBartels
Hopefully an AI will be able to get its hands on large amounts of data. Once it has that, it doesn't matter very much what its priors were.
1Jonathan_Graehl
Agreed, but the priors can in principle be strong enough that hypothesis A will always be favored over B no matter how much data you have, even though B gives an orders of magnitude higher P(data|B) than P(data|A).
0JohnDavidBustard
Is there a bound on the amount of data that is necessary to adjust a prior of a given error magnitude? Likewise, if the probability is the result of a changing system I presume it could well be the case that the pdf estimates will be consistently inaccurate as they are constantly adjusting to events whose local probability is changing. Does the Bayesian approach help, over say, model fitting to arbitrary samples? Is it, in effect, an example of a model fitting strategy no more reasonable than any other?
Clippy-30

I asked this before (though perhaps with the wrong symbols), so I'll try it a different way.