**Related to**: Bayes' Theorem Illustrated, What is Bayesianism?, An Intuitive Explanation of Bayes' Theorem

*(Bayes' theorem is something Bayesians need to use more often than Frequentists do, but Bayes' theorem itself isn't Bayesian. This post is meant to be a light introduction to the difference between Bayes' theorem and Bayesian data analysis.)*

**Bayes' Theorem**

Bayes' theorem is just a way to get (e.g.) p(B|A) from p(A|B) and p(B). The classic example of Bayes' theorem is diagnostic testing. Suppose someone either has the disease (D^{+}) or does not have the disease (D^{-}) and either tests positive (T^{+}) or tests negative (T^{-}). If we knew the sensitivity P(T^{+}|D^{+}), specificity P(T^{-}|D^{-}) and disease prevalence P(D^{+}), then we could get the positive predictive value P(D^{+}|T^{+}) using Bayes' theorem:

$$P(D^{+}\mid T^{+}) = \frac{P(T^{+}\mid D^{+})\,P(D^{+})}{P(T^{+}\mid D^{+})\,P(D^{+}) + \bigl(1 - P(T^{-}\mid D^{-})\bigr)\bigl(1 - P(D^{+})\bigr)}$$

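For example, suppose we know the sensitivity=0.9, specificity=0.8 and disease prevalence is 0.01. Then,

$$P(D^{+}\mid T^{+}) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + (1 - 0.8)(1 - 0.01)} = \frac{0.009}{0.207} \approx 0.043$$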

This answer is not Bayesian or frequentist; it's just correct.
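
The same arithmetic can be sketched in a few lines of Python. The function name `positive_predictive_value` is just an illustrative choice, not something from the post.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return P(D+ | T+) from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence                   # P(T+ | D+) P(D+)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(T+ | D-) P(D-)
    return true_pos / (true_pos + false_pos)


ppv = positive_predictive_value(sensitivity=0.9, specificity=0.8, prevalence=0.01)
print(round(ppv, 3))  # 0.043
```

Even with a fairly accurate test, the low prevalence means most positives are false positives.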

**Diagnostic testing study**

Typically we will not know P(T^{+}|D^{+}) or P(T^{-}|D^{-}). We would consider these unknown parameters. Let's denote them by Θ_{sens} and Θ_{spec}. For simplicity, let's assume we know the disease prevalence P(D^{+}) (we often have a lot of data on this).

Suppose 1000 subjects with the disease were tested, and 900 of them tested positive. Suppose 1000 disease-free subjects were tested and 200 of them tested positive. Finally, suppose 1% of the population has the disease.

**Frequentist approach**

Estimate the 2 parameters (sensitivity and specificity) using their sample values (sample proportions) and plug them into Bayes' formula above. This results in a point estimate for P(D^{+}|T^{+}) of 0.043. A standard error or confidence interval could be obtained using the delta method or bootstrapping.
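
The post does not show the interval calculation, so here is a rough sketch of one way the delta method could be applied. It assumes the two samples are independent binomials and treats the prevalence as a known constant; the function names are made up for illustration.

```python
import math


def ppv(sens, spec, prev):
    """P(D+ | T+) from sensitivity, specificity and prevalence."""
    tp = sens * prev
    fp = (1.0 - spec) * (1.0 - prev)
    return tp / (tp + fp)


def ppv_delta_ci(tp_count, n_diseased, fp_count, n_healthy, prev, z=1.96):
    """Plug-in estimate of the PPV with a first-order delta-method interval."""
    sens = tp_count / n_diseased   # sample sensitivity
    fpr = fp_count / n_healthy     # sample false positive rate = 1 - specificity
    estimate = ppv(sens, 1.0 - fpr, prev)

    # Partial derivatives of PPV = sens*prev / (sens*prev + fpr*(1 - prev)).
    denom = sens * prev + fpr * (1.0 - prev)
    d_sens = prev * fpr * (1.0 - prev) / denom**2
    d_fpr = -sens * prev * (1.0 - prev) / denom**2

    # Binomial variances of the two sample proportions (assumed independent).
    var = (d_sens**2 * sens * (1.0 - sens) / n_diseased
           + d_fpr**2 * fpr * (1.0 - fpr) / n_healthy)
    se = math.sqrt(var)
    return estimate, se, (estimate - z * se, estimate + z * se)


estimate, se, (ci_low, ci_high) = ppv_delta_ci(900, 1000, 200, 1000, prev=0.01)
print(f"estimate={estimate:.3f}, se={se:.4f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

A bootstrap would resample the two groups instead of using derivatives; with samples this large the two approaches should give similar intervals.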

Even though Bayes' theorem was used, this is not a Bayesian approach.

**Bayesian approach**

The Bayesian approach is to specify prior distributions for all unknowns. For example, we might specify independent uniform(0,1) priors for Θ_{sens} and Θ_{spec}. However, we should expect the test to do at least as well as guessing (guessing would mean randomly selecting 1% of people and calling them T^{+}). In addition, we expect Θ_{sens}>1-Θ_{spec}. So, I might go with a Beta(4,2.5) distribution for Θ_{sens} and Beta(2.5,4) for Θ_{spec}.

Combining these priors with the data yields a posterior distribution for P(D^{+}|T^{+}) with posterior median 0.043 and 95% credible interval (0.038, 0.049). In this case, the Bayesian and frequentist approaches give the same results (not surprising since the priors are relatively flat and there are a lot of data). However, the methodology is quite different.
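
The post does not say how the posterior was computed, so what follows is only one plausible route. Because a Beta prior combined with binomial data gives a Beta posterior, Θ_{sens} becomes Beta(4+900, 2.5+100) and Θ_{spec} becomes Beta(2.5+800, 4+200). Sampling from those and pushing each draw through Bayes' formula gives draws of P(D^{+}|T^{+}). The helper names, the number of draws and the seed are arbitrary choices.

```python
import random


def posterior_ppv_draws(n_draws=20_000, prev=0.01, seed=0):
    """Monte Carlo draws of P(D+ | T+) under the conjugate Beta posteriors."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        sens = rng.betavariate(4 + 900, 2.5 + 100)   # prior Beta(4, 2.5) + 900/1000 positives
        spec = rng.betavariate(2.5 + 800, 4 + 200)   # prior Beta(2.5, 4) + 800/1000 negatives
        tp = sens * prev
        fp = (1.0 - spec) * (1.0 - prev)
        draws.append(tp / (tp + fp))
    draws.sort()
    return draws


def quantile(sorted_values, q):
    """Simple empirical quantile of an already sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


draws = posterior_ppv_draws()
median = quantile(draws, 0.5)
lower, upper = quantile(draws, 0.025), quantile(draws, 0.975)
print(f"median={median:.3f}, 95% credible interval=({lower:.3f}, {upper:.3f})")
```

With 1000 subjects per group the data dominate the priors, which is why the result lands so close to the frequentist plug-in estimate.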

**Example that illustrates benefit of Bayesian data analysis**

(example edited to focus on credible/confidence intervals)

Suppose someone shows you what looks like a fair coin (you confirm heads on one side, tails on the other) and makes the claim: "This coin will land with heads up 90% of the time."

Suppose the coin is flipped 5 times and lands with heads up 4 times.

**Frequentist approach**

"A 95% confidence interval for the Binomial parameter is (.38, .99) using the Agresti-Coull method." Because 0.9 is within the confidence limits, the usual conclusion would be that we do not have enough evidence to rule it out.

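For readers who want to try the frequentist interval themselves, here is a minimal sketch of the Agresti-Coull construction (add z²/2 pseudo-successes and z²/2 pseudo-failures, then use a Wald interval). The exact endpoints depend on which variant of the adjustment and what rounding is used, so they need not match the figures quoted above to the last digit; the conclusion that 0.9 lies inside the interval does not change.

```python
import math


def agresti_coull_interval(successes, trials, z=1.96):
    """Approximate confidence interval for a binomial proportion."""
    n_adj = trials + z**2                     # adjusted sample size
    p_adj = (successes + z**2 / 2) / n_adj    # adjusted point estimate
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


low, high = agresti_coull_interval(successes=4, trials=5)
print(low < 0.9 < high)  # True
```
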
**Bayesian approach**

"I don't believe you. Based on experience and what I know about the laws of physics, I think it's very unlikely that your claim is accurate. I feel very confident that the probability is close to 0.5. However, I don't want to rule out something a little bit unusual (like a probability of 0.4). Thus, my prior for the probability of heads is a Beta(30,30) distribution."

After seeing the data, we update our belief about the binomial parameter. The 95% credible interval for it is (0.40, 0.64). Thus, a value of 0.9 is still considered extremely unlikely.
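
The update itself is simple: a Beta(30,30) prior plus 4 heads and 1 tail gives a Beta(34,31) posterior. Here is a small sketch that approximates the credible interval by sampling, using only the standard library; the seed and number of draws are arbitrary.

```python
import random

prior_a, prior_b = 30, 30
heads, tails = 4, 1
post_a, post_b = prior_a + heads, prior_b + tails   # Beta(34, 31)

rng = random.Random(0)
draws = sorted(rng.betavariate(post_a, post_b) for _ in range(100_000))

lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
prob_at_least_09 = sum(d >= 0.9 for d in draws) / len(draws)

print(f"95% credible interval: ({lower:.2f}, {upper:.2f})")
print(f"share of posterior draws at or above 0.9: {prob_at_least_09}")
```

Five flips barely move a prior that is worth about sixty flips, so the posterior stays centered near 0.5.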

This illustrates the idea that, from a Bayesian perspective, implausible claims require more evidence than plausible claims. Frequentists have no formal way of including that type of prior information.

I see so much on the site about Bayesian probability. Much of my current work uses Dempster-Shafer theory, which I haven't seen mentioned here.

DST is a generalization of Bayesian probability, and both fuzzy logic and Bayesian inference can be perfectly derived from DST. The most obvious difference is that DST parameterizes confidence, so that a 0.5 prior with no support is treated differently than a 0.5 prior with good support. For my work, the more important aspect is that DST is more forgiving when my sensors lie to me; it handles conflicting evidence more gracefully, as long as its results are correctly interpreted (in my opinion they are less intuitive than strict probabilities).

Joseph Felsenstein is a pioneer in the use of maximum likelihood methods in evolutionary biology. In his book, "Inferring Phylogenies", he has a chapter on Bayesian methods, and a section in that chapter on controversies over Bayesian inference. He discusses a toy example of a space probe to Mars which looks for little green men and doesn't find them. He wonders whether a scientist whose prior for little green men involved odds of 1/4, and who, based on the evidence of the space probe, now assigns odds of 1/12, should publish those revised odds. He writes:

Felsenstein apparently defines himself as a "likelihoodist" rather than a "frequentist" or "Bayesian".

"Likelihoodist" is so clunky and dull. I prefer "likelihoodlum" -- it's just as clunky, but at least it's somewhat inflammatory.

Good point. It would be irresponsible to publish a news item that "the Prime Minister's support for this bill is three times more likely if he is, in fact, a lizard alien than if he is a human" without noting that the prior probability for him being a lizard alien is pretty low.

Good post. It might be worth mentioning, after noting the benefit of the Bayesian approach, the deep problem with it: the subjectivity of the prior. This can be illustrated by looking at one of the above paragraphs:

It's fine for Bayesian A to choose a prior based on his experience, fe...

Except that they're actually using different evidence, because the intuitive feelings that go into a prior are, in fact, evidence. More specifically, the prior is the sum of the outputs of all the heuristics that aren't quite strong or reliable enough to reach conscious awareness, but which nevertheless do, and ought to, influence the probability judgment.

I like this post, there's still a lot of confusion around Bayesian methods.

Two things that would have helped me while I was learning Bayesianism were that:

and

I might write these into a post sometime.

*This is what's going wrong in the heads of people who say things like "The probability is either 1 or 0, but I don't know which."

Is the idea that the coin will land heads 90% of the time really something that can be called the "null hypothesis"?

Statistically, there's nothing wrong with the null hypothesis being p=0.9. It's probably not a test you would see very often in practice because usually there is nothing interesting about p=0.9. But if you wanted to test whether or not p=0.9 for some reason - any reason, setting the null hypothesis as p=0.9 is a perfectly valid (frequentist) way of doing it.
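
As a rough illustration of such a test, here is an exact two-sided binomial test of the null hypothesis p=0.9 with the 4-heads-in-5-flips data from the post. The two-sided rule used here (sum the probabilities of all outcomes no more likely than the observed one) is one common convention, not the only one.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)


def exact_two_sided_p_value(successes, trials, p_null):
    """Sum the null probabilities of outcomes no more likely than the observed one."""
    observed = binom_pmf(successes, trials, p_null)
    total = 0.0
    for k in range(trials + 1):
        pmf = binom_pmf(k, trials, p_null)
        if pmf <= observed * (1.0 + 1e-9):   # small tolerance for float ties
            total += pmf
    return total


p_value = exact_two_sided_p_value(successes=4, trials=5, p_null=0.9)
print(round(p_value, 3))  # 0.41
```

A p-value this large gives no grounds to reject p=0.9, which agrees with 0.9 falling inside the confidence interval.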

I don't quite get how you got your distributions in the first example. Since getting the prior distributions seems to be where the magic happens, further explanation would be helpful.

THANK YOU! That's the best explanation I've ever seen of the difference. I don't know if it's right; but at least it's making a coherent claim.

Can you spell out how the computation is done with the priors in the Bayesian case?

Quibble:

Guessing that everyone is T- would have a lower error rate.

May be relevant here: my post and Wei Dai's post.

Yeah, the best they can do is require a stricter significance level. (IIRC, the justification for the standard "statistical significance" test involves something like taking the position of maximum ignorance by assuming that the null hypothesis is 50% likely to be true and then applying Bayes's Theorem to get a posterior probability...)

The key feature of Bayesian probability is the concept that probabilities are degrees of belief. The reason that Bayes' Theorem is associated with the Bayesian approach is that once you are willing, unlike frequentists, to have a probability that a coin comes up heads in 90% of flips, you can use Bayes' Theorem to update that probability on observing the evidence of 4 flips resulting in heads and 1 flip resulting in tails.

neq1,

Thanks for the great learning. One apparent inconsistency puzzled me a bit:

You go out of your way to say that you expect Θsens>1-Θspec (which looks sensible to me), then follow up by choosing a pair of priors which, assuming independence, appear to violate this expectation a significant fraction of the time - perhaps 20-25%, just eyeballing the distributions.

Can you comment on the apparent inconsistency, or straighten out my misconceptions?
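
The eyeballed fraction can be checked with a quick simulation. This is only a sketch, assuming the priors exactly as stated in the post and independence between them. Note that if Θ_{spec} is Beta(2.5,4), then 1-Θ_{spec} is Beta(4,2.5), the same distribution as Θ_{sens}, so under these assumptions the fraction should come out near one half rather than 20-25%.

```python
import random

rng = random.Random(0)
n_draws = 200_000

violations = 0
for _ in range(n_draws):
    sens = rng.betavariate(4, 2.5)   # prior draw for sensitivity
    spec = rng.betavariate(2.5, 4)   # prior draw for specificity
    if sens < 1.0 - spec:            # the expectation sens > 1 - spec fails
        violations += 1

fraction = violations / n_draws
print(f"prior probability that sens < 1 - spec: {fraction:.3f}")
```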

Great post, thanks for the learning. I am a bit puzzled however as to why you go out of your way to say "In addition, we expect Θsens>1-Θspec," (which expectation BTW makes sense to me), only to follow this up by assuming priors for Θsens and Θspec which (assuming independence) violate this expectation with significant probability (perhaps 25% of the time, based on a crude eyeballing of the curves).

Can you help me better understand your thinking here (or clear up any misconception on my part)?

Wait a minute - when the Bayesian says "I think the coin probably has a chance near 50% of being heads", she's using data from prior observations of coin flips to say that. Which means that the frequentist might get the same answer if he added those prior observations to his dataset.
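
This intuition can be tried out numerically. A minimal sketch, assuming (hypothetically) that the Beta(30,30) prior stands in for roughly 30 heads and 30 tails seen earlier, and that the frequentist pools them with the 5 new flips and uses a plain Wald interval:

```python
import math


def wald_interval(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat - half_width, p_hat + half_width


pseudo_heads, pseudo_tails = 30, 30   # assumed stand-in for the prior experience
heads, tails = 4, 1                   # the flips from the post

pooled_low, pooled_high = wald_interval(
    successes=pseudo_heads + heads,
    trials=pseudo_heads + pseudo_tails + heads + tails,
)
print(f"pooled 95% confidence interval: ({pooled_low:.2f}, {pooled_high:.2f})")
```

Under that reading the pooled interval comes out close to the (0.40, 0.64) credible interval in the post, although the two intervals still mean different things.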

There's really no contradiction between the classical and Bayesian results:

They are saying different things.

Wow, this really brings home the arbitrary nature of the Bayesian approach. If we're trying to get an AI to determine what to do, it can't guess meaningful priors (and neither can we, come to that). I presume that when it is applied there are a load of theoretical approaches to estimating the prior, or is a uniform prior just used as the default? In which case, are there other occasions when a frequentist's and a Bayesian's probability estimates differ?

I asked this before (though perhaps with the wrong symbols), so I'll try it a different way.

Someone please evaluate this statement:

?equals(correct reasoning, Bayesian inference)