Bayes' rule: Functional form

Edited by Eliezer Yudkowsky, So8res, et al. last updated 11th Oct 2016

Bayes' rule generalizes to continuous functions, and states, "The posterior probability density is proportional to the likelihood function times the prior probability density."

$$P(H_x \mid e) \propto \mathcal{L}_e(H_x) \cdot P(H_x)$$

Example

Suppose we have a biased coin with an unknown bias b between 0 and 1 of coming up heads on each individual coinflip. Since the bias b is a continuous variable, we express our beliefs about the coin's bias using a probability density function P(b), where P(b)·db is the probability that b lies in the interval [b, b+db] for db small. (Specifically, the probability that b lies in the interval $[b_1, b_2]$ is $\int_{b_1}^{b_2} P(b)\,db$.)

By hypothesis, we start out completely ignorant of the bias b, meaning that all initial values for b are equally likely. Thus, P(b)=1 for all values of b, which means that P(b)db=db (e.g., the chance of b being found in the interval from 0.72 to 0.76 is 0.04).

[Plot: y = 1 over 0 ≤ x ≤ 1]
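As a quick numerical check of the claim above, here is a short NumPy sketch (the grid size and the small trapezoid helper are illustrative choices, not from the text) that integrates the uniform prior density over the interval from 0.72 to 0.76:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoid-rule approximation of the integral of y over grid x."""
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2)

# Uniform prior density on [0, 1]: P(b) = 1 everywhere.
grid = np.linspace(0.72, 0.76, 1001)
prior = np.ones_like(grid)

mass = trapezoid(prior, grid)
print(round(mass, 4))  # 0.04, matching the text's example
```

The density is constant at 1, so the probability mass over any interval is just the interval's width.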

We then flip the coin, and observe it to come up tails. This is our first piece of evidence. The likelihood Lt1(b) of observation t1 given bias b is a continuous function of b, equal to 0.4 if b=0.6, 0.67 if b=0.33, and so on (because b is the probability of heads and the observation was tails).

Graphing the likelihood function Lt1(b) as it takes in the fixed evidence t1 and ranges over variable b, we obtain the straightforward graph Lt1(b)=1−b.

[Plot: y = 1 − x over 0 ≤ x ≤ 1]

If we multiply the likelihood function by the prior probability function as it ranges over b, we obtain a relative probability function on the posterior, O(b∣t1)=Lt1(b)⋅P(b)=1−b, which gives us the same graph again:

[Plot: y = 1 − x over 0 ≤ x ≤ 1]

But this can't be our posterior probability function, because it doesn't integrate to 1: $\int_0^1 (1-b)\,db = \frac{1}{2}$. (The area under a triangle is half the base times the height.) Normalizing this relative probability function will give us the posterior probability function:

$$P(b \mid t_1) = \frac{O(b \mid t_1)}{\int_0^1 O(b \mid t_1)\,db} = 2 \cdot (1 - b)$$

[Plot: y = 2(1 − x) over 0 ≤ x ≤ 1]

The shapes are the same; only the y-axis labels have changed, to reflect the different heights of the pre-normalized and normalized functions.
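The normalization step above can be sketched numerically (grid size and the trapezoid helper are illustrative choices, not from the text): divide the relative density 1 − b by its integral and check that the result matches 2(1 − b).

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoid-rule approximation of the integral of y over grid x."""
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2)

b = np.linspace(0, 1, 100_001)
relative = 1 - b                              # O(b | t1) = L_t1(b) * P(b)
posterior = relative / trapezoid(relative, b) # divide by the integral, 1/2

print(round(posterior[0], 3))  # 2.0: the normalized density at b = 0
```

Dividing by the total area (1/2) doubles the height everywhere, which is exactly the factor of 2 in the text's formula.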

Suppose we now flip the coin another two times, and it comes up heads then tails. We'll denote this piece of evidence h2t3. Although these two coin tosses pull our beliefs about b in opposite directions, they don't cancel out — far from it! In fact, one value of b ("the coin always comes up tails") is completely eliminated by this evidence, and extreme values of b ("almost always heads" and "almost always tails") take large hits. That is, while the heads and the tails pull our beliefs in opposite directions, they don't pull with the same strength on all possible values of b.

We multiply the old belief

[Plot: y = 2(1 − x) over 0 ≤ x ≤ 1]

by the likelihoods of the two additional observations, Lh2(b)=b and Lt3(b)=1−b, and obtain the posterior relative density

[Plot: y = 2(1 − x) · x · (1 − x) over 0 ≤ x ≤ 1]

which is proportional to the posterior probability

[Plot: y = 12(1 − x) · x · (1 − x) over 0 ≤ x ≤ 1]

Writing out the whole operation from scratch:

$$P(b \mid t_1 h_2 t_3) = \frac{\mathcal{L}_{t_1 h_2 t_3}(b) \cdot P(b)}{P(t_1 h_2 t_3)} = \frac{(1-b) \cdot b \cdot (1-b) \cdot 1}{\int_0^1 (1-b) \cdot b \cdot (1-b) \cdot 1 \, db} = 12 \cdot b(1-b)^2$$
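The whole three-flip update can be carried out on a grid (a sketch; the grid resolution and trapezoid helper are illustrative choices, not from the text) and compared against the closed-form answer 12·b(1−b)²:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoid-rule approximation of the integral of y over grid x."""
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2)

b = np.linspace(0, 1, 100_001)
prior = np.ones_like(b)               # P(b) = 1: total ignorance of the bias
likelihood = (1 - b) * b * (1 - b)    # tails, then heads, then tails
relative = likelihood * prior
posterior = relative / trapezoid(relative, b)

# Closed-form posterior from the text: 12 * b * (1 - b)^2
analytic = 12 * b * (1 - b) ** 2
max_err = float(np.max(np.abs(posterior - analytic)))
print(max_err)  # tiny numerical discrepancy
```

Note that the grid version never needs the normalization constant in advance: dividing by the numerically computed integral recovers the factor of 12 automatically.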

Note that it's okay for a posterior probability density to be greater than 1, so long as the total probability mass isn't greater than 1. If there's probability density 1.2 over an interval of 0.1, that's only a probability of 0.12 for the true value to be found in that interval.
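To see this concretely in the running example (a sketch; the grid is an illustrative choice), the posterior 12·b(1−b)² peaks at b = 1/3 with density 16/9 ≈ 1.78, yet still integrates to 1:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoid-rule approximation of the integral of y over grid x."""
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2)

b = np.linspace(0, 1, 100_001)
posterior = 12 * b * (1 - b) ** 2     # peak density 16/9 at b = 1/3

peak = float(posterior.max())
total = trapezoid(posterior, b)
print(round(peak, 3))   # 1.778: the density exceeds 1 ...
print(round(total, 3))  # 1.0:   ... but the total mass is still 1
```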

Thus, intuitively, Bayes' rule "just works" when calculating the posterior probability density from the prior probability density function and the (continuous) likelihood function. A proof is beyond the scope of this guide; refer to Proof of Bayes' rule in the continuous case.
