As a teaser, here is the visual version of Bayesian updating:
But in order to understand that figure, we need to go through the prior and the likelihood!
You find me standing on a basketball court, ready to shoot some hoops. What do
you believe about my performance before I take a shot? There is no good null
hypothesis here unless you happen to know a lot about average human basketball
performance, and even then, why would you care whether I am significantly
different from the average? You could fall back on the 'New Statistics'
(estimation and confidence intervals), which is almost as good as the Bayesian
approach, but it does not answer what you should believe before I take a shot.
The Beta distribution is a popular prior for binary events; when its two parameters (α and β) are equal to 1, it is uniform. Since you, my dear reader, have no idea about my basketball skills, you assume θ comes from a Beta(1,1) distribution, formally:
Where θ is my probability of scoring, the distribution looks like this:
Completely uniform: a great prior when you are totally oblivious.
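If you want to convince yourself numerically, here is a quick sketch (plain Python; the function name is mine) evaluating the Beta density from its formula, showing that Beta(1,1) really is flat:

```python
from math import gamma

def beta_pdf(theta, a, b):
    """Beta density: theta^(a-1) * (1-theta)^(b-1) / B(a, b)."""
    B = gamma(a) * gamma(b) / gamma(a + b)  # the Beta function via gammas
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

# A Beta(1, 1) is flat: the density is 1 for every theta in [0, 1]
for theta in [0.1, 0.5, 0.9]:
    print(theta, beta_pdf(theta, 1, 1))  # always 1.0
```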
I take a shot and miss (z=0); the likelihood of a miss looks like this:
(if you are extra curious, you can brush up on the math behind all the binary distributions here)
Notice that these are likelihoods, not probabilities: they express how likely the data are for different values of θ. So it is twice as likely that the data z=0 was generated by θ=0 compared to θ=0.5.
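In code (a minimal sketch, function name mine), the Bernoulli likelihood is θ^z (1−θ)^(1−z), which for a miss reduces to 1−θ:

```python
def bernoulli_lik(z, theta):
    """Likelihood of outcome z (1 = score, 0 = miss) given scoring probability theta."""
    return theta ** z * (1 - theta) ** (1 - z)

# A miss (z=0) is twice as likely under theta=0 as under theta=0.5:
print(bernoulli_lik(0, 0.0))  # 1.0
print(bernoulli_lik(0, 0.5))  # 0.5
```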
Here is Bayes' theorem for the Bernoulli distribution with a Beta prior, where
the parameter z is 1 when I score and 0 otherwise:
For technical reasons p(z), the probability of the data, is difficult to
calculate. It is, however, 'just a normalization constant': because it does not
depend on θ, my scoring probability, we can simply drop it
and get an unnormalized posterior:
An unnormalized posterior is simply a density function that does not integrate to 1;
when we plot it, it looks 'correct' except we have screwed up the numbers on
the y-axis.
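To see that dropping p(z) only messes with the y-axis, here is a grid sketch (variable names mine): normalizing divides every value by the same constant, so the shape, and the location of the peak, is unchanged:

```python
grid = [i / 100 for i in range(101)]          # candidate values of theta
prior = [1.0 for _ in grid]                   # Beta(1,1): flat
lik = [1 - t for t in grid]                   # likelihood of a miss, z=0
unnorm = [p * l for p, l in zip(prior, lik)]  # unnormalized posterior

# Normalizing divides by the same constant everywhere...
const = sum(unnorm)
norm = [u / const for u in unnorm]

# ...so the peak sits at the same theta either way:
assert unnorm.index(max(unnorm)) == norm.index(max(norm))
```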
So now we have a 'square' prior p(θ)∼Beta(1,1) and a triangle likelihood p(z=0∣θ). If we multiply them together we get the unnormalized posterior, so we do:
Which can intuitively be thought of as: the square makes everything equally likely, so the likelihood will dominate the posterior, or in dodgy math:
Here is the Figure:
Try putting your finger on the figure and check that at θ=0.5 the density is 1 for the
square and 0.5 for the triangle, and is thus 1×0.5=0.5 in the
unnormalized posterior.
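The finger exercise in code (a tiny sketch): at θ=0.5 the flat prior contributes 1 and the miss likelihood contributes 0.5, so the unnormalized posterior is 0.5 there:

```python
theta = 0.5
prior = 1.0             # the 'square': Beta(1,1) density is 1 everywhere
likelihood = 1 - theta  # the 'triangle': likelihood of a miss, z=0
posterior_unnorm = prior * likelihood
print(posterior_unnorm)  # 0.5
```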
I shoot again and score!
Now we use the previous posterior as the new prior, but because I scored we get an 'opposite triangle', which is the likelihood p(z=1∣θ):
Again we multiply the prior triangle by the likelihood triangle and get a blob
centered on 0.5 as the posterior:
Notice how the posterior peaks at θ=0.5. This is because the two
triangles at the center have an unnormalized posterior density of
0.5×0.5=0.25, while at the edges, such as θ=0.9, they have
0.9×0.1=0.09.
So now, again, the previous blob posterior is our new prior, which we multiply by
the 'I scored' triangle, resulting in a blob with a mode above 0.5, which
makes sense as I made 2/3 shots:
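The whole sequence (miss, hit, hit) as one grid sketch, where each posterior becomes the next prior; the final mode lands at θ = 2/3, matching 2 made shots out of 3:

```python
grid = [i / 1000 for i in range(1001)]   # candidate values of theta
posterior = [1.0 for _ in grid]          # start from the flat Beta(1,1) prior

for z in [0, 1, 1]:                      # miss, hit, hit
    lik = [t ** z * (1 - t) ** (1 - z) for t in grid]
    posterior = [p * l for p, l in zip(posterior, lik)]  # old posterior -> new prior

mode = grid[posterior.index(max(posterior))]
print(mode)  # the mode is above 0.5, at theta = 2/3 (about 0.667)
```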
While this may seem like a cute toy example, it is a totally valid way of computing
a Bayesian posterior, and it is how the most popular Bayesian books (Gelman,
Kruschke, and McElreath) introduce the concept!
In the case of Bernoulli events we can actually solve for the posterior analytically,
because the Beta is conjugate to the Bernoulli. Conjugacy is simply fancy statistics speak for the posterior having the same simple mathematical form as the prior, which here is also a Beta distribution. Thus you can update the Beta distribution using this simple rule:
So we started with a prior with α=β=1, a Beta(1,1)
Then we got a miss, z=0, giving Beta(1,2)
Then we got a hit, z=1, giving Beta(2,2)
Then we got another hit, z=1, giving Beta(3,2)
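The conjugate update rule as code (a sketch; the rule is posterior Beta(α+z, β−z+1)):

```python
def update(a, b, z):
    """One Bernoulli observation z turns a Beta(a, b) prior into a Beta(a+z, b-z+1) posterior."""
    return a + z, b - z + 1

a, b = 1, 1              # Beta(1,1): the know-nothing prior
for z in [0, 1, 1]:      # miss, hit, hit
    a, b = update(a, b, z)
print(a, b)  # 3 2: the Beta(3, 2) posterior
```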
We can plot the Beta(3,2) posterior
Notice how this posterior has the exact same shape as the one we got via
updating; the only difference is the numbers on the y-axis.
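You can check that claim numerically with a sketch (names mine): once the grid posterior is normalized using the grid spacing, it matches the Beta(3,2) density point for point:

```python
from math import gamma, isclose

def beta_pdf(theta, a, b):
    """Beta density evaluated from its formula."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

step = 0.001
grid = [i * step for i in range(1, 1000)]   # interior points of (0, 1)
unnorm = [t ** 2 * (1 - t) for t in grid]   # 2 hits, 1 miss
const = sum(u * step for u in unnorm)       # approximate the integral
norm = [u / const for u in unnorm]

# The normalized grid posterior matches the conjugate Beta(3, 2) density:
for t, n in zip(grid, norm):
    assert isclose(n, beta_pdf(t, 3, 2), rel_tol=1e-3)
```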
(Hi, if you made it this far, please comment. If something was not well explained, I care more about my statistics communication skills than my ego, so negative feedback is very welcome.)
Gelman, Hill, and Vehtari, “Regression and Other Stories” ↩︎
Richard McElreath, “Statistical Rethinking” ↩︎
John Kruschke, “Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan”, 2nd edition ↩︎
I am well aware that nobody asked for this, but here is the proof that the posterior is Beta(α+z, β−z+1) for the Beta-Bernoulli model.
We start with Bayes Theorem:
Then we plug in the definition for the Bernoulli likelihood and Beta prior:
Let's collect the powers in the numerator, and the things that do not depend on
θ in the denominator:
Here come the conjugation shenanigans. If you squint, the top of the distribution looks like
the top of a Beta distribution:
Let's continue the shenanigans: since the numerator looks like the numerator of
a Beta distribution, we know that it would be a proper Beta
distribution if we changed the denominator like this:
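Written out in one run, the steps described above can be sketched like this (my reconstruction from the Bernoulli likelihood and Beta prior definitions):

```latex
\begin{align}
p(\theta \mid z)
  &= \frac{p(z \mid \theta)\, p(\theta)}{p(z)} \\
  &= \frac{\theta^{z}(1-\theta)^{1-z} \cdot
        \dfrac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}}{p(z)} \\
  &= \frac{\theta^{\alpha+z-1}\,(1-\theta)^{\beta-z+1-1}}{B(\alpha,\beta)\, p(z)} \\
  &= \frac{\theta^{(\alpha+z)-1}\,(1-\theta)^{(\beta-z+1)-1}}{B(\alpha+z,\ \beta-z+1)}
\end{align}
```

which is exactly the density of a Beta(α+z, β−z+1) distribution.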
Then we got a miss, z=1
Then we got a miss, z=1
I think "miss" here should be "hit".
The order does not matter. You can see that by focusing on θ=1/2, which after N shots is always equal to 1/2^N; you can also see it from the conjugation rule, where you end up with Beta(3,2) no matter the order.
If you wanted the order to matter, you could down-weight earlier shots or widen the uncertainty between updates, so the previous posterior becomes a slightly wider prior to capture the extra uncertainty from the passage of time.
I read all the way through, unfortunately this is pretty mathy for me and I don't have much to say. I did like the visualization of updates as squishing the blob around -- my own introduction to updating was via the sequences which didn't have that.
Mine was the same. I became a Bayesian statistician 4 years ago. I gave a talk about Bayesian statistics and this figure was what made it click for most students (including myself), so I wanted to share it.