The Joys of Conjugate Priors

This is a fantastic post! Well done.

That said, I have quibbles that relate to the philosophical import ascribed to the beta distribution:

the beta distribution is an excellent exemplar of the notion of the comparative weight of evidence in the prior vs the data, but the notion is much more general;
priors should ideally reflect the actual information at one's disposal, and thus should rarely actually be conjugate;
it's controversial to claim that alpha = beta = 1 expresses no prior knowledge; other proposals include the improper alpha = beta = 0 and Jeffreys' prior, alpha = beta = 0.5.
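
As a rough illustration of how much the choice among these three candidates matters on small samples, here is a minimal sketch of the conjugate update under each prior. The data (3 successes, 1 failure) and the helper names are made up for the example.

```python
def beta_posterior(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior plus binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a proper Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# The three "no prior knowledge" candidates mentioned above.
priors = {"uniform": (1.0, 1.0), "Jeffreys": (0.5, 0.5), "Haldane": (0.0, 0.0)}

# Hypothetical data, chosen only for illustration.
successes, failures = 3, 1

posterior_means = {}
for name, (a, b) in priors.items():
    post_a, post_b = beta_posterior(a, b, successes, failures)
    # Note: under Haldane's prior the posterior is proper only if both
    # successes and failures are at least 1, as they are here.
    posterior_means[name] = beta_mean(post_a, post_b)
    print(f"{name:8s} posterior Beta({post_a}, {post_b}), mean {posterior_means[name]:.3f}")
```

With only four observations the three posterior means differ noticeably; the gap shrinks as the sample grows.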

And one other complaint: using the notion of picking a "best" value of theta for prediction to motivate the subsequent discussion was a misstep. If prediction is the goal, then the Bayesian procedure is to formulate the joint distribution of theta and the as-yet-unobserved data and then treat theta as a nuisance parameter and integrate over it.
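
To make the contrast concrete, here is a small sketch for the beta-binomial case, assuming a Beta(4, 2) posterior (a uniform prior updated on 3 successes and 1 failure, numbers invented for the example). It compares the predictive distribution obtained by integrating theta out with the one obtained by plugging in the posterior mode.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def predictive_pmf(m, t, a, b):
    """P(m successes in t future trials) with theta integrated out
    under a Beta(a, b) posterior (the beta-binomial distribution)."""
    return comb(t, m) * exp(log_beta(a + m, b + t - m) - log_beta(a, b))


def plug_in_pmf(m, t, theta_hat):
    """P(m successes in t future trials) using a single point estimate."""
    return comb(t, m) * theta_hat**m * (1.0 - theta_hat) ** (t - m)


# Hypothetical posterior: Beta(1, 1) prior updated on 3 successes, 1 failure.
a, b = 4.0, 2.0
theta_map = (a - 1.0) / (a + b - 2.0)  # posterior mode

t = 5  # number of future trials
predictive = [predictive_pmf(m, t, a, b) for m in range(t + 1)]
plug_in = [plug_in_pmf(m, t, theta_map) for m in range(t + 1)]

for m in range(t + 1):
    print(f"m={m}: integrated {predictive[m]:.4f}   plug-in {plug_in[m]:.4f}")
```

The integrated predictive puts more mass on the extreme outcomes than the plug-in version does, because it carries the remaining uncertainty about theta into the prediction.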

In spite of the above criticisms, I consider this post yeoman's work -- it deserves more upvotes than I can give it.

[-]TCB15y30

Thank you very much for the compliments, and for the honest criticism!

I am still thinking about your comment, and I intend to write a detailed response to it after I have thought about your criticisms more completely. In the meantime, though, I wanted to say that the feedback is very much appreciated!

[-]TCB15y20

After rereading this, I agree with you that I emphasized the beta distribution too heavily. This wasn't my intention; I just picked it because it was the simplest conjugate prior I could find. In the next draft of this document, I'll make sure to stress that the beta distribution is just one of many great conjugate priors!

I am a bit confused about what the second point means. Do you mean that conjugate priors are insufficient for capturing the actual prior knowledge possessed?

I did not know that it was controversial to claim that alpha = beta = 1 expresses no prior knowledge! I think I still prefer alpha = beta = 1 to the other choices, since the uniform distribution has the highest entropy of any continuous distribution over [0,1]. What are the benefits of the other two proposals?

Your last complaint is something I was worried about when I wrote this. Part of why I wrote it like that was because I figured people would be more familiar with the MLE/MAP style of prediction. Thanks to your feedback, though, I think I'll change that in my next draft of this document.

Again, thank you so much for the detailed criticism; it is very much appreciated! =)

[-]Cyan15y30

The improper alpha = beta = 0 prior, sometimes known as Haldane's prior, is derived using an invariance argument in Jaynes's 1968 paper Prior Probabilities. I actually don't trust that argument -- I find the critiques of it here compelling.

Jeffreys priors are derived from a different invariance argument; Wikipedia has a pretty good article on the subject.

I have mostly used the uniform prior myself in the past, although I think in the future I'll be using the Jeffreys prior as a default for the binomial likelihood. But the maximum entropy argument for the uniform prior is flawed: differential entropy is not an extension of discrete Shannon entropy to continuous distributions. The correct generalization is to relative entropy. Since the measure is arbitrary, the maximum entropy argument is missing an essential component.
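
A short worked example of the dependence on the measure, with the reparameterization phi = theta^2 chosen purely for illustration:

```latex
% Differential entropy of the uniform prior on theta in [0, 1]:
h(\theta) = -\int_0^1 1 \cdot \log 1 \, d\theta = 0 .

% Same state of knowledge, expressed in \phi = \theta^2,
% which has density p(\phi) = 1 / (2\sqrt{\phi}) on (0, 1):
h(\phi) = -\int_0^1 \frac{1}{2\sqrt{\phi}} \log\frac{1}{2\sqrt{\phi}} \, d\phi
        = \log 2 + \int_0^1 \log\theta \, d\theta
        = \log 2 - 1 \approx -0.307 .

% Relative entropy with respect to a reference measure m is invariant,
% because p and m pick up the same Jacobian under a change of variables:
H(p \,\|\, m) = -\int p(\theta) \log\frac{p(\theta)}{m(\theta)} \, d\theta .
```

Maximizing differential entropy in phi instead would give a uniform density in phi, which corresponds to p(theta) = 2 theta, not to the uniform prior on theta. The "maximum entropy" answer is therefore fixed only once the reference measure m has been chosen.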

[-]jsteinhardt15y190

I think LessWrong could use more posts on actual technical topics in machine learning, and this is a nice first step. It would be good if there was a sequence on it.

You might want to include the link to the Wikipedia table of conjugate priors in your post, and at least a mention of exponential families.

If you're a smart Bayesian agent, then, you'll pick p(theta) to be a conjugate prior

While conjugate priors can be very useful computationally, it might also be the case that your data is not well-modeled by the conjugate prior. (If you're using the naive Bayes model, this might not seem like a huge problem, but once you start trying to build hierarchical models using conjugate priors, you have more potential to run into problems.)

[-]TCB15y40

I would love to see an LW sequence on machine learning! I imagine that LW would have a lot of interesting things to say about the philosophical aspects of ML in addition to the practical aspects.

I'm not sure I'd be qualified to contribute much to such a sequence, since I am just an undergrad, but I did have an outline in mind for an intuitive introduction to MLE and EM. If people would find that interesting, I could certainly post it on LW once it was written up!

I'm fairly inexperienced in ML, so all the models I've worked with are simple enough that they've had conjugate priors. (I think it's really cool that Dirichlet priors can be used for something as complicated as an HMM, but I guess the HMM is still just a whole bunch of multinomials.) I'm less familiar with hierarchical models. What is an example of a model for which it is difficult to use conjugate priors? The only hierarchical process I've heard about is the Dirichlet process, and I was under the impression (based on the name) that it involved Dirichlet priors somewhere; is this incorrect? I have been meaning to read about hierarchical models, so if you know of any good tutorials or papers on them, I would very much appreciate a link!

[-]jsteinhardt15y10

Cyan's observation about mixtures of conjugate priors being conjugate kills the example I had in mind. I'll think for a bit and let you know if I think of any examples. If I haven't replied in a couple of weeks, remind me and I'll make sure to reply.

Dirichlet processes aren't inherently hierarchical; they are just self-conjugate, so you can make the output of one the input to another. If you connect them up in a tree structure, you get a hierarchical Dirichlet process.

[-]Cyan15y10

Andrew Gelman wrote a comment on someone else's paper that might prove to be a useful introduction to hierarchical models.

[-]Eugine_Nier15y70

But you're probably not really looking for a distribution over different parameter settings; you're looking for a single best setting of the parameters that you can use for making predictions.

This is in essence the narrative fallacy. While it can be a useful heuristic, there are dangers: for example, it can cause you to neglect outliers and black swans.

[-]Tyrrell_McAllister15y20

This is an excellent article. However, I did have the same philosophical problem that Cyan gave in this bullet point:

priors should ideally reflect the actual information at one's disposal, and thus should rarely actually be conjugate;

You seem to suggest that conjugate prior distributions are "smart" because they update in a computationally tractable way. Certainly, as a concession to practical necessity, we have to take computational tractability into account. But it is controversial to think of doing this as part of the ideal epistemology that we are trying to approximate.

Also, I found myself confused at a few points near the beginning. You write

While going about your daily activities, you observe an event of type x. Because you're a good Bayesian, you have some internal parameter \beta which represents your belief that x will occur.

Now, you're familiar with the Ways of Bayes, and therefore you know that your beliefs must be updated with every new datapoint you perceive. Your observation of x is a datapoint, and thus you'll want to modify \beta. But how much should this datapoint influence \beta?

At first, I misread you as saying, in effect, "Given that x occurs, what should be your updated probability that x occurs?" But, of course, your updated probability, conditioned on x's occurring, that x occurs, should be 1.

I also misunderstood you to be proposing to consider the probability of the probability of a given event being such-and-such. That is, I thought that you were proposing to consider a probability of the form P(P(x | y) = p | z), where x, y, and z are events, and p is a number in [0,1]. But, as I understand it, this is not a well-formed notion in Bayesian epistemology.

I think that my confusion arose from your calling \beta an "internal parameter". But, from the subsequent discussion, it seems better to think of \beta as an unknown parameter fed into whatever physical process generated x. For example, \beta could be an unknown parameter fed into a pseudo-random number generator that was observed to output the number x.

[-][anonymous]11y00

This was great to read, but as a layman I find it a bit inconvenient that there is no block consisting solely of an example, worked from beginning to end. I mean, the key words are easily googlable, but somehow, on the few webpages I have looked at, there is never a step-by-step account of what one does in this situation :(

[-]enfascination12y00

This work articulates an attack on the use of conjugate priors in a Bayesian analysis: http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ba/1340369826 In their words, "conjugate priors may lead to a dogmatic analysis."

Sorry for necro.

[-]Thomas15y00

Well ... what bothers me? That the alpha and beta should have their own probability distribution each. And so on and on.

[-]jsteinhardt15y00

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.5680&rep=rep1&type=pdf

EDIT: They don't extend the hierarchy infinitely upward in the paper, but there is no reason not to, as far as I can see.

[-]Cyan15y10

Like so: Infinite hierarchies and prior distributions.

[-]jsteinhardt15y00

Thanks for the link! Should make good reading. It sounds like you know a fair amount of ML; are you doing research in the field?

[-]Cyan15y00

I taught myself Bayesian statistics for use in my engineering Ph.D. (My advisor didn't care how I got good answers -- he cared that I got good answers.) Until recently I was a postdoc in a statistics lab, but the research did not focus on what I would consider cutting-edge Bayesian stats.

[-]Oscar_Cunningham15y00

Does each likelihood distribution have a unique conjugate prior? It doesn't seem immediately obvious that they do, but people say things like "The conjugate prior for the Bernoulli distribution is the beta distribution".

[-]Cyan15y00

No, in general there are many conjugate priors for a given likelihood, if for no other reason than that any weighted mixture of conjugate priors is also a conjugate prior.
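
A minimal sketch of why the mixture stays in the family, for a mixture of beta priors with a binomial likelihood. The two-component prior and the data are invented for the example: each component updates as usual, and the weights are rescaled by each component's marginal likelihood.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def update_beta_mixture(components, successes, failures):
    """Update a mixture of beta priors on binomial data.

    components is a list of (weight, alpha, beta) tuples; the result has
    the same form, so the mixture family is closed under updating.
    """
    updated = []
    for weight, a, b in components:
        # Marginal likelihood of the data under this component, up to the
        # binomial coefficient, which is shared and cancels on normalizing.
        log_evidence = log_beta(a + successes, b + failures) - log_beta(a, b)
        updated.append((weight * exp(log_evidence), a + successes, b + failures))
    total = sum(w for w, _, _ in updated)
    return [(w / total, a, b) for w, a, b in updated]


# Hypothetical bimodal prior: theta is probably high or probably low.
prior = [(0.5, 10.0, 2.0), (0.5, 2.0, 10.0)]
posterior = update_beta_mixture(prior, successes=7, failures=1)

for weight, a, b in posterior:
    print(f"weight {weight:.4f} on Beta({a}, {b})")
```

After seven successes in eight trials, nearly all of the posterior weight moves to the component that favored high theta.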

[-]Matt_Simpson15y00

What about the converse - does a conjugate prior exist for each likelihood (assume "nice" families of probability measures with an R-N derivative w.r.t. counting measure or Lebesgue measure, if you like)? I think probably not (with a fairly high degree of certainty), but I don't think I've ever seen a proof of it.

[-]Cyan15y30

The existence of a conjugate prior is not guaranteed. They exist for members of the exponential family, which is a very broad and useful class of distributions. I don't know of a proof, but if a gun were held to my head, I'd assert with reasonable confidence that the Cauchy likelihood doesn't have a conjugate prior.
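
For reference, a sketch of the standard construction for an exponential family in natural form. The hyperparameter names chi and nu are a notational assumption, not something fixed by the discussion above.

```latex
% Likelihood in natural-parameter form:
p(x \mid \eta) = h(x) \exp\!\big( \eta^{\top} T(x) - A(\eta) \big)

% Conjugate prior with hyperparameters \chi and \nu:
p(\eta \mid \chi, \nu) \propto \exp\!\big( \eta^{\top} \chi - \nu A(\eta) \big)

% Posterior after observing x_1, \dots, x_n has the same form:
p(\eta \mid x_{1:n}, \chi, \nu) \propto
  \exp\!\Big( \eta^{\top} \big( \chi + \textstyle\sum_{i=1}^{n} T(x_i) \big)
              - (\nu + n) A(\eta) \Big)

% Bernoulli case: \eta = \log\frac{\theta}{1-\theta}, \; T(x) = x, \;
% A(\eta) = \log(1 + e^{\eta}). Changing variables from \eta to \theta gives
p(\theta \mid \chi, \nu) \propto \theta^{\chi - 1} (1 - \theta)^{\nu - \chi - 1},
% i.e. a Beta(\chi, \nu - \chi) distribution.
```

The update only adds the sufficient statistics to chi and the sample size to nu, which is what makes pen-and-paper computation possible.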

[-]alex_zag_al12y20

I'm pretty sure that the Cauchy likelihood, like the other members of the t family, is a weighted mixture of normal distributions. (Gamma distribution over the inverse of the variance)

EDIT: There's a paper on this, "Scale mixtures of normal distributions" by Andrews and Mallows, if you want the details
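
A quick Monte Carlo sketch of that representation, assuming the usual parameterization in which the precision follows a Gamma distribution with shape 1/2 and rate 1/2. The sample size and seed are arbitrary choices for the example.

```python
import random
from math import sqrt


def sample_cauchy_via_mixture(rng):
    """Draw a standard Cauchy variate as a scale mixture of normals."""
    # Precision lam ~ Gamma(shape=1/2, rate=1/2). random.gammavariate takes
    # (shape, scale), so rate 1/2 corresponds to scale 2.
    lam = rng.gammavariate(0.5, 2.0)
    # Conditional on the precision, the draw is normal with variance 1/lam.
    return rng.gauss(0.0, 1.0 / sqrt(lam))


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 100_000
draws = [sample_cauchy_via_mixture(rng) for _ in range(n)]

# For a standard Cauchy, P(|X| < 1) = 1/2 exactly.
frac_within_one = sum(abs(x) < 1.0 for x in draws) / n
print(f"fraction of draws with |x| < 1: {frac_within_one:.3f}")
```

The check uses a quantile instead of a sample mean or variance because the Cauchy distribution has neither.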

[-]Cyan12y30

Oh, for sure it is. But that only gives it a conditionally conjugate prior, not a fully (i.e., marginally) conjugate prior. That's great for Gibbs sampling, but not for pen-and-paper computations.

In the three years since I wrote the grandparent, I've found a nice mixture representation for any unimodal symmetric distribution:

Suppose f(x), the pdf for a real-valued X, is unimodal and symmetric around 0. If W is positive-valued with pdf g(w) = -2w f'(w) and U ~ Unif(-W, W), then U's marginal distribution is the same as X. Proof is by integration-by-parts. ETA: No, wait, it's direct. Derp.

I don't think it would be too hard to convert this width-weighted-mixture-of-uniforms representation to a precision-weighted-mixture-of-normals representation.
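
A numerical sanity check of the mixture-of-uniforms representation, taking the standard normal as the example f and writing the mixing density with its normalizing factor, g(w) = -2w f'(w). The midpoint-rule integrator and the truncation at w = 12 are conveniences assumed for this sketch.

```python
from math import exp, pi, sqrt


def f(x):
    """Standard normal pdf, used here as the example unimodal symmetric f."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def f_prime(x):
    """Derivative of the standard normal pdf."""
    return -x * f(x)


def g(w):
    """Mixing density for the width W; the factor 2 makes it integrate to 1."""
    return -2.0 * w * f_prime(w)


def midpoint_integral(func, lower, upper, steps=20_000):
    """Midpoint rule; it never evaluates func at the endpoints."""
    h = (upper - lower) / steps
    return h * sum(func(lower + (i + 0.5) * h) for i in range(steps))


def marginal_of_u(u, upper=12.0):
    """Marginal density of U at u, where U | W = w is uniform on (-w, w)."""
    # The conditional density is 1 / (2w) whenever w > |u|, and 0 otherwise.
    return midpoint_integral(lambda w: g(w) / (2.0 * w), abs(u), upper)


total_mass = midpoint_integral(g, 0.0, 12.0)
print(f"integral of g over (0, 12): {total_mass:.6f}")
for u in (0.0, 0.5, 2.0):
    print(f"u={u}: marginal {marginal_of_u(u):.6f}   f(u) {f(u):.6f}")
```

The integrand g(w) / (2w) simplifies to -f'(w), so the integral from |u| upward is f(|u|) directly, which is the "direct" proof.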

[-]Matt_Simpson15y00

It turns out that it's not too difficult to construct a counterexample if you restrict the hyperparameter space of the family of prior distributions. For example, let theta take only two values, 0 and 1, in the likelihood f(x|theta), so that the prior just puts mass p on theta = 0 (i.e., P(theta = 0) = p) and mass 1 - p on theta = 1. If you restrict the prior to p < 0.5, then for some likelihoods and some values of x the posterior will put mass greater than 0.5 on theta = 0, which falls outside the restricted family.
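
A tiny numerical instance of this counterexample. The prior mass and the two likelihood values are invented for illustration.

```python
def posterior_mass_on_theta0(p, lik_theta0, lik_theta1):
    """Bayes' rule on the two-point parameter space {0, 1}.

    p is the prior mass P(theta = 0); lik_theta0 and lik_theta1 are the
    likelihoods f(x | theta = 0) and f(x | theta = 1) of the observed x.
    """
    numerator = p * lik_theta0
    return numerator / (numerator + (1.0 - p) * lik_theta1)


# Hypothetical numbers: a prior inside the restricted family (p < 0.5)
# and an observation x that is much more likely under theta = 0.
p_prior = 0.4
lik_theta0, lik_theta1 = 0.9, 0.1

p_post = posterior_mass_on_theta0(p_prior, lik_theta0, lik_theta1)
print(f"prior P(theta=0) = {p_prior}, posterior P(theta=0) = {p_post:.4f}")
```

The posterior mass on theta = 0 exceeds 0.5, so the restricted family is not closed under updating and is therefore not conjugate.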
