Six (and a half) intuitions for KL divergence
KL divergence is a topic which crops up in a ton of different places in information theory and machine learning, so it's important to understand well. Unfortunately, it has some properties which seem confusing on a first pass (e.g. it isn't symmetric, as we would expect a distance measure to be, and it can be unbounded as we take the limit of probabilities going to zero). I've come across lots of different ways to develop good intuitions for it. This post is my attempt to collate all these intuitions and to identify the underlying commonalities between them. I hope that for everyone reading this, there will be at least one that you haven't come across before and that improves your overall understanding!

One other note - there is some overlap between these intuitions (some of them are pretty much just rephrasings of others), so you might want to just browse the ones that look interesting to you. Also, I expect a large fraction of the value of this post (maybe >50%) comes from the summary, so you might just want to read that and skip the rest!

Summary

1. Expected surprise
> $D_{KL}(P||Q)$ = how much more surprised you expect to be when observing data with distribution P, if you falsely believe the distribution is Q, versus if you know the true distribution.

2. Hypothesis testing
> $D_{KL}(P||Q)$ = the amount of evidence we expect to get for P over Q in hypothesis testing, if P is true.

3. MLEs
> If P is an empirical distribution of data, $D_{KL}(P||Q)$ is minimised (over Q) when Q is the maximum likelihood estimator for P.

4. Suboptimal coding
> $D_{KL}(P||Q)$ = the expected number of bits we waste if we try to compress a data source with distribution P using a code which is actually optimised for Q (i.e. a code which would have minimum expected message length if Q were the true data source distribution).

5A. Gambling games - beating the house
> $D_{KL}(P||Q)$ = the amount (in log-space) we can win from a casino game, if we know the true