Epistemic Status: It took me a while to see that there is even a problem here. I'm fairly confident the analysis presented is correct. After searching the literature for a few hours, I realized it'd just be easier to write down the proof myself. I suppose there's nothing novel here except the realization that there is a thing a lot of people think is obvious that I didn't.
I'm going to attempt to formalize a fairly specific interpretation of KL divergence. I'd like to motivate why this seems worth the effort by showing there is a gap in the literature. If you look around for an interpretation of KL divergence you'll find something resembling,
With KL divergence we can calculate exactly how much information is lost when we approximate one distribution with another.
The Wikipedia entry notes,
...[T]he KL divergence can be interpreted as the extra number of bits, on average, that are needed (beyond $H(P)$) for encoding the events because of using $Q$ for constructing the encoding scheme instead of $P$.
This seems reasonable. Recall that the entropy $H(P)$ measures the optimal average length of the codewords needed to compress data coming from $P$. However, if we design our code for an approximate distribution $Q$, our codewords will be longer than if we had used the actual distribution $P$. So the KL divergence measures the sub-optimality of using a code for $Q$ to compress things from $P$.
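As a concrete check on this reading, here is a minimal sketch for the Bernoulli case (the helper names are mine): the per-symbol overhead of a code built for $Q$ but used on data from $P$ is exactly the cross-entropy minus the entropy, which is the KL divergence.

```python
import math

def entropy(p):
    """Shannon entropy of a Bernoulli(p) source, in bits per symbol."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def cross_entropy(p, q):
    """Expected bits per symbol when data from Bernoulli(p) is
    compressed with a code optimized for Bernoulli(q)."""
    return -p * math.log2(q) - (1 - p) * math.log2(1 - q)

def kl(p, q):
    """KL divergence D(p || q) between Bernoulli distributions, in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

p, q = 0.5, 0.6
# The overhead of the mismatched code is exactly the KL divergence.
assert abs(cross_entropy(p, q) - entropy(p) - kl(p, q)) < 1e-12
```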
Does this interpretation also imply the existence of a strategy to patch our approximate coding scheme to the proper distribution? The answer is not obvious. Just because we have two codes that have a small difference in compression ability does not immediately imply that there exists a small amount of data that you can add to the codes to get one from the other. Nevertheless, I saw this claim on Wikipedia,
Data compression can be seen as a special case of data differencing...one can consider data compression as data differencing with empty source data, the compressed file corresponding to a "difference from nothing". This is the same as considering absolute entropy (corresponding to data compression) as a special case of relative entropy (corresponding to data differencing) with no initial data.
While the citation certainly backs up the claim that you can see data compression as a differencing from nothing, the article doesn't mention relative entropy (KL Divergence) a single time. After much searching, I was also able to find an excerpt from a master's thesis stating that,
From a pure mathematical standpoint, the delta encoding aims to create a patch for any given file ideally within the absolute entropy, but in reality the patches are subject to the Kullback–Leibler divergence.
However, the citation for this just takes you to the paper where KL divergence is first introduced. It turns out that Kullback and Leibler simply suggest KL divergence has this property!
...then we define $\log \frac{f_1(x)}{f_2(x)}$ as the information in $x$ for discrimination between $H_1$ and $H_2$...
There's another way of communicating the difficulty. In chapter five of Elements of Information Theory, Cover and Thomas present an interpretation of entropy as the average number of questions needed to identify a random variable in a game of “20 questions.” If the ideal strategy uses 12 questions on average and my strategy uses 14, does that mean I should expect to ask two additional questions? Yes! Does that mean that effectively only two of our questions differ? Harder! The answer turns out to be yes and the cost is proportional to the KL divergence between the two distributions.
My goal is to concretely correspond KL divergence to a patching procedure that will take the code we have for $Q$ and patch it so that it can be decoded by the code for $P$. I'll first need to introduce the concepts of patching and arithmetic coding, and then the analysis will consist of finding a way to keep the codes close enough so that there's always a simple way to correct discrepancies as they arise.
We'll refer to a patch as a sequence of changes to a data structure designed to improve it. To be concrete, we will study how to patch an arithmetic code so that the approximate code will output the same codewords as the optimal code. This is the minimum patch needed so that the two compression schemes can communicate. For right now, you can just think of a patch as a binary diff: a file that encodes how to convert one binary file into another. There are a number of open-source and commercially available patching tools. More generally, Linux has a diff utility that will show the differences between two input files.
A basic problem of information theory is that we receive a stream of binary variables sampled from a Bernoulli distribution $P$ and would like to compress the stream as best as possible. The entropy $H(P)$ gives us an asymptotic bound of $nH(P)$ on the expected number of symbols we will have to use to encode words of length $n$ as $n \to \infty$.
For an arithmetic code, we represent our codewords using intervals by associating operators $T_0$ and $T_1$ with the two symbols. We start with the unit interval and then compute the image of the interval under the operator for the current symbol of our stream. In this way, we can associate every word $w$ to an interval $C(w)$. Just for fun, we note that this is a representation of the free monoid corresponding to our stream. We could also think of this as an iterated function system (IFS) or a fractal with measure. We have an example of what this representation looks like for a variable that takes the value $0$ or $1$ with equal probability.
We can define our arithmetic code for $P$ as the function $C_P$ that takes words $w$ in the sample space of $P$ to intervals that are subsets of the unit interval via,

$$C_P(w_1 w_2 \cdots w_n) = T_{w_n} \circ \cdots \circ T_{w_1}([0, 1))$$
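A minimal sketch of this construction for a Bernoulli($p$) source (the helper names `refine` and `encode_interval` are my own, not standard):

```python
def refine(interval, symbol, p):
    """Operator T_symbol: map `interval` to the sub-interval whose
    relative length equals the probability of `symbol` under Bernoulli(p)."""
    lo, hi = interval
    split = lo + p * (hi - lo)        # T_0 image is [lo, split), T_1 is [split, hi)
    return (lo, split) if symbol == 0 else (split, hi)

def encode_interval(word, p):
    """The interval C_P(word): start from [0, 1) and refine once per symbol."""
    interval = (0.0, 1.0)
    for s in word:
        interval = refine(interval, s, p)
    return interval

lo, hi = encode_interval([0, 1, 1], 0.5)
# Every length-3 word under p = 0.5 gets an interval of length 1/8, its probability.
assert abs((hi - lo) - 0.125) < 1e-12
```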
What's interesting is that you can approximate the minimal number of digits it takes to specify a representative point by calculating $-\log \mu(C(w))$, where $\mu(C(w))$ is the measure or length of the interval under consideration. The clearest way to see this is to take the logarithm base 10. If we have an interval whose endpoints share their first $k$ decimal digits, then the length is at most $10^{-k}$: the difference of the endpoints has a series of zero digits until we get to the first place the numbers differ. If we take the logarithm of the length, then this measures the number of digits that matched.
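Here's a quick numerical check of this digit-counting claim, using exact rational arithmetic to dodge floating-point edge cases (`digits_needed` is a helper I made up for illustration):

```python
import math
from fractions import Fraction

def digits_needed(lo, hi):
    """Smallest k such that some k-digit decimal 0.d1...dk lies in [lo, hi)."""
    lo, hi = Fraction(lo), Fraction(hi)   # exact arithmetic, no float surprises
    k = 0
    while math.ceil(lo * 10**k) >= hi * 10**k:
        k += 1
    return k

lo, hi = 0.1234, 0.1236
# -log10(interval length) predicts the digit count to within about one digit.
assert abs(digits_needed(lo, hi) + math.log10(hi - lo)) <= 1
```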
We may also use $\mu_P(w)$ to refer to $\mu(C_P(w))$. If we receive a random word of length $n$ drawn from $P$, we can calculate the expected number of digits needed to represent the word as,

$$\mathbb{E}_{w \sim P}[-\log \mu_Q(w)] = n H(P, Q)$$

where the right side indicates the cross-entropy. It's worth going back to the "20 questions" interpretation of entropy here. Say I pick a random point $x$ sampled from $P$ on the unit interval and you can only ask questions of the form: is $x < c$? How many questions does it take to determine $x$? The Cover and Thomas result in this context tells us that the expected number of questions we need to ask is bounded by $H(P) + 2$. We also see that if we changed our definition of $T_0$ and $T_1$ so that the interval lengths no longer matched the probabilities, the cross-entropy would go up, so in some sense there is a natural way to define our operators.
Now we can be clearer about why there is difficulty in trying to interpret KL divergence as the number of additional questions we need to ask to determine our variable. All of our questions are different, so naively we'd expect to have to alter every question we'd ask under $Q$, which would be wildly inefficient from an information-theoretic standpoint. However, as we'll see in a moment, there is a way to patch this difference if we assume $P \approx Q$.
How to Patch the Gap
So we've established the optimality of arithmetic codes if we pick our transformations so that the length of each interval is identical to the probability of the word it represents. However, our problem is that we want to patch $C_Q$ so that it better approximates $C_P$. If $P$ and $Q$ are really close we'd expect $C_P$ and $C_Q$ to be close as well, and this is true, at least visually. We can modify the probabilities so one symbol is a bit more likely than the other and see what happens.
This looks at least somewhat similar. We see this more clearly by highlighting which portions of the intervals overlap,
We can see that the overlap between corresponding intervals takes several iterations to vanish. When the intervals overlap we can use identical representatives for the intervals generated by $C_P$ and $C_Q$. However, if $P \neq Q$, only words up to a certain length will have such a perfect correspondence. There will be a minimum length,

$$n^* = \min \{ |w| : C_P(w) \cap C_Q(w) = \emptyset \}.$$

Certainly $n^* > 0$ because both codes start with the unit interval. We also know $n^* > 1$ because both codes are binary and the single-symbol intervals will share either the endpoint $0$ or the endpoint $1$, since only one division has been made at this point. As $Q \to P$, it wouldn't be hard to show that the critical number $n^* \to \infty$.
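To make the critical length concrete, here's a brute-force search over all words of a given length for the first length at which some pair of corresponding intervals becomes disjoint (a sketch, assuming the standard Bernoulli interval construction; helper names are mine):

```python
from itertools import product

def refine(interval, symbol, p):
    """Sub-interval selected by `symbol` under a Bernoulli(p) arithmetic code."""
    lo, hi = interval
    split = lo + p * (hi - lo)
    return (lo, split) if symbol == 0 else (split, hi)

def overlaps(a, b):
    return max(a[0], b[0]) < min(a[1], b[1])

def critical_length(p, q, max_n=16):
    """Smallest word length at which some word's P- and Q-intervals are
    disjoint, i.e. where the perfect correspondence first fails."""
    for n in range(1, max_n + 1):
        for word in product([0, 1], repeat=n):
            ip, iq = (0.0, 1.0), (0.0, 1.0)
            for s in word:
                ip, iq = refine(ip, s, p), refine(iq, s, q)
            if not overlaps(ip, iq):
                return n
    return None

# The closer q is to p, the longer the two codes stay in correspondence.
assert critical_length(0.5, 0.3) < critical_length(0.5, 0.45)
```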
The utility of studying this flavor of correspondence is that it lets us define a local mechanism to patch intervals that are drifting apart. Consider a word $w$ such that $C_P(w) \cap C_Q(w) \neq \emptyset$; then we would like to bound how far apart the intervals are in Hamming distance. To be more specific, we'd like to see how much additional information is needed to map $C_Q(w)$ onto $C_P(w)$. First, because the intervals overlap we would only need a constant number of digits to align one side. Second, we can make the lengths agree by introducing a scaling factor $\frac{\mu(C_P(w))}{\mu(C_Q(w))}$, which takes about $\left| \log \frac{\mu(C_P(w))}{\mu(C_Q(w))} \right|$ digits to specify and indicates to add digits if $\mu(C_P(w)) < \mu(C_Q(w))$ or remove digits if $\mu(C_P(w)) > \mu(C_Q(w))$.
An example of the fixing is shown in the figure below. When one symbol's interval needs to be made larger, the other's needs to be made shorter, so the idea is that we use some of the information we're deleting from one interval to help fix the other.
Thus, ignoring an initial overhead cost, we could theoretically coordinate digit deletions so that they specify the digits to add to another interval. If you don't like that interpretation, then specify the edit distance using $\left| \log \frac{\mu(C_P(w))}{\mu(C_Q(w))} \right|$. We could also specify a more general measure of interval divergence using an f-divergence. More on that in a moment.
Say we receive a random word $w$ from $P$ such that $C_P(w) \cap C_Q(w) \neq \emptyset$. There won't be a problem encoding our word. However, the intervals will have diverged a bit. In addition to supplying the codeword, we also need to supply patching information that will realign $C_Q(w)$ with $C_P(w)$. We already determined a cost for such patching, so in the limit of receiving a sequence of words the expected patching cost amounts to,

$$\mathbb{E}_{w \sim P}\left[ \log \frac{\mu(C_P(w))}{\mu(C_Q(w))} \right].$$
Finally, since $\mu(C_P(w)) = P(w)$ and $\mu(C_Q(w)) = Q(w)$ we have,

$$\mathbb{E}_{w \sim P}\left[ \log \frac{P(w)}{Q(w)} \right] = D_{KL}(P \| Q).$$
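A quick Monte Carlo sanity check of this limit for Bernoulli codes (a sketch with made-up parameter values): the average per-symbol log-ratio of interval lengths for words drawn from $P$ concentrates around the KL divergence.

```python
import math
import random

def kl(p, q):
    """D(p || q) for Bernoulli distributions, in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def log_ratio(word, p, q):
    """log2(mu(C_P(w)) / mu(C_Q(w))): the signed gap, in digits, between
    the two codes' interval lengths for the word w."""
    return sum(math.log2((p if s else 1 - p) / (q if s else 1 - q)) for s in word)

random.seed(0)
p, q, n, trials = 0.5, 0.6, 200, 2000
total = sum(log_ratio([1 if random.random() < p else 0 for _ in range(n)], p, q)
            for _ in range(trials))
avg = total / (trials * n)
# The per-symbol patching cost concentrates around D(P || Q).
assert abs(avg - kl(p, q)) < 0.01
```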
Thus, once we create our optimal arithmetic codes, the KL divergence bounds the complexity of the patching needed so that $C_Q$ operates identically to $C_P$. Say you disagree with the choice of edit distance and think we should've used the absolute difference $\mathbb{E}_{w \sim P}\left[ \left| \log \frac{P(w)}{Q(w)} \right| \right]$: KL divergence is still an asymptotic lower-bound on this complexity. The caveat is that, in addition to receiving many words, we also need our code correspondence to hold for long words.
Continuing with the "20 questions" example, we see that the solution is to call two lines of questioning equivalent if $C_P(w) \cap C_Q(w)$ is non-empty. The average amount of information we need to add to each question in a line of questioning under $Q$ so that it will continue to correspond to one under $P$ is bounded by the KL divergence.
Non-Asymptotic Convergence and Beyond
Our previous analysis required $C_P(w) \cap C_Q(w) \neq \emptyset$, which means that we need $Q \approx P$ for a strong guarantee. There is a way to extend this result. Since we're using $Q$ to approximate $P$, we can parameterize $P$ with $\theta$ and $Q$ with $\theta + d\theta$ and choose our digit divergence as,

$$D(\theta \| \theta + d\theta) = \mathbb{E}_{w \sim P_\theta}\left[ \log \frac{P_\theta(w)}{P_{\theta + d\theta}(w)} \right],$$

which has the nice property of being non-negative. Moreover, to second order we end up with,

$$D(\theta \| \theta + d\theta) \approx \frac{1}{2} g_{ij}(\theta)\, d\theta^i d\theta^j,$$

where $g_{ij}$ is the Fisher information metric. Now we could generalize to non-close distributions by finding the length of the geodesic between $P$ and $Q$. This basically gives us the continuum limit of a sequence of chained divergences between the distributions. We could reasonably take this as the natural non-perturbative patching complexity between two distributions.
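As a numerical sanity check of the second-order claim for the Bernoulli family, where the Fisher information is $1/(\theta(1-\theta))$ (a sketch, in nats):

```python
import math

def kl_nats(p, q):
    """D(p || q) for Bernoulli distributions, in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher(theta):
    """Fisher information of the Bernoulli family at theta."""
    return 1.0 / (theta * (1.0 - theta))

theta, d = 0.3, 1e-3
quadratic = 0.5 * fisher(theta) * d**2   # (1/2) g(theta) d^2
# To second order the divergence agrees with the Fisher quadratic form.
assert abs(kl_nats(theta, theta + d) - quadratic) < 1e-8
```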
In general, there's a lot more we could say on this topic. For example, the main result showing that our patch complexity converges to the KL divergence also matches the best asymptotic rate at which two statistical hypotheses can be distinguished, which was alluded to by Kullback and Leibler in their paper and is discussed in more depth by Cover and Thomas. The main reason I decided not to proceed by the statistical testing route was the delicacy of handling the critical word length at which the codes $C_P$ and $C_Q$ stop overlapping. The main problem is that by the time we have enough evidence to distinguish our two hypotheses, the arithmetic codes have already diverged, which means that there's a lot of inefficiency in creating a patch that way.