Epistemic Status: It took me a while to see that there is even a problem here. I'm fairly confident the analysis presented is correct. After searching the literature for a few hours, I realized it'd just be easier to write down the proof myself. I suppose there's nothing novel here except the realization that a lot of people treat as obvious something that I didn't.
I'm going to attempt to formalize a fairly specific interpretation of KL divergence. I'd like to motivate why this seems worth the effort by showing there is a gap in the literature. If you look around for an interpretation of KL divergence you'll find something resembling,
With KL divergence we can calculate exactly how much information is lost when we approximate one distribution with another.
The Wikipedia entry notes,
...[T]he KL divergence can be interpreted as the extra number of bits, on average, that are needed (beyond H(pA)) for encoding the events because of using pB for constructing the encoding scheme instead of pA.
This seems reasonable. Recall that entropy measures the optimal expected length of the codewords needed to compress data coming from pA. However, if we build our code from an approximate distribution, our codewords will be longer than if we had built it from the actual distribution the data comes from. So the KL divergence measures the sub-optimality of using a code designed for pB to compress data coming from pA.
Does this interpretation also imply the existence of a strategy to patch our approximate coding scheme so that it matches the proper distribution? The answer is not obvious. Just because two codes have a small difference in compression ability does not immediately imply that there exists a small amount of data you can add to one code to obtain the other. Nevertheless, I saw this claim on Wikipedia,
Data compression can be seen as a special case of data differencing...one can consider data compression as data differencing with empty source data, the compressed file corresponding to a "difference from nothing". This is the same as considering absolute entropy (corresponding to data compression) as a special case of relative entropy (corresponding to data differencing) with no initial data.
While the citation certainly backs up the claim that you can see data compression as a differencing from nothing, the article doesn't mention relative entropy (KL Divergence) a single time. After much searching, I was also able to find an excerpt from a master's thesis stating that,
From a pure mathematical standpoint, the delta encoding aims to create a patch for any given file ideally within the absolute entropy, but in reality the patches are subject to the Kullback–Leibler divergence.
However, the citation for this just takes you to the paper where KL divergence is first introduced. It turns out that Kullback and Leibler simply suggest KL divergence has this property!
...then we define log(pB(x)/pA(x)) as the information in x for discrimination between H1 and H2...
There's another way of communicating the difficulty. In chapter five of Elements of Information Theory, Thomas presents an interpretation of entropy as the average number of questions needed to identify a random variable in a game of “20 questions.” If the ideal strategy uses 12 questions on average and my strategy uses 14, does that mean I should expect to ask an additional two questions? Yes! Does that mean that effectively only two of our questions differ? Harder! The answer turns out to be yes, and the cost is proportional to the KL divergence between the two distributions.
Patches
My goal is to set up a concrete correspondence between KL divergence and a patching procedure that takes the code we have for pA and patches it so that its output can be decoded by the code built for pB. I'll first need to introduce the concepts of patching and arithmetic coding, and then the analysis will consist of finding a way to keep the codes close enough so that there's always a simple way to correct discrepancies as they arise.
We'll refer to a patch as a sequence of changes to a data structure designed to improve it. To be concrete, we will study how to patch an arithmetic code so that the approximate code will output the same codewords as the optimal code. This is the minimum patch needed so that two compression schemes can communicate. For right now, you can just think of patching as building a patch for a binary file such as,
0000111010101001010001010010
that will encode how to convert it into a different file such as,
0010111011111000010001010010
There are a number of open-source and commercially available patching tools. For example, Linux has a diff utility that will show the differences between two input files.
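To make the idea concrete, here is a minimal sketch in Python of one way to build and apply such a patch, assuming for simplicity that the two files are equal-length bit strings and that recording the differing positions is an acceptable patch format (the function names are hypothetical, just for illustration):

```python
# A toy patch format: record only the positions where two equal-length
# bit strings disagree, then flip those positions to recover the target.

def make_patch(old: str, new: str) -> list[int]:
    """Return the indices at which the two bit strings differ."""
    assert len(old) == len(new)
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]

def apply_patch(old: str, patch: list[int]) -> str:
    """Flip the recorded positions of `old` to reconstruct the target."""
    bits = list(old)
    for i in patch:
        bits[i] = '1' if bits[i] == '0' else '0'
    return ''.join(bits)

old = "0000111010101001010001010010"
new = "0010111011111000010001010010"
patch = make_patch(old, new)
assert apply_patch(old, patch) == new
print(f"patch touches {len(patch)} of {len(new)} bits")
```

If the two files are similar, the patch is much smaller than retransmitting the whole target, which is the intuition we want to carry over to codes.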
Arithmetic Codes
A basic problem of information theory is that we receive a stream of binary variables a0,…,an sampled from a Bernoulli distribution pA and would like to compress the stream as well as possible. The entropy H(pA) gives us an asymptotic bound on the expected number of symbols we will have to use to encode words a^n=an…a0 as n→∞.
For an arithmetic code, we represent our codewords using intervals by associating an operator with each symbol, giving ^a0 and ^a1. We start with the unit interval and then compute the image of the current interval under the operator for the current symbol ai of our stream. In this way, we can associate every word a^n with the composed operator ^an^an−1…^a0 and the interval it carves out of the unit interval. Just for fun, we note that this is a representation of the free monoid corresponding to our stream. We could also think of this as an iterated function system (IFS) or a fractal with measure. Below is an example of what this representation looks like for a pA that takes the value 0 or 1 with equal probability.
We can define our arithmetic code for pA as the function A:ΩA→[0,1]^2 that takes words in the sample space of pA to intervals that are subsets of the unit interval I=[0,1] via,

$$A(a^n) = A(a_n \ldots a_0) = \hat{a}_n \cdots \hat{a}_0(I) = I_{a_n \ldots a_0} = I_{a^n}$$
What's interesting is that you can approximate the minimal number of digits it takes to specify a representative point x∈I_{a^n} by calculating −log(μ(I_{a^n})), where μ is the measure, or length, of the interval under consideration. The clearest way to see this is to take the logarithm in base 10. If we have an interval of the form [l,r] then its length is r−l; written as a decimal, the length starts with a run of zero digits, and the endpoints agree on their leading digits up to the first place where they differ. Taking −log10(r−l) counts roughly how many digits matched, which is how many digits we need to write down to single out a point inside the interval.
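As a sketch of this construction, assuming the natural choice where the operator for each symbol scales the current interval by that symbol's probability (and processing symbols in arrival order), the following Python computes the interval for a word and the digit estimate −log10 of its length:

```python
import math

def arithmetic_interval(word, p0):
    """Map a word to its interval by repeatedly splitting the current
    interval: symbol 0 takes the left fraction of length p0, symbol 1
    takes the remaining right fraction."""
    l, r = 0.0, 1.0
    for a in word:
        width = r - l
        if a == 0:
            r = l + p0 * width
        else:
            l = l + p0 * width
    return l, r

l, r = arithmetic_interval([0, 1, 1, 0, 0, 1, 0, 1], p0=0.5)
print(f"interval = [{l:.8f}, {r:.8f}], length = {r - l:.8f}")
print(f"digits to specify a point inside: about {-math.log10(r - l):.2f}")
```

With p0 = 0.5 every length-8 word gets an interval of length 2^−8, so the estimate is about 2.4 decimal digits, i.e. 8 bits.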
We may also use μA to refer to μ(A(w)). If we receive a random word of length n we can calculate the number of digits per symbol needed to represent the word as,

$$-\frac{1}{n}\log\left(\mu\left(I_{a^n}\right)\right) = -\frac{1}{n}\sum_{i=1}^{n}\log\left(\mu\left(I_{a_i}\right)\right) \xrightarrow{\;p_A\;} \mathbb{E}_{p_A}\left[-\log\left(\mu\left(I_{a}\right)\right)\right] = H(p_A, \mu_A)$$
where the right-hand side is the cross-entropy. It's worth going back to the "20 questions" interpretation of entropy here. Say I pick a random point x on the unit interval by sampling a word from pA, and you can only ask questions of the form: is x∈A(a^n)? How many questions does it take to determine x? The Thomas result in this context tells us that the number of questions we need to ask is bounded by H(pA). We also see that if we changed our definition of ^a so that μA≠pA, the cross-entropy would go up, so in this sense there is a natural way to define our operators.
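As a quick numerical check of this convergence, the sketch below (using an illustrative Bernoulli pA with p(0)=0.3 and base-2 logarithms; the numbers are only for illustration) averages −log μ per symbol over a long stream. When μ matches pA the average approaches H(pA); when μ is mismatched it approaches the larger cross-entropy:

```python
import math, random

def avg_digits_per_symbol(p0_true, mu0_code, n=200_000, seed=0):
    """Average -log2(mu) per symbol when symbols are drawn with
    P(0) = p0_true but the code assigns interval lengths mu0_code
    and 1 - mu0_code to the symbols 0 and 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = 0 if rng.random() < p0_true else 1
        total += -math.log2(mu0_code if a == 0 else 1 - mu0_code)
    return total / n

p0 = 0.3
entropy = -p0 * math.log2(p0) - (1 - p0) * math.log2(1 - p0)
print(avg_digits_per_symbol(p0, mu0_code=0.3))  # ~ H(p_A) ≈ 0.881 bits
print(avg_digits_per_symbol(p0, mu0_code=0.5))  # ~ 1 bit: the cross-entropy
print(entropy)                                  # exact H(p_A)
```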
Now we can be clearer about why there is difficulty in trying to interpret KL divergence as the number of additional questions we need to ask to determine our variable. Because A(w^n)≠B(w^n), all of our questions are different, so naively we'd expect to have to alter every question we'd ask under A, which would be wildly inefficient from an information-theoretic standpoint. However, as we'll see in a moment, there is a way to patch this difference if we assume pA∼pB.
How to Patch the Gap
So we've established the optimality of arithmetic codes when we pick our transformations so that the length of each interval is identical to the probability of the word it represents. However, our problem is that we want to patch A so that it better approximates B. If pA and pB are really close we'd expect A and B to be close as well, and this is true, at least visually. We can modify the probabilities so 0 is a bit more likely than 1 and see what happens.
This looks at least somewhat similar. We see this more clearly by highlighting which portions of the intervals overlap,
We can see that the overlap between the intervals takes several iterations to vanish. When the intervals overlap we can use identical representatives for the intervals generated by A and B. However, if pA≠pB, only words up to a certain length will have such a perfect correspondence. There will be a minimum nc such that,
$$P\left(A(w^{n_c}) \cap B(w^{n_c}) = \emptyset\right) > 0$$
Certainly nc>0 because both codes start with the unit interval. We also know nc>1 because both codes are binary and only one division has been made at that point, so the interval for 0 (and likewise for 1) under A still overlaps its counterpart under B. As pA→pB it wouldn't be hard to show that the critical number nc→∞.
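Under the same illustrative interval construction as the earlier sketch, the critical length can be found by brute force: enumerate words of increasing length and stop at the first length where some word's intervals under the two codes are disjoint. A sketch, with example probabilities only:

```python
from itertools import product

def interval(word, p0):
    """Interval a word receives when symbol 0 takes the left fraction
    of length p0 of the current interval and symbol 1 takes the rest."""
    l, r = 0.0, 1.0
    for a in word:
        width = r - l
        if a == 0:
            r = l + p0 * width
        else:
            l = l + p0 * width
    return l, r

def critical_length(pA0, pB0, max_n=20):
    """Smallest word length at which some word's intervals under the
    codes for p_A and p_B are disjoint (the n_c of the text)."""
    for n in range(1, max_n + 1):
        for word in product([0, 1], repeat=n):
            la, ra = interval(word, pA0)
            lb, rb = interval(word, pB0)
            if ra <= lb or rb <= la:  # intervals do not overlap
                return n
    return None

print(critical_length(0.5, 0.55))  # disjointness appears after a few symbols
print(critical_length(0.5, 0.51))  # n_c grows as p_A approaches p_B
```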
The utility of studying this flavor of correspondence is that it lets us define a local mechanism to patch intervals that are drifting apart. Consider a word w such that A(w)∩B(w)≠∅; we would like to bound how far apart the intervals are in Hamming distance. To be more specific, we'd like to see how much additional information is needed to map A(w) onto B(w). First, because the intervals overlap, we would only need a constant number of digits to align one side. Second, we can make μA(w)=μB(w) by introducing a scaling factor μA(w)/μB(w), which takes about |log(μA(w)/μB(w))| digits to specify, and whose sign indicates whether to add digits (if μA(w)>μB(w)) or remove digits (if μA(w)<μB(w)).
An example of this fixing is shown in the figure below. For example, for 000 we need to make the interval larger, but then we also need to make 100 shorter, so the idea is that we use some of the information we're deleting from 000 to help fix 100.
Thus, ignoring an initial overhead cost, we could theoretically coordinate digit deletions so that they specify the digits to add to another interval. If you don't like that interpretation, then specify the edit distance using |log(μA/μB)|. We could also specify a more general measure of interval divergence using an F-divergence. More on that in a moment.
Say we receive a random word w^n from pB such that n<nc. There won't be a problem encoding our word. However, the intervals will have diverged a bit. In addition to supplying A(w^n) we also need to supply patching information that will realign A(w^n) with B(w^n). We already determined a cost for such patching, so in the limit of receiving a sequence of words the expected patching cost amounts to,

$$-\frac{1}{k}\sum_{i=1}^{k}\log\left(\frac{\mu_A(w^n_i)}{\mu_B(w^n_i)}\right) \xrightarrow{\;n,k\to\infty\;} \mathbb{E}_{p_B}\left[-\log\left(\frac{\mu_A(w)}{\mu_B(w)}\right)\right] = D_{KL}(p_B\|\mu_A) - D_{KL}(p_B\|\mu_B)$$

Finally, since μA=pA and μB=pB we have,

$$D_{KL}(p_B\|\mu_A) - D_{KL}(p_B\|\mu_B) \to D_{KL}(p_B\|p_A)$$
Thus, once we create our optimal arithmetic codes, the KL divergence bounds the complexity of the patching needed so that A operates identically to B. Say you disagree with the choice of edit distance and think we should've used the absolute difference: KL divergence is still an asymptotic lower bound on this complexity. The caveat is that, in addition to receiving many words, we also need our code correspondence to hold for large words.
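Here is a small Monte Carlo sketch of this convergence, with illustrative Bernoulli parameters pB(0)=0.55 and pA(0)=0.5 (chosen only for the example) and the patch cost reported per symbol by dividing each word's log-ratio by its length n:

```python
import math, random

def kl_bernoulli(p, q):
    """D_KL(Ber(p) || Ber(q)) in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def avg_patch_cost(pB0, pA0, n=20, k=50_000, seed=0):
    """Average per-symbol patch cost -log(mu_A/mu_B) = log(mu_B/mu_A)
    over k words of length n drawn from p_B; interval lengths multiply
    along a word, so the log-ratio is a sum over symbols."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        log_ratio = 0.0
        for _ in range(n):
            b = 0 if rng.random() < pB0 else 1
            mu_a = pA0 if b == 0 else 1 - pA0
            mu_b = pB0 if b == 0 else 1 - pB0
            log_ratio += math.log2(mu_b / mu_a)
        total += log_ratio / n
    return total / k

pB0, pA0 = 0.55, 0.5
print(avg_patch_cost(pB0, pA0))  # simulated patch cost per symbol
print(kl_bernoulli(pB0, pA0))    # D_KL(p_B || p_A) per symbol, ~0.0072 bits
```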
Continuing with the "20 questions" example, we see that the solution is to call two lines of questioning equivalent if A(wn)∩B(wn) is non-empty. The average amount of information we need to add to each question in a line of questioning A(wn) so that it will continue to correspond to B(wn) is bounded by the KL divergence.
Non-Asymptotic Convergence and Beyond
Our previous analysis required nc≫1, which means that we need pA∼pB for a strong guarantee. There is a way to extend this result. Since we're using A to approximate B, we can parameterize pA with θ0 and pB with θ=θ0+δθ and choose our digit F-divergence as,
$$D_f(p_B\|p_A) = -\frac{\partial^2 \log p_A(w,\theta)}{\partial\theta^2} \ge 0$$
which has the nice property of being non-negative. Moreover, to second order we end up with,

$$D_{KL}(p_B\|p_A) \sim \frac{1}{2}\,\mathbb{E}\left[D_f(p_B\|p_A)\right]\delta\theta^2$$

where the expectation on the right-hand side is the Fisher information metric. Now we could generalize to non-close distributions by finding the length of the geodesic between pA and pB. This basically gives us the continuum limit of a sequence of chained divergences between the distributions. We could reasonably take this as the natural non-perturbative patching complexity between two distributions.
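As a numerical sanity check of this second-order relation, the sketch below uses a Bernoulli family, whose Fisher information is 1/(θ(1−θ)), and compares the exact KL divergence to (1/2)·I(θ0)·δθ² for a small δθ (the specific numbers are only illustrative):

```python
import math

def kl_bernoulli(p, q):
    """D_KL(Ber(p) || Ber(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher_bernoulli(theta):
    """Fisher information of the Bernoulli family: E[-d^2 log p / d theta^2]."""
    return 1.0 / (theta * (1.0 - theta))

theta0, dtheta = 0.5, 0.01
exact = kl_bernoulli(theta0 + dtheta, theta0)          # D_KL(p_B || p_A)
approx = 0.5 * fisher_bernoulli(theta0) * dtheta ** 2  # second-order estimate
print(exact, approx)  # the two agree up to O(dtheta^3)
```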
In general, there's a lot more we could say on this topic. For example, the main result showing that our patch complexity converges to the KL divergence also matches the best asymptotic rate at which two statistical hypotheses can be distinguished, which was alluded to by Kullback and Leibler in their paper and is discussed in more depth by Cover and Thomas. The main reason I decided not to proceed by the statistical testing route was the delicacy of handling the critical number between A and B. The main problem is that by the time we have enough evidence to distinguish our two hypotheses, the arithmetic codes have already diverged, which means that there's a lot of inefficiency in creating a patch that way.