There are two ways to calculate the amount of information in one term of a continued fraction:
- The entropy of the Gauss-Kuzmin distribution is about 3.4325 bits.
- Twice the logarithm of the Khinchin-Lévy constant is about 3.4237 bits.
These differ by about 0.0088 bits. It took me a while to figure out why they were different at all, and now I'm surprised by how close they are.
π is about 3.
Actually, 3 is a bit too small, by about 1/7.
Actually, 7 is a bit too small, by about 1/15.
Actually, 15 is a lot too small, by almost 1/1.
But the 1 in the denominator is a tiny bit too small, by about 1/292.
We can keep doing this forever, and get an expression like:

$$\pi = 3 + \cfrac{1}{7 + \cfrac{1}{15 + \cfrac{1}{1 + \cfrac{1}{292 + \cdots}}}}$$
This is pretty useful for things like finding rational approximations. We can drop that $1/(292+\cdots)$, since it's really small, and get the excellent approximation $\pi \approx 355/113$. Or we can drop the $1/(15+\cdots)$ and get the rougher approximation $\pi \approx 22/7$.
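If you want to play with this, here's a quick Python sketch (my own, nothing official) that recovers the terms and both approximations. Floating point only gets you the first dozen or so terms reliably:

```python
import math
from fractions import Fraction

def cf_terms(x, n):
    """First n continued fraction terms of x.
    Floating point, so only trust the first dozen or so."""
    terms = []
    for _ in range(n):
        a = math.floor(x)
        terms.append(a)
        if x == a:
            break
        x = 1 / (x - a)
    return terms

def collapse(terms):
    """Collapse a list of terms back into a single fraction."""
    result = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        result = a + 1 / result
    return result

print(cf_terms(math.pi, 5))     # [3, 7, 15, 1, 292]
print(collapse([3, 7]))         # 22/7
print(collapse([3, 7, 15, 1]))  # 355/113
```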
In general, the continued fraction for a real number $x$ is of the form

$$x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cdots}}}$$

where the terms $a_1, a_2, a_3, \ldots$ are positive integers (except $a_0$, which is an arbitrary integer).
This is infinite for any irrational number; for a rational number, it stops after some number of terms, e.g. $\frac{355}{113} = 3 + \cfrac{1}{7 + \cfrac{1}{16}}$.
Truncating a continued fraction is the optimal way to approximate a real number with a rational: it gives a better approximation than anything with a smaller denominator.
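That claim is easy to spot-check for 355/113 (again my own sanity check; `round(pi * q)` picks the best numerator for each denominator):

```python
import math

# Error of 355/113; nothing with a smaller denominator should beat it.
best = abs(math.pi - 355 / 113)
for q in range(1, 113):
    p = round(math.pi * q)           # best numerator for this q
    assert abs(math.pi - p / q) > best
print("no fraction with denominator < 113 comes closer to pi")
```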
Continued fractions show up every now and then in different ways, but the approximation property is usually how I end up bumping into them.
The Gauss-Kuzmin distribution
For almost all real numbers, the probability that the $n$th term equals $k$ tends (as $n \to \infty$) to

$$p(k) = -\log_2\left(1 - \frac{1}{(k+1)^2}\right),$$

which is called the Gauss-Kuzmin distribution. I think it's also true that the fraction of $k$'s in the continued fraction converges to $p(k)$.
The entropy of this distribution is just $-\sum_{k=1}^{\infty} p(k) \log_2 p(k)$, which is pretty quick to compute out to a few digits of accuracy in your favorite programming language. It works out to about 3.4325 bits, so that's how much information there is in the term $a_n$, for large $n$.
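For instance, in Python (my sketch; the terms of the sum decay like $\log k / k^2$, so a big cutoff buys a few digits at the cost of a few seconds):

```python
import math

def p(k):
    """Gauss-Kuzmin probability that a term equals k."""
    return -math.log2(1 - 1 / (k + 1) ** 2)

# Truncate the infinite sum; the tail beyond 10^7 is negligible
# at this precision.
H = -sum(p(k) * math.log2(p(k)) for k in range(1, 10**7))
print(H)  # about 3.4325
```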
Lochs's theorem: For almost all real numbers, asymptotically, the accuracy of the continued fraction improves by a factor of $e^{\pi^2/(6 \ln 2)} \approx 10.73$ per term. This is amusingly close to 10, which is how much the accuracy of decimals improves per digit.
You get $\log_2 10 \approx 3.3219$ bits of information from a digit of decimal, so this means that you get $\log_2\left(e^{\pi^2/(6 \ln 2)}\right) = \frac{\pi^2}{6 (\ln 2)^2} \approx 3.4237$ bits of information per term of a continued fraction, asymptotically.
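For the record, the arithmetic (my own check of the constants above):

```python
import math

print(math.exp(math.pi**2 / (6 * math.log(2))))  # Lochs factor, about 10.731
print(math.log2(10))                             # about 3.3219 bits per decimal digit
print(math.pi**2 / (6 * math.log(2)**2))         # about 3.4237 bits per term
```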
So what gives? Why aren't those the same??
It's not about how you take the limits to get these two things, as I initially thought. Eventually, the error in both numbers above will drop so far that 0.0088 bits looks enormous by comparison, so we can just think about the behavior after a bajillion terms and not worry about the fact that both theorems are asymptotic.
The answer, unless I'm mistaken:
The Gauss-Kuzmin distribution is a marginal distribution, so its entropy is the entropy of the nth term if you don't know anything about previous terms.
Twice the logarithm of the Khinchin-Lévy constant is the entropy of the conditional distribution; it measures the entropy of the nth term if you know every previous term.
These are different because -- evidently -- different terms of the continued fraction are correlated, even for a generic real number, even in the limit of large n!
I implicitly assumed that the correlation would tend to zero, and barely noticed I was making this assumption.
It seems that if you know the first bajillion terms of a continued fraction, you actually have a tiny amount of information about the bajillion-plus-first term; less than a hundredth of a bit, but not zero.
I haven't tried to calculate the conditional distribution and confirm this yet, because it seems really hard to measure accurately. But I'm tempted, because I have no idea what this tiny correlation looks like. Is it just about consecutive terms, or longer-ranged than that? Is the size of consecutive terms correlated or anticorrelated? Is the information concentrated in one outcome -- "the next term will probably not be 283434" -- or spread out evenly across the whole distribution? Presumably there are results on this in the literature?
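For what it's worth, here's a sketch of the crudest version of that experiment (entirely my own, and statistically naive): a plug-in estimate of the mutual information between consecutive terms, with everything above 20 lumped into one bucket. It only probes nearest-neighbor correlations, and plug-in estimates are biased upward, so treat whatever small number it prints with suspicion:

```python
import math, random
from collections import Counter

def cf_terms(x, n):
    """First n continued fraction terms of x in (0, 1).
    Floats only carry ~53 bits, so keep n small."""
    terms = []
    for _ in range(n):
        if x == 0:
            break
        x = 1 / x
        a = math.floor(x)
        terms.append(a)
        x -= a
    return terms

CAP = 20  # lump all terms above CAP into a single bucket
pairs = Counter()
for _ in range(200_000):
    t = [min(a, CAP + 1) for a in cf_terms(random.random(), 12)]
    pairs.update(zip(t, t[1:]))

n = sum(pairs.values())
pa, pb = Counter(), Counter()
for (a, b), c in pairs.items():
    pa[a] += c
    pb[b] += c

# Plug-in estimate of the mutual information between consecutive terms,
# in bits; biased upward by roughly (number of bins) / (2 n ln 2).
mi = sum(c / n * math.log2(c * n / (pa[a] * pb[b])) for (a, b), c in pairs.items())
print(mi)  # small but positive
```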
But I'm still confused about why the difference is only about a quarter of a percent. Why should these numbers be so close, but still different?
If you have any intuition for this, please let me know.
If I'm remembering this right, you truncate by replacing a term $a_n + \cfrac{1}{a_{n+1} + \cdots}$ with either $a_n$ or $a_n + 1$. (In other words, by rounding the tail $\cfrac{1}{a_{n+1} + \cdots}$ down to zero or up to one.)
The logarithm has to be base 2 to make the probabilities add up to 1. Isn't that weird? (It works because $p(k) = \log_2 \frac{(k+1)^2}{k(k+2)}$, and the product $\prod_{k=1}^{\infty} \frac{(k+1)^2}{k(k+2)}$ telescopes to 2.)
Formally: with probability 1, $p(k)$ is the limit of the fraction of $k$'s in the first $n$ terms, as $n \to \infty$.
This part of the post originally said this was equivalent to the previous statement, but it's actually stronger.
Yes, there's gonna be a double logarithm in there.
Just to throw another theorem on the pile: for the rational approximation you get by truncating after $n$ terms, the $n$th root of the denominator approaches $e^{\pi^2/(12 \ln 2)} \approx 3.2758$ as $n \to \infty$. This is the square root of the number in Lochs's theorem and is called the Khinchin-Lévy constant, after Khinchin, who proved it was a constant, and Lévy, who found its value using the Gauss-Kuzmin distribution.
Why the square root? I think because the accuracy of these approximations goes like the square of the denominator -- see also Dirichlet's approximation theorem.
Lochs's theorem came late; I assume it relies on Lévy's work but I haven't seen the proof.
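Here's one way to watch that limit happen (my sketch; I use a random 1000-bit dyadic rational as a stand-in for a typical real, so the arithmetic stays exact for a couple hundred terms):

```python
import math, random
from fractions import Fraction

# A random 1000-bit dyadic rational: its continued fraction has ~580
# terms on average, so taking the first 200 is safe.
x = Fraction(random.getrandbits(1000), 2**1000)

N = 200
q_prev, q = 0, 1                   # q_{-1}, q_0
for _ in range(N):
    x = 1 / x                      # x stays in (0, 1) before inverting
    a = x.numerator // x.denominator
    q_prev, q = q, a * q + q_prev  # q_n = a_n q_{n-1} + q_{n-2}
    x -= a

print(q ** (1 / N))                               # near 3.28, with run-to-run scatter
print(math.exp(math.pi**2 / (12 * math.log(2))))  # 3.2758...
```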
One factor of $\ln 2$ in the denominator of $\frac{\pi^2}{6 (\ln 2)^2}$ is to make the unit "bits", and the other is from the exponent of the Khinchin-Lévy constant (and related to the Gauss-Kuzmin distribution having a $\log_2$ in it, I believe).