A hundredth of a bit of extra entropy

28Oscar_Cunningham

5Adam Scherlis

1Czynski

7Oscar_Cunningham

New Comment

4 comments, sorted by Click to highlight new comments since: Today at 6:49 AM

Nice idea! We can show directly that each term provides information about the next.

The density function of the distribution of the fractional part in the continued fractional algorithm converges to 1/[(1+x) ln(2)] (it seems this is also called the Gauss-Kuzmin distribution, since the two are so closely associated). So we can directly calculate the probability of getting a coefficient of n by integrating this from 1/(n+1) to 1/n, which gives -lg(1-1/(n+1)^2) as you say above. But we can also calculate the probability of getting an n followed by an m, by integrating this from 1/(n+1/m) to 1/(n+1/(m+1)), which gives -lg(1-1/(mn+1)(mn+m+n+2)). So dividing one by the other gives P(m|n) = lg(1-1/(mn+1)(mn+m+n+2))/lg(1-1/(n+1)^2), which is rather ugly, but the point is that it does depend on n.

This turns out to be an anticorrelation. High numbers are more likely to by followed by low numbers, and vice-versa. The probability of getting a 1 given you've just had a 1 is 36.6%, whereas if you've just had a very high number the probability of getting a 1 will be very close to 50% (since the distribution of the fractional part is tending to uniform).

There are two ways to calculate the amount of information in one term of a continued fraction:

These differ by about 0.0088 bits. It took me a while to figure out why they were different at all, and now I'm surprised by how close they are.

## Continued fractions

π is about 3.

Actually, 3 is a bit too small, by about 1/7.

Actually, 7 is a bit too small, by about 1/15.

Actually, 15 is

~~a bit~~a lot too small, by almost 1/1.But the 1 in the denominator is a tiny bit too small, by about 1/292.

We can keep doing this forever, and get an expression like:

π=3+17+115+11+1292+⋱

This is pretty useful for things like finding rational approximations. We can drop that 1/(292+...), since it's really small, and get the excellent approximation π≈355113. Or we can drop the 1/(15+...) and get the rougher approximation π≈227.

In general, the continued fraction for a real number r is of the form

r=a0+1a1+1a2+1a3+⋱

where the terms an are positive integers (except a0 which is an arbitrary integer).

This is infinite for any irrational number; for a rational number, it stops after some number of terms, e.g.

355113=3+17+115+11

Truncating a continued fraction

^{[1]}is theoptimalway to approximate a real number with a rational: it gives a better approximation than anything with a smaller denominator.Continued fractions show up every now and then in different ways, but the approximation property is usually how I end up bumping into them.

## The Gauss-Kuzmin distribution

Suppose we choose a random real number (say, uniformly between 0 and 1). The probability that the nth term is equal to k converges to p(k)=−log2(1−1(1+k)2) as n→∞.

^{[2]}This is the Gauss-Kuzmin distribution.I think it's also true that the fraction of k's in the continued fraction converges to p(k).

^{[3]}The entropy of this distribution is just ∑∞k=1−p(k)log2(p(k)),

^{[4]}which is pretty quick to compute out to a few digits of accuracy in your favorite programming language. It works out to about 3.4325 bits, so that's how much information there is in the term an, for large n.## Lochs's theorem

Lochs's theorem: For almost all real numbers, asymptotically, the accuracy of the continued fraction improves by a factor of eπ2/6ln2≈10.731 per term.

^{[5]}This is amusingly close to 10, which is how much the accuracy ofdecimalsimproves per digit.You get log210 bits of information from a digit of decimal, so this means that you get π26(ln2)2≈3.4237 bits

^{[6]}of information per term of a continued fraction, asymptotically.## The difference

So what gives? Why aren't those the same??

Hint:

It's

notabout how you take the limits to get these two things, as I initially thought. Eventually, the error in both numbers above will drop so far that 0.0088 bits looks enormous by comparison, so we can just think about the behavior after a bajillion terms and not worry about the fact that both theorems are asymptotic.The answer, unless I'm mistaken:

The Gauss-Kuzmin distribution is a

marginaldistribution, so its entropy is the entropy of thenth term if you don't know anything about previous terms.The Khinchin-Lévy constant is the entropy of the

conditionaldistribution; it measures the entropy of thenth term if you know every previous term.These are different because -- evidently -- different terms of the continued fraction are correlated, even for a generic real number, even in the limit of large

n!I implicitly assumed that the correlation would tend to zero, and barely noticed I was making this assumption.

It seems that if you know the first bajillion terms of a continued fraction, you actually have a tiny amount of information about the bajillion-plus-first term; less than a hundredth of a bit, but not zero.

I haven't tried to calculate the conditional distribution and confirm this yet, because it seems really hard to measure accurately. But I'm tempted, because I have no idea what this tiny correlation looks like. Is it just about consecutive terms, or longer-ranged than that? Is the size of consecutive terms correlated or anticorrelated? Is the information concentrated in one outcome -- "the next term will probably not be 283434" -- or spread out evenly across the whole distribution? Presumably there are results on this in the literature?

But I'm still confused about why there's only a 2% difference. Why should these numbers be so close, but still different?

If you have any intuition for this, please let me know.

^{^}If I'm remembering this right, you truncate by replacing a term an with either 1 or ∞. (In other words, by rounding 1/(an+…) down to zero or up to one.)

^{^}The logarithm has to be base 2 to make the probabilities add up to 1. Isn't that weird?

^{^}Formally: with probability 1, p(k) is the limit of the fraction of k's in the first N terms, as N→∞.

This part of the post originally said this was equivalent to the previous statement, but it's actually stronger.

^{^}Yes, there's gonna a double logarithm in there.

^{^}Just to throw another theorem on the pile: for the rational approximation you get by truncating after n terms, the nth root of the denominator approaches eπ2/12ln2 as n→∞. This is the square root of the number in Lochs's theorem and is called the Khinchin-Lévy constant, after Khinchin who proved it was a constant and Lévy who found its value using the Gauss-Kuzmin distribution.

Why the square root? I think because the accuracy of these approximations goes like the

squareof the denominator -- see also Dirichlet's approximation theorem.Lochs's theorem came late; I assume it relies on Lévy's work but I haven't seen the proof.

^{^}One factor of ln2 in the denominator is to make the unit "bits", and the other is from the exponent of the Khinchin-Lévy constant (and related to the Gauss-Kuzmin distribution having log2 in it, I believe).