
There are two ways to calculate the amount of information in one term of a continued fraction:

• The entropy of the Gauss-Kuzmin distribution is about 3.4325 bits.
• Twice the base-2 logarithm of the Khinchin-Lévy constant is about 3.4237 bits.

These differ by about 0.0088 bits. It took me a while to figure out why they were different at all, and now I'm surprised by how close they are.

# Continued fractions

$\pi$ is about 3. Actually, 3 is a bit too small, by about 1/7.

Actually, 7 is a bit too small, by about 1/15.

Actually, 15 is a lot too small, by almost 1/1.

But the 1 in the denominator is a tiny bit too small, by about 1/292.

We can keep doing this forever, and get an expression like:

$$\pi = 3 + \cfrac{1}{7 + \cfrac{1}{15 + \cfrac{1}{1 + \cfrac{1}{292 + \cdots}}}}$$
This is pretty useful for things like finding rational approximations. We can drop that $1/(292+\cdots)$, since it's really small, and get the excellent approximation $\frac{355}{113}$. Or we can drop the $1/(15+\cdots)$ and get the rougher approximation $\frac{22}{7}$.
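As a quick illustration (my own sketch, not from the post), the truncated approximations can be read off mechanically with exact fractions; note the float-based term extraction is only trustworthy for the first dozen or so terms:

```python
import math
from fractions import Fraction

def convergents(x, n):
    """Truncations of the continued fraction of x, as exact fractions."""
    terms = []
    for _ in range(n):
        a = math.floor(x)
        terms.append(a)
        if x == a:
            break
        x = 1 / (x - a)  # float arithmetic: fine for the first ~dozen terms
    results = []
    for i in range(1, len(terms) + 1):
        # Fold the first i terms back up into a single fraction.
        value = Fraction(terms[i - 1])
        for a in reversed(terms[: i - 1]):
            value = a + 1 / value
        results.append(value)
    return results

print([str(c) for c in convergents(math.pi, 4)])
# ['3', '22/7', '333/106', '355/113']
```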

In general, the continued fraction for a real number $x$ is of the form

$$x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cdots}}}$$

where the terms $a_1, a_2, a_3, \ldots$ are positive integers (except $a_0$, which is an arbitrary integer).

This is infinite for any irrational number; for a rational number, it stops after some number of terms, e.g.

$$\frac{355}{113} = 3 + \cfrac{1}{7 + \cfrac{1}{16}}$$
Truncating a continued fraction is the optimal way to approximate a real number with a rational: it gives a better approximation than anything with a smaller denominator.
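For instance, this claim can be brute-forced for $\pi$'s approximation 355/113 (a quick check of my own, not from the post):

```python
import math

# No fraction with denominator smaller than 113 gets closer to pi
# than the convergent 355/113 does.
target = abs(math.pi - 355 / 113)
best_smaller = min(
    abs(math.pi - round(q * math.pi) / q)  # closest numerator for each q
    for q in range(1, 113)
)
print(best_smaller > target)  # True
```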

Continued fractions show up every now and then in different ways, but the approximation property is usually how I end up bumping into them.

# The Gauss-Kuzmin distribution

Suppose we choose a random real number (say, uniformly between 0 and 1). The probability that the $n$th term $a_n$ is equal to $k$ converges to $p(k) = -\log_2\left(1 - \frac{1}{(k+1)^2}\right)$ as $n \to \infty$. This is the Gauss-Kuzmin distribution.

I think it's also true that the fraction of $k$'s in the continued fraction converges to $p(k)$.

The entropy of this distribution is just $-\sum_k p(k) \log_2 p(k)$, which is pretty quick to compute out to a few digits of accuracy in your favorite programming language. It works out to about $3.4325$ bits, so that's how much information there is in the term $a_n$, for large $n$.
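Here's a quick sketch of that computation in Python (the cutoff is my choice; the tail of the sum decays slowly, roughly like $\log k / k^2$, so it matters for the last digit):

```python
import math

def p(k):
    """Gauss-Kuzmin probability that a continued-fraction term equals k."""
    return -math.log2(1 - 1 / (k + 1) ** 2)

# Entropy -sum_k p(k) log2 p(k), truncated at a large cutoff.
H = -sum(p(k) * math.log2(p(k)) for k in range(1, 10**6))
print(round(H, 4))  # ≈ 3.4325
```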

# Lochs's theorem

Lochs's theorem: For almost all real numbers, asymptotically, the accuracy of the continued fraction improves by a factor of $e^{\pi^2/(6 \ln 2)} \approx 10.73$ per term. This is amusingly close to 10, which is how much the accuracy of decimals improves per digit.

You get $\log_2 10 \approx 3.32$ bits of information from a digit of decimal, so this means that you get $\log_2\left(e^{\pi^2/(6 \ln 2)}\right) = \frac{\pi^2}{6 (\ln 2)^2} \approx 3.4237$ bits of information per term of a continued fraction, asymptotically.
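Both numbers come from closed forms -- $\pi^2/(6(\ln 2)^2)$ bits per term, and an accuracy factor of $e^{\pi^2/(6 \ln 2)}$ per term -- so they're quick to check:

```python
import math

# Bits per term, via the Khinchin-Lévy / Lochs constant
levy_bits = math.pi**2 / (6 * math.log(2) ** 2)

# Accuracy improvement factor per term, from Lochs's theorem
lochs_factor = math.exp(math.pi**2 / (6 * math.log(2)))

print(round(levy_bits, 4))     # 3.4237
print(round(lochs_factor, 2))  # 10.73
```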

# The difference

So what gives? Why aren't those the same??

Hint:

It's not about how you take the limits to get these two things, as I initially thought. Eventually, the error in both numbers above will drop so far that 0.0088 bits looks enormous by comparison, so we can just think about the behavior after a bajillion terms and not worry about the fact that both theorems are asymptotic.

The Gauss-Kuzmin distribution is a marginal distribution, so its entropy is the entropy of the nth term if you don't know anything about previous terms.

Twice the logarithm of the Khinchin-Lévy constant is the entropy of the conditional distribution; it measures the entropy of the nth term if you know every previous term.

These are different because -- evidently -- different terms of the continued fraction are correlated, even for a generic real number, even in the limit of large n!

I implicitly assumed that the correlation would tend to zero, and barely noticed I was making this assumption.

It seems that if you know the first bajillion terms of a continued fraction, you actually have a tiny amount of information about the bajillion-plus-first term; less than a hundredth of a bit, but not zero.

I haven't tried to calculate the conditional distribution and confirm this yet, because it seems really hard to measure accurately. But I'm tempted, because I have no idea what this tiny correlation looks like. Is it just about consecutive terms, or longer-ranged than that? Is the size of consecutive terms correlated or anticorrelated? Is the information concentrated in one outcome -- "the next term will probably not be 283434" -- or spread out evenly across the whole distribution? Presumably there are results on this in the literature?
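For what it's worth, here is a rough Monte Carlo sketch of one such experiment (entirely my own, not from the post): it only probes consecutive-pair dependence, with terms capped at a small value so the statistics stay dense.

```python
import math
import random

random.seed(0)

CAP = 5   # lump all terms >= 5 into one bucket to keep counts dense
SKIP = 4  # ignore early terms, since Gauss-Kuzmin statistics are asymptotic
pair, left, right = {}, {}, {}
total = 0
for _ in range(100_000):
    x = random.random()
    terms = []
    for _ in range(12):  # float precision degrades after a dozen-odd terms
        if x == 0:
            break
        x = 1 / x
        a = int(x)
        x -= a
        terms.append(min(a, CAP))
    for a, b in zip(terms[SKIP:-1], terms[SKIP + 1:]):
        pair[(a, b)] = pair.get((a, b), 0) + 1
        left[a] = left.get(a, 0) + 1
        right[b] = right.get(b, 0) + 1
        total += 1

# Plug-in estimate of the mutual information between consecutive terms
mi = sum(
    (c / total) * math.log2(c * total / (left[a] * right[b]))
    for (a, b), c in pair.items()
)
print(f"{mi:.4f} bits")  # tiny, but apparently nonzero
```

This only bounds the consecutive-lag dependence from below; longer-range correlations, if any, wouldn't show up here at all.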

But I'm still confused about why there's only a 0.3% difference. Why should these numbers be so close, but still different?

If you have any intuition for this, please let me know.

1. ^

If I'm remembering this right, you truncate by replacing a term $a_n$ with either $a_n$ or $a_n + 1$ and dropping the rest. (In other words, by rounding the remaining $1/(a_{n+1} + \cdots)$ down to zero or up to one.)

2. ^

The logarithm has to be base 2 to make the probabilities add up to 1. Isn't that weird?

3. ^

Formally: with probability 1, $p(k)$ is the limit of the fraction of $k$'s in the first $N$ terms, as $N \to \infty$.

This part of the post originally said this was equivalent to the previous statement, but it's actually stronger.

4. ^

Yes, there's gonna be a double logarithm in there.

5. ^

Just to throw another theorem on the pile: for the rational approximation you get by truncating after $n$ terms, the $n$th root of the denominator approaches $e^{\pi^2/(12 \ln 2)} \approx 3.2758$ as $n \to \infty$. This is the square root of the number in Lochs's theorem and is called the Khinchin-Lévy constant, after Khinchin who proved it was a constant and Lévy who found its value using the Gauss-Kuzmin distribution.

Why the square root? I think because the accuracy of these approximations goes like the square of the denominator -- see also Dirichlet's approximation theorem.

Lochs's theorem came late; I assume it relies on Lévy's work but I haven't seen the proof.
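A quick Monte Carlo check of that limit (my own sketch: random 2000-bit rationals stand in for random reals, and their continued-fraction terms come from the Euclidean algorithm, which sidesteps float precision entirely):

```python
import math
import random

random.seed(1)

def cf_terms(p, q, n):
    """First n continued-fraction terms of p/q, via the Euclidean algorithm."""
    terms = []
    while q and len(terms) < n:
        a, r = divmod(p, q)
        terms.append(a)
        p, q = q, r
    return terms

N = 200  # terms per sample; a 2000-bit rational has plenty to spare
roots = []
for _ in range(100):
    numerator = random.getrandbits(2000)
    terms = cf_terms(numerator, 2**2000, N + 1)[1:]  # drop the integer part
    # Denominator recurrence: q_n = a_n * q_{n-1} + q_{n-2}
    q_prev, q_cur = 1, terms[0]
    for a in terms[1:]:
        q_prev, q_cur = q_cur, a * q_cur + q_prev
    roots.append(math.exp(math.log(q_cur) / N))

avg = sum(roots) / len(roots)
print(round(avg, 2))  # close to 3.2758
```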

6. ^

One factor of $\ln 2$ in the denominator is to make the unit "bits", and the other is from the exponent of the Khinchin-Lévy constant (and related to the Gauss-Kuzmin distribution having $\log_2$ in it, I believe).
