Dealing with infinite entropy

Alex_Altair

A major concession of the introduction post was limiting our treatment to finite sets of states. These are easier to reason about, and the math is usually cleaner, so it's a good place to start when trying to define and understand a concept.

But infinities can be important. It seems quite plausible that our universe has infinitely many states, and it is frequently most convenient for even the simplest models to have continuous parameters. So if the concept of entropy as the number of bits you need to uniquely distinguish a state is going to be useful, we'd better figure out how to apply it to continuous state spaces.^[1]

For simplicity, we'll just consider the case where the state space is a single real-valued parameter in some interval. Because $x$ is real-valued, its set of possible values (the "states") is infinite. Extending the results to more complex sets of states is the domain of other fields of mathematics.

The problem

If you were to play the Guess Who? game with an increasing number of cards, you would need an increasing number of questions to nail down each card, regardless of what strategy you were using. The number of questions doesn't go up very fast; after all, each new question lets you double the number of cards you can distinguish, and exponents grow very fast. But, in the limit of infinite cards, you would still need infinitely many questions. So that means that if we use the yes/no questions method to assign binary string labels to the cards, we would end up with each card being represented by infinitely long strings, and therefore having infinite entropy. It ends up being equivalent to playing "guess my number" instead, where your friend could have chosen any real number between, say, 0 and 1.

To map the Guess Who? cards to a continuous parameter: 1) Line them up under the relevant interval of the $x$ -axis. 2) Come up with a set of yes/no questions that can uniquely identify each card. 3) Come up with a new question for which all the current cards would be a "no". 4) Double the number of cards and modify all the new cards to be a "yes" answer to the new question. 5) Iterating this process to infinity will eventually "fill up" the interval, and will consequently generate infinitely many bits per card.

This doesn't seem very useful. But it still feels like we should somehow be able to use the concept of entropy (that is, the concept of using bits to distinguish states). If your friend tends to choose numbers on the left way more often than the cards on the right, then it still seems like it should take fewer questions to "narrow in" on the chosen number, even if we can never guess it exactly.

One way we could still make use of it is to use finitely many bits to distinguish among sets of states (dropping the requirement to uniquely distinguish individual states). After all, in the real world, any measurement of a continuous parameter will return a number with only finite precision, meaning that we can only ever know that our state is within a set compatible with that measurement. That set will have infinitely many individual states, but in practice we only care about some finite measurement precision. This path of investigation will take us somewhere interesting.

A second way to handle the infinities would be to maintain unique binary strings for each state, and just see what happens to the equations as we increase the number of states to infinity. When the probability distribution is not uniform, is there a useful sense in which some of the infinities are "bigger" than others? The history of mathematics has a long tradition of making sense out of comparing infinities, and we can apply the tools that people have previously developed.

That said, really nailing these ideas down has considerable mathematical detail (and therefore considerable subtlety in exactly what they mean and what statements are true). This level of detail tends to make the conceptual content substantially less clear, so I've saved it for a future post. For now, I'd like to give you a conceptual grasp that is just formally grounded enough to be satisfying.

The analogy with probability distributions

To understand the first strategy, I think it's very useful to remember exactly how continuous probability distributions work, and then extend that mechanism to entropy.

When you first learn about probability, you typically start with discrete events. The coin flip lands on heads or tails. The die comes up on some integer between 1 and 6. There are simply finite sets of outcomes, each with a finite probability, and all the math is nice and clear. But pretty soon, you need to start being able to handle probabilities of events that are distributed smoothly over intervals. You're flipping a coin repeatedly, and trying to use the heads/tails ratio to figure out how fair it is. You throw a dart at a board and want to know the chance that you'll get within the bullseye.

But if you model the dart board the normal way, as a circle with a certain diameter, then it contains infinitely many points inside it, each of which has a 0% chance of being hit with the dart. But the dart will hit the board somewhere; what gives?

Mathematicians dealt with this by well-defining infinitesimal probabilities. They come up with a continuous function $p (x)$ over the range of all possible outcomes, and they say that $P (a < x < b) = \int_{a}^{b} p (x) d x$ , which is to say, the probability that the outcome $x$ is between $a$ and $b$ is the area under the curve (the "probability mass") between $a$ and $b$ . These objects must obey certain axioms, et cetera. If you'd like, you can consider $p (x)$ to be the limit of a sequence of discrete probability distributions $P_{n}$ .^[2]

The vertical red lines represent finite probability masses of the discrete distribution. As we increase the number of them, they converge to the continuous $p (x)$ . Note that the scale of $P_{n}$ is being changed on each iteration to visually align with $p (x)$ ; in fact the values of $P_{n} (i)$ are going to 0.

If you work with continuous probability distributions a lot, then they feel totally normal. But the whole idea behind them is quite a brilliant innovation! The fact that every specific value has infinitesimal probability can sometimes bother people when they're first learning about it, but ultimately people are content that it's well-defined via ranges. (Largely because, in practice, we can only take measurements with a certain range of precision.) You can also use it for comparisons; even though $P (x = a) = 0$ and $P (x = b) = 0$ , it's still meaningful to say that $\frac{P (x = a)}{P (x = b)} = \frac{p (a)}{p (b)}$ ,^[3] and since $p$ is always nice and finite, that ratio will also be a nice finite number. You can use $p$ to tell, for example, when $a$ is twice as likely as $b$ .

Differential entropy

In a similar way, we can construct "continuous entropy distributions", that is, a continuous function that tells you something about how you can compare the entropy at individual values. Let's say that our states, represented by values on the $x$ -axis, occur according to the (continuous) probability distribution $p (x)$ . We will assign binary strings to them in the following way.^[4] Say that we want to figure out the binary string that belongs to a specific value $x *$ . First, we find the value on the $x$ -axis that cuts the probability mass of $p$ in half; we'll call this value $x = a_{1}$ . If $x * < a_{1}$ , that is, if $x *$ is to the left of $a_{1}$ , then the first bit of $x *$ is 0. If $a_{1} < x *$ , then its first bit is a 1. Then we iterate. For whichever side $x *$ is on, we find the next value, $a_{2}$ , which cuts that probability mass in half again (now partitioning it into areas of 1/4). We do the same bit assignment based on whether $x *$ is less or more than $a_{2}$ . We can iterate this process for as long as we want, and continue getting more bits to assign to $x *$ . This is our scheme for using finitely many bits to distinguish among sets of states.

Here we see 5 iterations of the process described above. The resulting 5 bits lets us distinguish a subset of states with width $a_{5} - a_{3}$ , containing $x *$ .

For any two different specific values of $x$ , they will eventually be on different sides of some $a_{n}$ , and thus will eventually have different binary representations. And because the probability mass is smoothly distributed over the real line, every specific $x$ will require an infinitely long binary string to locate it perfectly. But to get within a small distance $ϵ$ of $x$ , we only need a finite number of bits. And furthermore, given a fixed $ϵ$ , the number of bits we need is different for different $x$ s. If $p (x)$ is large, then it will have a lot of probability mass in its immediate vicinity, and we will require less bits (=mass halvings) to get within $ϵ$ , and inversely if $p (x)$ is small, we will require more. Can we use this to get a finite comparative measure?

Of course, we can't really just pick a specific $ϵ$ like 0.0001 and then use the resulting number of bits for our comparison, because for any $ϵ$ we might pick, there exist perfectly well-behaved functions that happen to still have large fluctuations within $ϵ$ of $x$ . It wouldn't be fair to "credit" $x$ with those probability mass fluctuations. Instead, we want to get a measure that only relies on the value exactly at $x$ .

Let's continue using the analogy with probability distributions. If, for some specific distribution $p$ , we have that $p (1.48) = 0.25$ , then what does that mean? Well, one thing you could say that it means is the thing about ranges and limits that I described in the above section. But another, equivalent thing you could say is that it means that events in the region of $x = 1.48$ are as unlikely as any event in a uniform distribution $p_{u}$ that is equal to $p_{u} (x) = 0.25$ everywhere. Since the total area still must be 1, we can conclude that the domain of $p_{u}$ must be $\frac{1}{0.25} = 4$ units wide. This kind of comparison might give you an intuition for how unlikely the event $p (1.48)$ is, even if $p (x)$ is some complicated function overall; it's like picking a random real number between 0 and 4.

The value $x = 1.48$ is equally unlikely for both $p$ and $p_{u}$ .

Similarly, if $p (1.48) = 0.25$ , what does that say about how many bits we need to pinpoint $x = 1.48$ within a range of $ϵ$ ?

Well, in the limit of $ϵ$ going to zero, it's the same number as if the distribution was $p_{u} (x) = 0.25$ . Since that means that the probability mass would be spread over an interval 4 units long, we'd need to halve it twice to get to 1 unit around $x$ . If we had $p_{u} (x) = 0.5$ , then the interval would be 2 units long, and we'd need to halve it once to get within 1. At $p_{u} (x) = 1$ , it would be 1 unit long, and we'd need to halve it 0 times. At $p_{u} (x) = 2$ , it would be half a unit long, and we'd need to halve it $- 1$ times.

The generalization of those is $log \frac{1}{p (x)}$ times. And since $p_{u}$ is uniform, it's okay that we only narrow it down to a range of length 1; there are no fluctuations inside that range. It's the same as the limiting behavior of $p (x)$ , but now we can safely pick a fixed $ϵ$ , and we might as well pick $ϵ = 1$ .

Because it only pinpoints $x$ to within a range, this isn't the entropy of $x$ (in the same way that $p (x)$ is not the probability of $x$ ). So let's call it the "differential" entropy, and give it a lowercase letter^[5] (like how we sometimes use $p$ instead of $P$ );

h (x) = log \frac{1}{p (x)}

And then the expected value is just

⟨ h ⟩ = \int_{X} p (x) log \frac{1}{p (x)} d x

which could be considered the differential entropy of all of $p$ , rather than just of $x$ .

In this sense then, the differential entropy of a continuous-valued $x$ is the number of bits you need to specify a range of size 1 containing $x$ . To complete the analogy with probability distributions:

$P (x = a) = 0$ , but $P (x = a \pm ϵ) = p (a) \cdot ϵ$
$H (a) = \infty$ , but $H (a \pm ϵ) = log \frac{1}{P (x = a \pm ϵ)} = log \frac{1}{p (a) ϵ} = h (a) + log \frac{1}{ϵ}$ .

This quantity seems quite useful to me. It's conveniently not a function of anything other than the value of $p$ at $x$ . As a consequence though, it might not be quite the right comparison in some occasions. Let's say you have two distributions, $p_{1} (x)$ defined on $[0, 1]$ and $p_{2} (x)$ defined on $[0, 2]$ . Perhaps $p_{1}$ is uniform, but $p_{2}$ is non-uniform in a way that gives it exactly the same $⟨ h ⟩$ as $p_{1}$ . The value of $⟨ h ⟩$ is sort of mushing together these two factors, and cancelling them out in a certain way. Perhaps we'd like to better understand their individual contributions.

Looking at the naive limits

Now let's consider a second strategy for dealing with the entropy of infinite sets. What does happen if we just plow ahead and take the limit as the set size goes to infinity? The detailed post contains the derivations; here I'll just give you the setup and the conclusions.

What we want is to extend the notion of entropy to a continuous probability distribution $p$ defined over an infinite set of $x$ s. So, as mentioned above, we'll say that $p$ is defined as the limit of discrete probability distributions $P_{n} (i)$ , where the $i$ s pack into the same range that the $x$ s run over. Since our entropy is already defined for discrete distributions, we can simply consider the limit of the entropy of those distributions.

The notation here will be dissatisfying. These derivations have a ton of moving parts, and devoting notation for each part leads to a panoply of symbols. So consider all these equations to be glosses.

For the entropy of a single state, the limit looks like this;

H (x) = lim n \to \infty log \frac{1}{P_{n} (i_{x})}

We've switched to a capital $H$ because this is a proper entropy, and not a differential entropy. $P_{n}$ is the $n$ th discrete distribution, so it changes at each step in the limit. There's a lot of machinery required to specify how the limit works, but conceptually, what matters is that $P_{n}$ converges to $p$ as discussed above. That means that the number of states that $P_{n}$ assigns a chunk of probability mass to must increase, and that said states must eventually "fill in" the part of the $x$ -axis that $p$ is defined over. For any given $P_{n}$ , we don't need it to put one of its mass chunks exactly on our specific $x$ , so the $i_{x}$ is intending to connote that we're talking about whatever chunk is nearest to our specific $x$ .

Because it's discrete, each value of $P_{n}$ is finite. But since the total probability mass is getting divvied up between more and more states, each value gets smaller, and eventually goes to zero. Then our limit is effectively $log \frac{1}{0} = log \infty = \infty$ , which tells us what we knew in the beginning; the number of bits needed to pinpoint one state is infinite. The expected value also diverges;

H (p) = lim n \to \infty n \sum i = 0 P_{n} (i) log \frac{1}{P_{n} (i)} = \infty .

(Although that's less clear from just looking at it: derivation in the longer post).

However, once you pick a specific sequence of $P_{n}$ s and a specific strategy for placing the $x_{i}$ s, then the limit can always be naturally manipulated into this form:

H = h + M + lim n \to \infty log n .

Each of these terms has a semantic interpretation;

$h$ is the differential entropy we discussed above, and so we've already given it a meaning! Also, it's only a function of $x$ , and it's finite.^[6]
$M$ is a term that depends on how exactly you take the limit (but not on $p$ ^[6]) and is also finite.
and ${lim}_{n \to \infty} log n$ is exactly what it looks like; a diverging limit that depends only on $n$ .

I find this decomposition to be particularly elegant and satisfying!

Here's what really fascinates me about it; since $h$ and $M$ are both finite, all of the divergence is contained in the last term, which is independent of both $p$ and the details of our limit-taking!

This makes it really easy to compare the entropy of two continuous distributions; you can just look at $h$ and $M$ !

H_{1} - H_{2} = h_{1} - h_{2} + M_{1} - M_{2}

Simply evaluating $H_{1} - H_{2}$ ^[7] will tell you how many more bits it will take to uniquely distinguish a value from the first distribution compared to the second distribution, even though it still takes infinite bits to represent each individual $x$ .

And if you take the limit in the same way, then the $M$ s cancel out as well. Looking back at our definition of $h$ , if you're concerned with individual values of $x$ , then

H_{1} (x) - H_{2} (x) = h_{1} (x) - h_{2} (x) = log \frac{1}{p_{1} (x)} - log \frac{1}{p_{2} (x)} = log \frac{p_{2} (x)}{p_{1} (x)} .

And if you're looking at expected values, then – well, then you encounter an interesting subtlety, which I will save for the detailed post (or perhaps as an exercise for the reader).

Conclusion

Now that we've reached the end, it's worth reflecting on what this all means. So we've found a way to finitely compare states with infinite entropies; does that really make it a valid model of the real world?

Here's how I see it. Imagine a world where all the people live in the tops of skyscrapers. They've never seen the ground. They don't know that there's no ground, that the skyscrapers have "infinite" height. But as many stories as anyone has ever gone down, they still look out the window and see the walls receding below them.

So it at least seems plausible that the skyscrapers are all infinitely tall. Despite that, they still clearly have their top floors at different heights! If you're on the roof of one skyscraper, you can see that some others' roofs are lower, and some are higher. Even though they don't know where the bottom is, it's still valid to ask how much taller one building is than another, or how many stories you have to climb to get to another floor. The people can agree to define a particular height as the zeroth story, and number all the other stories with respect to that one.

In our world, we can tell that physical properties can differ across extremely small distances. We can measure down to a certain precision, but who could ever say when we hit the "bottom"? And yet, we can still ask whether two states differ within a given precision, or how many bits it would take to describe a state to within a given precision. We can agree to define a given distance as the unit meter, and measure things with respect to that.

Their skyscraper stories are the exponent of our meters. Their stories are our bits of information.

^{^}
There is a way in which you can have an infinite set of states, each of which has finite entropy; after all, there are infinitely many binary strings of finite length. This, however, is a countable infinity. In this post we will be talking about intervals of the real numbers,^[8] which means we're talking about uncountably infinitely many states. There are uncountably many binary strings of (countably) infinite length, which is why this works at all. (Technically, given an uncountable set, you could go ahead and assign all the finite strings to a countable subset, which would mean that some of the states would get finite entropy; but that's not very useful, and would violate some of our other premises.)
^{^}
This animation represents one possible way to define a set of $P_{n}$ which converge to $p$ , where the point masses are equally spaced with different mass. Another option would be to have them equally massed at $P_{n} (i) = \frac{1}{n}$ , but denser where $p$ is greater. Intermediates are also possible.
^{^}
This works out because, for sufficiently well-behaved $p$ , we could say that
$\frac{P (x = a)}{P (x = b)} := lim ϵ \to 0 \frac{P (a - ϵ < x < a + ϵ)}{P (b - ϵ < x < b + ϵ)} .$
By the definition of how $p$ is constructed, this is equal to
$= lim ϵ \to 0 \frac{\int_{a - ϵ}^{a + ϵ} p (x) d x}{\int_{b - ϵ}^{b + ϵ} p (x) d x} .$
And because of how integrals work, in the limit this simplifies to
$= lim ϵ \to 0 \frac{2 ϵ \cdot p (a)}{2 ϵ \cdot p (b)} = lim ϵ \to 0 \frac{p (a)}{p (b)} = \frac{p (a)}{p (b)} .$
^{^}
This is, of course, only one possible way to use bits to distinguish states. But it's a way that I think is easier to follow, and, assuming that the "true" distribution of $x$ is $p$ , this will also be the bit assignment that minimizes the entropy.
^{^}
The letter $h$ or $H$ is commonly used for entropy in information theory, whereas $S$ is more common in physics.
^{^}
A couple caveats; if you're using the expected value of $H$ , then $h$ and $M$ are of course also a function of $p$ . And there do exist pathological $p$ s where $h$ is infinite.
^{^}
In a similar fashion as footnote 3, what we're actually doing here is an outside limit. Instead of this,
$H_{1} - H_{2} = (lim n \to \infty log \frac{1}{P_{n 1} (i_{x})}) - (lim n \to \infty log \frac{1}{P_{n 2} (i_{x})})$
we mean this;
$H_{1} - H_{2} := lim n \to \infty (log \frac{1}{P_{n 1} (i_{x})} - log \frac{1}{P_{n 2} (i_{x})})$
This latter expression is what can be formally manipulated into $h_{1} - h_{2} + M_{1} - M_{2}$ . This is essentially saying that as long as we take the limit in the same way for both probability distributions, then the infinities cancel out.
^{^}
I can't resist adding another mathematical disclaimer here. The purpose of this post is to allow us to apply entropy to realistic models of the physical world. Typically, people use the real numbers for this. But another option could be to use sets that are simply dense, meaning that all points have other points arbitrarily close to them. This set could be countable (like the rationals), which means we could use only finite binary strings for them. However, if you try to stuff all the finite binary strings into a dense interval like this, then you lose continuity; if some $x$ is given a string of length 6, zooming in closer to the $ϵ$ -neighborhood around $x$ will reveal states with arbitrarily long strings, rather than strings whose length gets closer to 6. (This is similar to how, the closer you zoom into $x = \frac{1}{2}$ , the larger the denominators of nearby rationals will be.) Realistic models of the world will assume continuity, ruling out this whole scheme.

76