This post is about one of the results described in the 2004 paper 'Information-theoretic approach to the study of control systems' by Hugo Touchette and Seth Lloyd.[1] The paper compares 'open-loop' and 'closed-loop' controllers (which we here call 'blind' and 'sighted' policies) for the task of reducing entropy and quantifies the difference between them using the mutual information between the controller and the environment. This post is a pedagogical guide to this result and includes discussion of it in light of the selection theorems agenda and the agent structure problem.[2] The proof, along with more worked-out examples, can be found in the next post.
A core feature of agents which makes them special is that they model the environment for the purpose of bringing about a particular outcome. The agent structure problem asks about the inverse: if we observe that something brings about a particular outcome, can we say that it must be modelling the environment?[3] A while back, John Wentworth asked a related question: "How Many Bits Of Optimization Can One Bit Of Observation Unlock?"[4] The Touchette-Lloyd theorem is one of the few results in the existing literature to touch on this problem.[5] We think it elegantly conveys something fundamental, and want more people to know about it!
Here's how the setup of the Touchette-Lloyd theorem implements each part of the above bold sentence. The environment is a random variable $X$ which goes through one time-step of change, into random variable $X'$. The external thing interacting with the environment is another random variable $A$.[6] We'll use the word 'policy' to describe the way in which $A$ responds to particular values of $X$, as characterised by the probability distribution $P(A|X)$. The exact way that $X$ and $A$ determine $X'$ we'll call the environment's dynamics. "Bringing about a particular outcome" is simplified to a reduction in the (Shannon) entropy between $X$ and $X'$. And finally, "modelling the environment" is formalized as $A$ having mutual information with $X$.
Unfortunately, the naive conjecture of "entropy reduction implies mutual information" is false. It is easy to define a particular setup[7] where the policy $A$ has no mutual information with $X$, and the environmental dynamics cause an entropy reduction on their own.
We will use the term 'blind policy' to refer to policies where $A$ and $X$ have no mutual information, and the term 'sighted policy' to refer to those where $A$ and $X$ have positive mutual information. These two families of policies can be represented by the Bayes nets in the figure below.
However, for any given dynamics, if a policy achieves an entropy reduction greater than any blind policy could, then the size of that additional entropy reduction is bounded by the mutual information the policy has with the environment. In other words, for each bit of entropy reduction (beyond what the environment does on its own) the policy must use an additional bit of information from the state $X$. This is what the Touchette-Lloyd theorem says. To put this claim more formally, the actual inequality is as follows:

$$\Delta H \;\le\; \Delta H^{\max}_{\text{blind}} + I(X;A)$$

In this inequality $\Delta H = H(X) - H(X')$ is the entropy reduction achieved by a specific policy. $\Delta H^{\max}_{\text{blind}}$ is how much entropy reduction the environment can do on its own (that is, the most that any blind policy can achieve). And $I(X;A)$ is the standard mutual information between the action taken and the initial environmental state.
In this post, we will explain some of the concepts behind this theorem and introduce a few suggestive examples. Then, in the next post, we will step through the proof of the theorem and numerical examples that elucidate edge cases.
Though the agent structure problem is about optimization, it seems clear that there is some strong connection between reducing entropy and optimization.[8] For example, as John Wentworth has explained, expected utility maximization can be decomposed into entropy reduction and shifting the true distribution of an environment to look more like the desired distribution.[9]
It seems reasonably likely that mutual information between a policy and its environment is a property which is necessary (but not sufficient) for the policy to have a 'world model', whatever that is exactly. Since the Touchette-Lloyd theorem tells us that (under some circumstances) entropy reduction implies mutual information, it is a step along the way to understanding exactly how and when optimization implies a world model (and thus towards agent structure).
Consider someone attempting to block out some sounds from coming into their ears. Translating this to the Bayes net above, the source sound is $X$, and the sound that reaches their ears is $X'$. Anything that happens in between, like the sound bouncing off the walls, is part of the dynamics of the system. Attempted interventions are the policy $A$. If our goal is to absolutely minimize the sound, then that's analogous to trying to minimize the entropy of $X'$. You can do this pretty well by sticking plugs in your ears. The plugs will have no "mutual information" with the source sound $X$, and they don't need to do any modelling of it. The ear plugs are like a policy that can do a good job at reducing the noise despite being blind (or rather, deaf).
But maybe we'd like to be more clever about reducing the noise. Perhaps some frequencies travel well through ear plugs, or you find them uncomfortable. And perhaps you'd ideally want to let some sounds through like voices, while blocking all others.
Active noise-cancelling headphones can perform these functions. While their physical structure does some passive dampening, what makes them active is that they "listen" to the incoming source sound, and then produce an inverted sound wave that cancels out the incoming sound. Doing this gives the internal state of the headphones high "mutual information" with the source sound. And as a consequence, they achieve a much better entropy reduction than what you get if, for example, you switch the noise cancelling feature off. They can also use this modelling to selectively pass desirable sounds, like voices. Though this is not a minimization of entropy, it is still a reduction in entropy. And of course, you may be playing music from the headphones which is not coming from the outside. This is introducing an entropy-increasing dynamic.
Because the theorem gives a bound on entropy reduction (rather than a statement about an optimum, like the good regulator theorem) it has something to say about all these scenarios. You can read through our numerically worked-out examples of each type of dynamics in the next post.
As mentioned above, we will represent the initial environment state as a random variable $X$, the 'action' taken by $A$, and the final state of the environment by $X'$. As is conventional for probability notation, uppercase letters represent the random variables and corresponding lowercase letters represent specific values the variables could take.
We will use the term 'policy' to describe how $A$ is chosen in relation to $X$, as captured by the conditional probability distribution $P(A|X)$. We will compare two broad families of policies: 'blind' policies and 'sighted' policies. For a blind policy, the value taken by $A$ will not depend on the environmental state $X$, meaning that the probabilities of $A$ and $X$ are independent:

$$P(A|X) = P(A)$$
The category of blind policies includes deterministic policies (such as 'pick action $a_1$ every time') and random policies (such as 'toss a coin; if heads, pick action $a_1$, if tails, pick action $a_2$'), provided that these policies do not require taking in information about $X$. But for a sighted policy, the value of $A$ can depend on the value of $X$, and must be represented as $P(A|X)$ without simplification.
Since both $X$ and $X'$ represent states of the environment, they are typically thought of as having the same domain (i.e. the same set of possible values). However, for clarity, we will still use $x'$ to label output environmental values and $x$ to indicate input environmental values.
The final state of the environment $X'$ will be determined by the initial state of the environment $X$ and the action variable $A$. To keep things general we will have $X'$ depend in a probabilistic way on both $X$ and $A$, as represented by the conditional distribution $P(X'|X,A)$, which we will refer to as the dynamics of the system.
If $X'$ is a deterministic function of $X$ and $A$ then all of the individual probabilities $P(x'|x,a)$ will either be 1 or 0.
We will assess policies based on their ability to minimize the final state entropy:

$$H(X') = -\sum_{x'} P(X'=x') \log_2 P(X'=x')$$
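To make this setup concrete, here is a minimal sketch (our own illustration, with our own variable names, not notation from the paper) of how the final-state entropy can be computed when $P(X)$, the policy $P(A|X)$, and the dynamics $P(X'|X,A)$ are stored as arrays:

```python
# A minimal sketch (our own notation): given an initial distribution p(x), a
# policy p(a|x), and dynamics p(x'|x,a) stored as arrays, compute H(X').
import numpy as np

def final_entropy(p_x, policy, dynamics):
    """p_x: shape (nx,). policy: shape (nx, na), rows are p(a|x).
    dynamics: shape (nx, na, nx2), entries are p(x'|x,a)."""
    p_xa = p_x[:, None] * policy                      # joint p(x, a)
    p_x2 = np.einsum('xa,xay->y', p_xa, dynamics)     # marginal p(x')
    p = p_x2[p_x2 > 0]                                # drop zeros before taking logs
    return -np.sum(p * np.log2(p))

# Example: two states, one action, dynamics that copy X to X' unchanged.
p_x = np.array([0.5, 0.5])
policy = np.array([[1.0], [1.0]])                     # a trivial blind policy
dynamics = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])     # x' = x
print(final_entropy(p_x, policy, dynamics))           # 1.0 bit, unchanged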
Now we'll walk through a concrete example using this formal setting.
Imagine playing the following game. There is a computer which will randomly and secretly pick a binary string from the uniform distribution over all 5-bit strings. Without knowing which string the computer has picked, you are then required to pick your own 5-bit string and enter it into the computer. The computer then takes your string, along with its own randomly generated string, and uses them as inputs for a function $f$ (which takes two arguments and outputs a single string).
Depending on what process you use to pick your string and the function the computer uses, the probability distribution over the function's outputs is then determined. From this, the entropy of the output distribution is calculated:[10]

$$H(X') = -\sum_{x'} P(X'=x') \log_2 P(X'=x')$$
The aim of the game is to choose your strings so as to minimize this output entropy. If you know what function the computer will use, how should you pick your string?
Let us consider the following function (writing $x$ for the string the computer picked and $a$ for the string you entered):

$$f(x, a) = \begin{cases} 00000 & \text{if } a = x \\ x & \text{otherwise} \end{cases}$$
In words: if the string you entered was the same as the string that the computer picked, it will output the all zeros string '00000'. Otherwise, it will output the string which it initially picked.
How should we choose our string so as to minimize the entropy of the output distribution? After a bit of thought it becomes clear that the best strategy is to pick any string except the all-zeros string. Most of the time, the computer will output whatever it randomly generated. If you pick the all-zeros string '00000', then it will always output whatever it randomly generated, meaning the output distribution will be uniform (which has maximum entropy). But if you choose any other value, then in the (unlikely but possible) case that your choice matches the computer's choice, it will output the all-zeros string instead. This increases the overall probability of the computer outputting all zeros and decreases the probability of the string you picked. This makes the output distribution slightly non-uniform, lowering its entropy.
Since the situation is symmetric with respect to all strings except 00000, we can assume you pick a fixed string. For simplicity let's assume you pick the 11111 string. Employing this strategy will give an output entropy of ~4.94 bits, which is slightly better than 5 bits, the maximum entropy possible.[11]
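If you want to check this number yourself, here is a minimal sketch (our own illustration, not code from the post) that reproduces it by brute force over all 32 strings:

```python
# Sketch: output entropy for the blind policy 'always guess 11111'.
from collections import Counter
from math import log2

strings = [format(i, '05b') for i in range(32)]      # all 5-bit strings, p(x) = 1/32

def f(x, a):                                         # the game's function
    return '00000' if x == a else x

guess = '11111'                                      # the blind policy's fixed choice
counts = Counter(f(x, guess) for x in strings)
entropy = -sum((c / 32) * log2(c / 32) for c in counts.values())
print(round(entropy, 2))                             # 4.94
```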
It is fairly straightforward to extend the analysis to cases where the initial distribution over strings is non-uniform. In these cases, the best strategy is to choose the highest-probability string which is not the all-zeros string.
However, it is clear that having to choose your strings 'blind' (ie. knowing nothing about what string the computer has chosen) prevents you from making the output entropy radically lower than the initial entropy.
If you were able to see the full string the computer had chosen before you had to choose your string, you could simply pick your string to perfectly match it, resulting in a zero entropy output distribution.
Suppose instead you are not given the full string, but you are given a small amount of information about the string chosen by the computer. For example, if we were given only the first bit of the computer's string, this would halve the number of possible strings it could be and double our chance of guessing correctly (for a uniform distribution). We could take advantage of this by guessing the string 11111 if the first bit of the computer's string is 1 and 01111 if it is 0. This strategy would give you a probability of $\frac{1}{16}$ of guessing the correct string and, if applied, would give an output entropy of 4.85 bits.
We can improve the strategy further if we are permitted to view more bits. If we are given the first two bits of the computer's string, we can adopt the strategy where we guess the string $b_1 b_2 111$ (where '$b_1 b_2$' represents the first two bits which we have seen). Our guesses will become even more accurate and we can achieve an output entropy of 4.63 bits.
As we increase the number of bits we can see, this pattern continues, all the way down to zero entropy when we know all five bits of the string.
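The whole pattern can be checked with a short script (again our own sketch; we extrapolate the strategies above so that the $k$-bit policy guesses the $k$ observed bits followed by ones):

```python
# Sketch: output entropy as the policy observes k = 0..5 bits of the computer's string.
from collections import Counter
from math import log2

strings = [format(i, '05b') for i in range(32)]
f = lambda x, a: '00000' if x == a else x            # the game's function

for k in range(6):
    # The k-bit policy guesses the observed prefix of x followed by 1s.
    counts = Counter(f(x, x[:k] + '1' * (5 - k)) for x in strings)
    h = sum((c / 32) * log2(32 / c) for c in counts.values())   # = -Σ p·log2(p)
    print(k, round(h, 3))
# output entropy falls from ~4.94 bits at k = 0 down to 0 bits at k = 5
```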
Clearly, knowing more bits allows you to do better. The difference between the best 'blind' strategy (which knew 0 bits of the computer's string) and the optimal strategy knowing the computer's full string is 4.94 bits of entropy. Each bit of information buys us some extra entropy reduction.
Recall that we restricted the form of the function $f$ and reserved the right to change it later. How much of our above analysis of this problem depends on this choice of $f$? Can we still make similar claims when we change $f$? What if, instead of a deterministic function, $X$ and $A$ influenced $X'$ through nondeterministic dynamics? Similarly, we assumed an initial distribution which was uniform over a support of binary strings of length 5. How much of our analysis depended on this assumption? How much extra optimization can one extra bit of mutual information buy us?
To investigate these questions, let's first generalize the concepts of 'blind' and 'sighted' policies.
We introduced blind policies as policies where the action taken $A$ is independent from the environmental state $X$. By enforcing this independence, we are ensuring that the policy cannot be using any information about the environment to choose its action. Conversely, a sighted policy is one where $A$ and $X$ are not independent.
However, this doesn't tell us how much information a sighted policy is using. Independence is a binary property. To measure this quantity, we can use the mutual information between the two random variables, denoted by $I(X;A)$.
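For reference, the standard definition (with the base-2 logarithms we use throughout, so the answer is in bits) is

$$I(X;A) \;=\; \sum_{x,a} P(x,a)\,\log_2\frac{P(x,a)}{P(x)P(a)} \;=\; H(X) - H(X|A) \;=\; H(A) - H(A|X).$$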
Now we can define a blind policy as one where the mutual information between $X$ and $A$ is zero. Knowing what action a blind policy took gives you no information about what the environmental state was (and vice versa). This is mathematically equivalent to our previous definition of a blind policy as being one where the action taken is independent of the environmental state. That is,

$$I(X;A) = 0 \quad \text{if and only if} \quad P(A|X) = P(A)$$
Conversely, a sighted policy is one where the mutual information between $X$ and $A$ is non-zero. If $A$ is the action variable for a sighted policy then:

$$I(X;A) > 0$$
The difference is that now we have a measure of how 'sighted' the policy is.
For example, in the previous section the environment variable was described by a 5-bit string and we considered strategies with 0, 1, 2, 3, 4 and 5 bits of information about the environmental state. Imagine that you are observing the actions taken by a policy playing the game described in the first section. In this case the 'environmental variable' is the 5-bit string randomly chosen by the computer. Suppose you know that the policy has access to the first 2 bits of the environmental string and will be taking the action of choosing the string $b_1 b_2 111$, where '$b_1 b_2$' indicates the first two bits of the environmental string which are known to the policy. If you observe in one instance that the string chosen by the policy is '10111' but you do not know the environmental string, you can use this information to work out that the first two digits of the environmental string must be '10'. While the unconditional entropy of the environmental string is 5 bits ($H(X) = 5$), knowing the action taken by the policy reduces this uncertainty to 3 bits ($H(X|A) = 3$). Since mutual information is given by $I(X;A) = H(X) - H(X|A)$, there are 2 bits of mutual information shared between the action and the environment. In this way mutual information matches our intuitive notion of what it means for a policy to 'know' some number of bits about the environmental state.
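Here is a minimal sketch (our own illustration) of that calculation for the 2-bit policy, computing $I(X;A) = H(X) - H(X|A)$ directly:

```python
# Sketch: mutual information between the 2-bit policy's action and the environment.
from collections import defaultdict
from math import log2

strings = [format(i, '05b') for i in range(32)]      # uniform p(x) = 1/32
policy = lambda x: x[:2] + '111'                     # deterministic 2-bit policy

H_X = log2(32)                                       # H(X) = 5 bits

# Group environment strings by the action they produce; since p(x) is uniform,
# H(X|A) is the probability-weighted log-size of each group.
groups = defaultdict(list)
for x in strings:
    groups[policy(x)].append(x)
H_X_given_A = sum((len(g) / 32) * log2(len(g)) for g in groups.values())

print(H_X - H_X_given_A)                             # I(X;A) = 2.0 bits
```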
It is important to note that sharing mutual information with the environment is not sufficient to imply that a policy will reduce entropy more than a blind policy. This may be due to the policy just throwing away the information, or to the nature of the environmental dynamics being 'too random'. For example, in an environment which is totally unaffected by the choice of action, having mutual information between the action taken and the environmental state has no effect on the final distribution, so all blind and sighted policies will perform equivalently. Alternatively, actions may have mutual information with the environmental state, but use it in a way which actively reduces their performance in entropy reduction. In the 'guess the bits' example from earlier, a policy with access to all five environmental bits could be 'always choose the string which is the bitwise-NOT of the environmental string' (ie. if the environmental string was 01100, the policy would choose the string 10011). This policy would result in no reduction of entropy and would perform worse than the optimal blind policy, even though the action would have the maximal 5 bits of mutual information with the environmental state.
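To make that last point concrete, here is a small sketch (our own illustration) of the bitwise-NOT policy: it has the full 5 bits of mutual information with the environment, yet leaves the output distribution exactly as random as the input:

```python
# Sketch: a fully 'sighted' policy that wastes its information.
from collections import Counter
from math import log2

strings = [format(i, '05b') for i in range(32)]
flip = lambda x: ''.join('1' if b == '0' else '0' for b in x)   # bitwise NOT

counts = Counter('00000' if x == flip(x) else x for x in strings)
h_out = -sum((c / 32) * log2(c / 32) for c in counts.values())
print(h_out)   # 5.0 bits: the guess never matches x, so the output stays uniform
```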
As we stated (but didn't prove) above, the Touchette-Lloyd theorem says that

$$\Delta H \;\le\; \Delta H^{\max}_{\text{blind}} + I(X;A)$$
Let's summarize what each of these terms means.
We'll explain the term $\Delta H^{\max}_{\text{blind}}$ more in the next post, but it can be written as

$$\Delta H^{\max}_{\text{blind}} = \max_{p_X \in \mathcal{P}_X,\; p_A \in \mathcal{P}^{\text{blind}}_A} \Delta H$$

where $p_X$ is the initial distribution over $X$, $\mathcal{P}_X$ is the set of all possible distributions over $X$, $p_A$ is the distribution over actions corresponding to a blind policy, and $\mathcal{P}^{\text{blind}}_A$ is the set of all possible distributions over actions corresponding to blind policies.
We can interpret this inequality as a 'selection theorem' of the loose form, "optimization implies mutual information with the environment". If, for some initial distribution, we observe a policy which achieves an entropy reduction $\Delta H$ which is greater than $\Delta H^{\max}_{\text{blind}}$, then we can deduce that it must have at least $\Delta H - \Delta H^{\max}_{\text{blind}}$ bits of mutual information with the environment. We think that this is a pretty nice result.
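As a quick numerical sanity check (our own sketch, using the guessing game from earlier; note that $\Delta H^{\max}_{\text{blind}}$ is only stood in for here by the best blind policy we found, rather than by a full maximization):

```python
# Sketch: check ΔH <= ΔH_blind + I(X;A) for the 2-bit sighted policy in the game.
from collections import Counter
from math import log2

strings = [format(i, '05b') for i in range(32)]
f = lambda x, a: '00000' if x == a else x

def out_entropy(policy):
    counts = Counter(f(x, policy(x)) for x in strings)
    return -sum((c / 32) * log2(c / 32) for c in counts.values())

H_X = 5.0
dH_sighted = H_X - out_entropy(lambda x: x[:2] + '111')   # ~0.36 bits
dH_blind   = H_X - out_entropy(lambda x: '11111')         # ~0.06 bits
I_XA = 2.0                                                # see the calculation above
print(dH_sighted <= dH_blind + I_XA)                      # True
```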
In the next post, we will go through the proof of the theorem as well as some numerical examples to really clarify how this theorem works.
In particular, we discuss Theorem 10 from this paper. The result is also described in the paper 'Information-Theoretic Limits of Control' published in the year 2000. It also appeared earlier in the 1989 paper 'Use of mutual information to decrease entropy: Implications for the second law of thermodynamics' (paywalled) by Lloyd, which uses a more physics-based proof. In this post we rely primarily on the paper from 2004.
An idea discussed in an early post by Scott Garrabrant, elaborated on in a later post by John Wentworth, and recently written about by the authors in this post.
The agent structure problem also asks this about the other features of agents, like a value representation and a planning process.
While observations and models are different, they're tightly related. Models almost always come from observations, and often models are regularly updated in response to continued observations.
Some other work in this vein includes the good regulator theorem (along with John Wentworth's work on 'fixing' it) and the internal model principle.
You could think of $A$ as standing for action or agent.
See the sequel to this post for a worked-through numerical example.
In Measuring Optimization Power, Yudkowsky describes a tight connection between entropy and optimization.
From Utility Maximization=Description Length Minimization: "we can take any expected utility maximization problem, and decompose it into an entropy minimization term plus a 'make-the-world-look-like-this-specific-model' term."
As is conventional in information theory, all logarithms in this post are base 2.
The output distribution puts probability $\frac{2}{32}$ on '00000' and $\frac{1}{32}$ on each of the 30 other strings (excluding '11111'), giving $H(X') = \frac{2}{32}\log_2 16 + \frac{30}{32}\log_2 32 = 0.25 + 4.6875 \approx 4.94$ bits.