This post was written during the agent foundations fellowship with Alex Altair funded by the LTFF.

Introduction

This is the second part of a two posts series explaining the Internal Model Principle and how it might relate to AI Safety, particularly to Agent Foundations research. In the first post, we constructed a simplified version of IMP that was easier to understand and focused on building intuition about the theorem's assumptions. In this second post, we explain the general version of the theorem as stated by Cai&Wonham^[1] and discuss how it relates to alignment-relevant questions such as agent-structure problem and selection theorems.

We present the more general setup and explain Feedback and Regulation conditions. With those, we prove a first result. Then, we discuss the meaning of this first result and add another condition called "Observability" to derive another result. After this, we provide a worked out example and discuss possible extensions of the theorem.

Setup

The IMP from Cai&Wonham^[1] originally models the following control theory situation:

Suppose there's, for example, a chemical powerplant in a given environment. The temperature of the plant should be constant and the plant's temperature is affected by the environment's temperature, since they exchange heat, so there is a temperature system made of heaters and air conditioning systems inside the plant, controlled by a controller. This controller controls this systems: it receives information from the environment's temperature and passes a signal to the temperature inside the plant to counter-act the effect of the environment:

If the environment is hot, air conditionings are turned on.
If the environment is cold, heaters are turned on.

Naturally, one thing that could happen is that the controller could receive information from the environment and then, after that, control the temperature system. The theorem shows that, under specific circumstances, the controller actually "foresee" the environment's temperature and control the plant temperature concurrently to the environment's temperature changing. We use "foresee" because what the theorem really shows is that, under some circumstances, the controller is autonomous (i.e, it doesn't use outside information to decide how to control the temperature system) and it controls the temperature system in a way that counter-acts the environment's effect (even without using information from the environment).

In Cai&Wonham's^[1] book,

"Environment (E), Controller (C) and Plant (P) are dynamic systems which generate, via suitable output maps, the reference, control and output signals respectively. The objective of regulation is to ensure that the output signal coincides (eventually) with the reference, namely the system ‘tracks’. To this end the output is ‘fed back’ and compared (via ) to the reference, and the resulting tracking error signal used to ‘drive’ the controller. The latter in turn controls the plant, causing its output to approach the reference, so that the tracking error eventually (perhaps as $t \to \infty$ ) approaches ‘zero’. Our aim is to show that this setup implies, when suitably formalized, that the controller incorporates a model of the Environment: this statement is the Internal Model Principle."

We model the world as any set $X$ . For example, we could consider the whole world as $X = E \times C \times P$ , comprised of environment, controller and plant. We assume there's a discrete and deterministic world dynamics $α : X \to X$ that specifies how the world evolves.

We could also consider $P$ and $C$ as a single joint system called internal system (or just controller), and then we'd have $X = E \times C$ (we're calling $C \times P$ just $C$ ).

We could consider, for example, $X = E \times C \times P$ (left) or $X = E \times C$ (right)

We say a system is autonomous if its next state depends only on the previous state. For example, if $A$ and $B$ are sets, then the function $f : A \to A$ is autonomous and the function $g : A \times B \to A$ is not. We'll just say "A is autonomous" as a slight abuse of terminology. We'll usually write $X_{S} \subset S$ to denote that $X_{S}$ is an autonomous system inside $S$ - while $S$ don't need necessarily to be autonomous, for $S = E, C, P$ . We assume the environment is autonomous (intuitively, because we can't control the environment, it's a given) with update rule $α_{E} : X_{E} \to X_{E}$ , where $X_{E} \subseteq E$ . We assume, however, the environment is always autonomous, so $E = X_{E}$ .

We'll denote by $K \subset X$ the set of good states (that is, the set we want our system to be in after a large number of time-steps). It could be, for example, states where the plant's temperature is constant, for any environment state. A important thing to notice is that $K$ is not necessarily the set of states such that the system converges to, because it need not be $α$ -invariant - that is, we usually say a system converges to a part of it if it remains in that part after convergence - and it may not be the case that for $x \in K, α (x) \in K$ - we haven't added any specific condition ensuring that. We don't care about what happened "prior to convergence", so all the states and assumptions we care about will be true on $K$ .

When we consider $X = E \times C$ , the set of good states $K \subseteq X$ is a region on the $E \times C$ -plane. Note that $K$ is not necessarily a rectangle and it can be any region of the plane.

Since $K \subset X$ , we can project states out of $K$ . For example, if $X = E \times C \times P$ , we can write the canonical projection $π_{C} : E \times C \times P \to C$ given by $π_{C} ((x_{E}, x_{C}, x_{P})) = x_{C}$ . More generally, we could have any function $γ : X \to C$ acting as "projection". We define $X_{C} := γ (X) \subseteq C$ , that is, $X_{C}$ 's elements are controller states that we get from projecting good states to controller states. Recall that earlier we said we would use the " $X_{S}$ " notation to denote autonomous systems, and one of the conclusions of the theorem is that $X_{C}$ will be autonomous - i.e, we will be able to define a function $α_{C} : X_{C} \to X_{C}$ that is also interesting to us. At last, note that, by definition, $γ : X \to X_{C}$ is surjective.

So until now in our setup we have:

$X$ world, $α : X \to X$
$X_{E}$ environment, $α_{E} : X_{E} \to X_{E}$
$K \subseteq X$ set of good states
$γ : X \to X_{C}$ , $X_{C}$ denoting all the controller states we're interested in

At this point, the reader might be tempted to ask two things:

Ok, the controller states are obtained through the world via a "projection" $γ$ , but what about the environment states?
Is it always true that the environment and world dynamics are consistent with each other? For example, if $X = X_{E} \times X_{C}$ , it could be that $π_{E} (α (x_{E}, x_{C})) \neq α_{E} (π_{E} (x_{E}, x_{C}))$

To answer these questions, we'll first play around with the $X = X_{E} \times X_{C}$ setup, and then arrive at a property that can be abstractly imposed to other setups.

If $X = X_{E} \times X_{C}$ , we have that $X_{E} = π_{E} (X)$ . Here, it's not true that $X_{E} \subseteq X$ , but it's obvious that $X_{E}$ is "a part" of $X$ or "like a subset" of $X$ . We want a general property to relate $X_{E}$ and $X$ such that $X_{E}$ is "like a subset" of $X$ . In mathematics, we use the notion of insertion to work with this. If $A \subset B$ , one can trivially define an injection between $A$ and $B$ (the identity, for example). If $A ⊈ B$ , we think of $A$ as "like a subset" of $B$ if one can define an injective function $i : A \to B$ . So in general we can just ask for $X_{E}$ and $X$ to be such that there's an injective function $i_{E} : X_{E} \to X$ . This partly answers question 1. Now, we'll look closer at insertions to understand if any insertion is allowed.

(For the following, if you're not familiar with equivalence relations and partitions, go read this section of my first post)

Note that $π_{E} : X \to X_{E}$ is not necessarily invertible, but $π_{E} : X mod ker π_{E} \to X_{E}$ is a bijection (exercise for the reader), so invertible.

Each $x_{E}$ point determines a line $[x_{E}] = {(c, e) \in X; π_{E} (c, e) = x_{E}}$ . Then, we can define $π_{E}^{- 1} (x_{E}) = [x_{E}]$

$X mod ker π_{E}$ is a set containing lines parallel to the $X_{E}$ -axis, so while $π_{E} : X \to X_{E}$ maps points on $X$ to points on $X_{E}$ , $π_{E} : X mod ker π_{E} \to X_{E}$ maps lines to points on $X_{E}$ . The inverse $π_{E}^{- 1} : X_{E} \to X mod ker π_{E}$ takes a point in $X_{E}$ and map it to the specific parallel line such that, when projected over $X_{E}$ , it reaches $x_{E}$ . Hence, if we want to define an insertion $i_{E} : X_{E} \to X$ , we can do the following:

Given $x_{E}$ , compute $π_{E}^{- 1} (x_{E})$ , this is a line.
Choose any point in the line and define it to be $i_{E} (x_{E})$ .

By this procedure, if $X_{E} = R$ , for example, we can define an uncountable amount of insertions between $X_{E}$ and $X$ .

Note, however, that we want the environment to be part of the world, and thus to evolve consistently with the world. For a given $x_{E} \in X_{E}$ , we have two ways of producing updates of the world now:

Compute $i_{E} (x_{E}) \in X$ , then compute $α (i_{E} (x_{E})) \in X$
Compute $α_{E} (x_{E})$ , and then $i_{E} (α_{E} (x_{E})) \in X$

They don't necessarily need to be the same.

Thus, to answer question 1, we ask that $X_{E}$ and $X$ are such that there's $i_{E} : X_{E} \to X$ injective.

To answer question 2, we ask that this injection $i_{E}$ is such that $i_{E} (α_{E} (x_{E})) = α (i_{E} (x_{E})), \forall x_{E} \in X_{E}$ .

We define ${~ X}_{E} := i_{E} (X_{E})$ .

Note that, if $X_{E} \subseteq X$ , this is the same as asking $α |_{X_{E}} = α_{E}$ , so deep down we're asking something related to $X_{E}$ being $α$ -invariant. Indeed, ${~ X}_{E}$ is $α$ -invariant because:

if $x \in {~ X}_{E}$ , then $x = i_{E} (x_{E})$ and $α (x) = α (i_{E} (x_{E})) = i_{E} (α_{E} (x_{E})) \in {~ X}_{E}$

Assumptions

Feedback Condition

We want to state something that enables us to proof the controller is autonomous and it models the environment. In the first post, we argued that one way to ensure that is by asking the feedback condition:

ker γ \leq ker γ \circ α

We want to be able to model a situation where the controller, prior to convergence, needs information from the environment to "learn" how the environment works. Then, after a large number of time-steps, the controller learns the environment's behaviour and becomes autonomous, so instead of asking full feedback, we only ask the controller to be autonomous on good states, that is,

ker γ |_{K} \leq ker γ \circ α |_{K}

Regulation Condition

We want all environment states to be good states, i.e, to have reached convergence. Since $X_{E}$ is not necessarily a subset of $X$ , we ask

{~ X}_{E} \subseteq K

The Controller Models the Environment

With this assumptions and setup, we're already able to prove the first part of the theorem, that states that the controller models/tracks the environment

Conclusion (1)

There exists a unique mapping $α_{C} : X_{C} \to X_{C}$ determined by $α_{C} \circ γ |_{K} = γ \circ α |_{K}$

Proof:

Let $x_{C} \in X_{C}$ , then $x_{C} = γ (x)$ , for $x \in K$ . Define $α_{C} (x_{C}) = γ \circ α (x)$ . $α_{C}$ is well defined because if $x_{C} = γ (x) = γ (x^{'})$ , then $(x, x^{'}) \in ker γ |_{K} ⟹ (x, x^{'}) \in ker γ α |_{K} ⟹ γ (α (x)) = γ (α (x^{'}))$ .

The mapping $α_{C}$ is uniquely determined by $α_{C} \circ γ |_{K} = γ \circ α |_{K}$ : let $α_{C, 1}$ and $α_{C, 2}$ satisfying $α_{C, i} \circ γ |_{K} = γ \circ α |_{K}$ . Then, for any $x_{C} =\in X_{C}, x_{C} = γ (x)$ for some $x \in K$ , since $γ$ is surjective and thus $α_{C, 1} (x_{C}) = α_{C, 1} (γ (x)) = γ (α (x)) = α_{C, 2} (γ (x)) = α_{C, 2} (x_{C})$

This point guarantees that we have an autonomous $α_{C} : X_{C} \to X_{C}$ and that it

Conclusion (2)

It's true that $α_{C} \circ γ |_{{~ X}_{E}} = γ \circ α |_{{~ X}_{E}}$

Proof: Let $x \in {~ X}_{E}$ , since ${~ X}_{E} \subset K$ by the regulation condition, $x \in K$ and hence $α_{C} \circ γ (x) = γ \circ α (x) = γ \circ α |_{{~ X}_{E}} (x)$ , since $x \in {~ X}_{E}$ .

Note that if $X_{E} \subseteq X$ , we have $α |_{X_{E}} = α_{E}$ , so this states that $α_{C} \circ γ |_{X_{E}} = γ \circ α_{E}$ , that is, the controller tracks the environment. For more comments on why $α_{C} \circ γ |_{X_{E}} = γ \circ α_{E}$ means the controller tracks/models the environment, check out my first post and the section below.

Note that $α |_{{~ X}_{E}}$ can be thought of as the environment dynamics inserted into the world.

We say the "internal model" in the controller is the pair consisting of the set $X_{M} := γ |_{{~ X}_{E}} ({~ X}_{E})$ and the rule $α_{C} |_{X_{M}}$

In what sense is this a 'model'?

We now digress about the meaning of the theorem above.

On this session, it's helpful to think that $X_{E} \subseteq X$ , thus $α_{E} = α |_{X_{E}}$ . The second conclusion of the theorem is $α_{C} \circ γ |_{X_{E}} = γ |_{X_{E}} \circ α_{E}$ , that is, $α_{C}$ evolves according to $α_{E}$ autonomously: Consider the environment is in state $x_{E} \in X_{E}$ at time-step $t$ , its next state is $α_{E} (x_{E})$ and the internal system state (which can be calculated via $γ$ ) is $γ |_{X_{E}} (α_{E} (x_{E}))$ in time-step $t + 1$ . On the other hand, $γ (x_{E})$ is the system’s state in time-step $t$ , and in time-step $t + 1$ it is $α_{C} (γ (x_{E}))$ . The theorem ensures those two expressions for the internal system’s state in time-step $t + 1$ are equal.

This is qualitatively different from the internal system receiving information about the environment and one time-step later updating to track the environment. The internal system is updating accordingly to the environment in the same time-step.

To clarify this, consider an analogy of a dog running after a beetle. The dog is initially in the position $d_{0}$ and the beetle in the position $b_{0}$ . One thing that might happen is that the dog see the beetle's position and starts to move there. When the dog arrives, the beetle already left that position, and is now in a new position. So the dog starts moving to this new beetle position, and when he arrives, the beetle already left again, and so on.

Another qualitatively different thing that can happen is that the dog actually understands (i.e, "models") how the beetle will move. So instead of moving to the beetle's current position, it moves to where it predicts the beetle will be in the future, so that it can catch it.

The theorem also states $α_{C}$ is unique, so it can’t be the case that there’s other dynamics in a system that satisfies perfect regulation and feedback structure. Hence, the dynamics on the internal model will necessarily be the one that models the environment.

In the dog-beetle analogy, the theorem guarantees that the second scenario will necessarily happen if the dog behavior satisfyies our hypothesis.

The Controller Faithfully Models the Environment

To understand this session better, we recommend the reading of the first post. There, we explain more intuitively the key ideas used here.

In our first post, we were able to prove an analogous result with only feedback structure condition. There, we didn't need to worry about the difference between environment and world and good states. The last theorem is a particular scenario of this post's theorem when we have $X_{E} = K = X$ and $α_{E} = α |_{X_{E}}$ .

Analogous to the first post, the results proved above includes pathological cases, where, for example, the controller could be $X_{C} = {x_{C, 1}, x_{C, 2}}$ and the environment ${~ X}_{E} = N$ . Even if we had that $α_{C} \circ γ |_{{~ X}_{E}} = γ \circ α |_{{~ X}_{E}}$ , the controller wouldn't have the "expressivity" to faithfully model the environment: it doesn't have enough states to do so. We'll now introduce the observability condition and prove a lemma to conclude the controller faithfully models the environment if, additionally to the prior assumptions, observability is also satisfied.

Observability can be stated as

inf {ker γ \circ α^{n} |_{{~ X}_{E}}; n = 0, 1, 2, \dots} = ⊥

For intuition on what does this mean and why we ask this, check out this section of the first post

Generalized Feedback Lemma

We'll prove a generalization of the feedback structure condition in an analogous way we did in the first post, we'll use that together with observability to conclude the controller faithfully models the environment.

The result we want to prove is

ker γ |_{{~ X}_{E}} \leq ker γ \circ α |_{{~ X}_{E}} \leq ker γ \circ α^{2} |_{{~ X}_{E}} \leq ker γ \circ α^{3} |_{{~ X}_{E}} \leq \dots \leq ker γ \circ α^{k} |_{{~ X}_{E}} \leq \dots

We'll prove it by induction on $k$ .

(Base case): If $k = 1$ , by feedback condition, we know $ker γ |_{K} \leq ker γ \circ α |_{K}$ . Since ${~ X}_{E} \subseteq K$ by regulation condition, $(x, y) \in ker γ |_{{~ X}_{E}} ⟹ (x, y) \in ker γ |_{K} ⟹ (x, y) \in ker γ \circ α |_{K}$ , but since ${~ X}_{E}$ is $α$ -invariant, and $(x, y) \in {~ X}_{E}$ , then $α (x), α (y) \in {~ X}_{E}$ , so $(x, y) \in ker γ \circ α |_{{~ X}_{E}}$ .

(Induction Hypothesis): Suppose now the theorem holds up until $n$ .

(Inductive step) Let $x, y \in {~ X}_{E}$ , then, by definition of kernel of a function, $(x, y) \in ker γ \circ α^{n} |_{{~ X}_{E}} ⟺ (α^{n} (x), α^{n} (y)) \in ker γ |_{{~ X}_{E}}$

Let $s := α^{n} (x), t := α^{n} (y)$ . Since ${~ X}_{E}$ is $α$ -invariant, we know $s, t \in {~ X}_{E}$ . So the equivalence above gives us $(s, t) \in ker γ |_{{~ X}_{E}}$ . By the base case, we have that $(α (s), α (t)) \in ker γ \circ α |_{{~ X}_{E}}$ , but $(α (s), α (t)) = (α^{n + 1} (x), α^{n + 1} (y))$ , so the result follows by induction.

Conclusion (3)

$γ |_{{~ X}_{E}}$ is injective

Proof:

By generalized feedback lemma, $inf {ker γ \circ α^{n} |_{{~ X}_{E}}; n = 0, 1, 2, \dots} = ker γ |_{{~ X}_{E}}$
By observability, $inf {ker γ \circ α^{n} |_{K}; n = 0, 1, 2, \dots} = ⊥$ , but since ${~ X}_{E} \subset K$ , $ker γ \circ α^{n} |_{K} \leq ker γ \circ α^{n} |_{{~ X}_{E}}$ , for any $n$ . Hence, $inf {ker γ \circ α^{n} |_{{~ X}_{E}}; n = 0, 1, 2, \dots} \leq inf {ker γ \circ α^{n} |_{K}; n = 0, 1, 2, \dots} = ⊥$
So $ker γ |_{{~ X}_{E}} = ⊥$ , so $γ |_{{~ X}_{E}}$ injective

In the next section, we explain why $γ |_{{~ X}_{E}}$ being an injection provides us the meaning of faiyhfully modeling.

Cardinality and expressivity

In the first post, we concluded $γ : X \to X_{C}$ satisfying the theorem's hypothesis is necessarily a bijection. In terms of cardinality, this means, by definition, that $| X_{C} | = | X |$ .

In this post, we concluded $γ : X_{E} \to X_{C}$ injection. In cardinality terms, $| X_{E} | \leq | X_{C} |$ .

Note that the elements of $X_{E}$ fully represents the description of each environment state. Say the environment is the outside of a chemical plant and say the chemical reactions being done there depends on the temperature and pressure of the plant, which in turns depends on the temperature and pressure of the environment. Then, elements $x_{E} \in X_{E}$ could be ordered pairs $x_{E} = (T_{E}, P_{E})$ comprising the external temperature and pressure. On the other hand, if there's an additional relevant environment feature, say, the water delivered from the environment need to be treated (for example, because otherwise the water used in the reactions will be contaminated), one would want to consider an environment state as $x_{E} = (T_{E}, P_{E}, W_{E})$ , where $W_{E}$ stands for "quality of water". In other words, $W_{E}$ is a relevant feature and if we don't include it in the environment, it's as if the environment is not "expressive" enough to model the whole system, and thus even if the theorem applies and the controller of the plant develops an internal model, it could be that it runs poorly, i.e, have bad performance overall (note, though, that we haven't discussed what it means for an internal model to have good performance in reality).

The point here is that the elements of $X_{E}$ are the only piece of our framework that encompasses the notion of "expressivity". All the notion of "expressivity" in our setup comes from the sets $X_{E}$ and $X_{C}$ (for the controller, you could also say the expressivity comes from $γ$ , since $X_{C}$ is determined by $γ$ , i.e, $γ (X) = X_{C}$ ).

The example above illustrated one way the states can show expressivity, that is, by the structure of the set of states: If $X \subseteq R^{14}$ , we perhaps have 14 different relevant features, that could be independent or not. If $X \subseteq N^{10}$ , we have 10 features with natural values.

Another way the states have expressivity is via the state set cardinality, as we talked in the beginning of this section: if $X$ has $| X | = 10$ (that is, 10 elements), it can only represent 10 different states. If $| X | = | N |$ , then there's a countable quantity of different states. Recall that cardinality is defined via mappings:

$f : X \to Y$ injection $⟺ | X | \leq | Y |$
$f : X \to Y$ surjection $⟺ | X | \geq | Y |$
$f : X \to Y$ bijection $⟺ | X | = | Y |$

Thus, a state set $X$ exhibits expressivity in two ways:

Via $X$ 's internal structure
Via $X$ 's cardinality

Our theorem states that a necessary condition for the internal system to have an internal model of the environment is $| X_{E} | \leq | X_{C} |$ , that is, in order for the internal system to model the environment, it must exhibit at least as much expressivity as the controller (in the second sense of expressivity). In other words, if the internal system isn't at least as expressive as the environment, the internal system necessarily can't have an internal model of the environment.

Summarizing the whole theorem-meaning discussion,

The semantics of the world "model" in the theorem means faithfully tracking/simulating. That is, for states in ${~ X}_{E}$ (which the internal system views as $X_{M}$ ) the internal system doesn't wait for the environment and then update a timestep later following the environment. Instead, it updates accordingly to the environment in the same time step. We’ll discuss more about that in a later section
Conclusion (1) guarantees $α_{C}$ - the autonomous dynamics on the controller - is well defined and unique. Conclusion (2) states that $α_{C}$ simulates $α_{E}$ and Conclusion (3) says this simulation is faithful.

Comparing the less general version of the theorem in the first post with this version,

On the left, the setup of the first post. On the right, the setup of the second post - we're calling ${~ X}_{E} := X_{E}$ , since generally $X_{E}$ may not be a subset of $X$

This theorem does apply for systems where $γ$ is not a bijection. In fact, $γ$ is injective on $X_{E}$ and surjective on $K$ . It also seems useful to model a wider range of systems.
Our previous version of the theorem is a particular case of this, with $X_{E} = X = K$ and $α_{E} = α$ .
${~ X}_{E} \subseteq K$ and $X_{C} \subseteq γ (K)$ are the conditions ensuring all the states involved in the internal modeling are good states. Thus, this theorem is supposed to apply after the regulation has "converged" in some sense (such as $t \to \infty$ ).

Thus, modeling here means tracking faithfully.

Example

Consider an unidimensional circular grid of size $N$ (i.e, a row of $N$ squares such that the first square is connected to the last square) with a moving target and an agent that has the goal to pursue that target. The target and the agent can move left and right and are always inside some square of the grid.

The squares of our grid will be the set $N := {0, \dots, N - 1}$ .

We want the world to be able to describe the agent and the moving target, each moving in this $N$ sized grid, so we will consider a state of the world as an ordered pair $(i, j)$ with $i, j \in {0, \dots, N - 1} = N$ and so we'll define the world as $X := N \times N$ . Here, the first coordinate of the pair represents the position of the moving target and the second coordinate represents the position of the agent.

We will consider the environment to be the position of the moving target on the grid, thus $X_{E} := N$ .

We assume the moving target always moves one square to the right. That is, $α_{E} : X_{E} \to X_{E}$ is given by $α_{E} (i) := i + 1 mod N$ , where the $mod N$ is to account for the fact that the world is circular.

We'll define the dynamics of the world $α : X \to X$ by $α (i, j) = (i + 1 mod N, j + 1 mod N)$ if $i = j$ and $α (i, j) = (i + 1 mod N, j mod N)$ if $i \neq j$

Consider $j_{E} : X_{E} \to X$ defined by $j_{E} (x_{E}) := (x_{E}, x_{E})$ . Then, ${~ X}_{E} := j_{E} (X_{E}) \subseteq X$ and $j_{E} \circ α_{E} (x_{E}) = j_{E} (α_{E} (x_{E})) = j_{E} (x_{E} + 1 mod N) = (x_{E} + 1 mod N, x_{E} + 1 mod N) = α ((x_{E}, x_{E})) = α (j_{E} (x_{E})) = α \circ j_{E}$

In other words, let $α_{{~ X}_{E}} := j_{E} \circ α_{E} = α \circ j_{E} = α |_{{~ X}_{E}}$ (since $j_{E}$ is injective). Thus we can use the internal model theorem with ${~ X}_{E} = j_{E} (X_{E})$ as environment instead of $X_{E}$ directly.

For the agent,

$X_{C}$ should represent the states the agent can be in, hence $X_{C} := {0, \dots, N - 1}$
$γ : X \to X_{C}$ defined by
- Clearly, $γ$ is a surjection: given $j \in X_{C}, \exists x = (i, j) \in X$ s.t. $γ (x) = x_{C}, \forall i \in {0, \dots, N - 1}$
We want our agent to pursue the moving target, so we will define $K := {(i, j) \in X | i = j}$

This setup satisfy our hypothesis:

${~ X}_{E} = {(x_{E}, x_{E}) \in X | x_{E} \in X_{E}} \subseteq K = {(i, j) \in X | i = j}$ and $α_{{~ X}_{E}} = α |_{{~ X}_{E}}$
$γ (K) = N = X_{C}$
$ker γ |_{K} \leq ker γ \circ α |_{K}$ .
- Suppose $x, y \in K$ such that $(x, y) \in ker γ ⟺ γ (x) = γ (y)$ . Since $x, y \in K, x = (i, i)$ and $y = (j, j)$ . Then, $i = γ (x) = γ (y) = j ⟹ i = j$ . Now, $γ \circ α (i, i) = γ \circ α (j, j)$ since $i = j$ . Thus, $(x, y) \in ker γ \circ α$ .
${~ X}_{E}$ is $α$ -invariant.
- Let $(x_{E}, x_{E}) \in {~ X}_{E}$ , then $α ((x_{E}, x_{E})) = (x_{E} + 1 mod N, x_{E} + 1 mod N) = j_{E} (x_{E} + 1)$ if $x_{E} + 1 \leq N - 1$ and $= j_{E} (0) if x_{E} + 1 = N$ . That is, $α ((x_{E}, x_{E})) = j_{E} (x_{E}^{'}) \in {~ X}_{E}$ for some $x_{E}^{'} \in X_{E}$ .
$inf ker γ \circ α^{n} |_{{~ X}_{E}} = ⊥$
- This follows from feedback and regulation and $α$ -invariance of ${~ X}_{E}$ : Since feedback holds and ${~ X}_{E} \subseteq K$ , then $ker γ |_{{~ X}_{E}} \leq ker γ \circ α |_{{~ X}_{E}}$ . Because ${~ X}_{E}$ is $α$ -invariant, one can prove the generalized feedback lemma for the sequence $ker γ \circ α^{n} |_{{~ X}_{E}}$ . Then $inf ker γ \circ α^{n} |_{{~ X}_{E}} = ⊥$

Since the theorem's assumptions are all true, we already know that there's a unique $α_{C}$ determined by $α_{C} \circ γ = γ \circ α$ .

That is, $x_{C} = γ (x) = γ ((i, j)) = j$ , $α_{C} (x_{C}) = γ (α (x)) = γ (α (i, j))$ $=$ $j mod N$ if $i \neq j$ and $= j + 1 mod N$ if $i = j$ .

Then the pair $X_{M} = γ |_{{~ X}_{E}} ({~ X}_{E})$ and $α_{C} |_{X_{M}}$ is our internal model.

Discussion and further work

We presented the Internal Model Theorem statement, which basically states that, under a setup of an external system passing signals to an autonomous internal system syuch that these signals satisfy observability, then: feedback structure and regulation implies the controller necessarily have an internal model of the external system.

Some critiques:

As we discuss in the first post, the update rules involved are all deterministic. Thus, the theorem can't represent non-deterministic scenarios such as taking decisions under uncertainty
- Note, though, that the theorem is supposed to model regulation after some sort of equilibria or convergence. I expect intuitively many systems in these situations have at least an approximately deterministic behavior
It's really hard to come up with non-trivial examples for the theorem. This difficulty is particularly present when trying to come up with examples that satisfy Feedback and Observability together
- In fact, I'd be glad to see non-trivial examples of this theorem. If you know or construct one, please write in the comments or send it

IMP as a selection theorem

A selection theorem is a theorem that states something in the lines of "this is the type of agents we expect to find in environments with this specific type of selection pressure. Currently, the theorem doesn't say much because the controller is not properly an agent. It's an autonomous system, but it doesn't act on the world. Phrased as a selection theorem wannabe, the IMP states that in environments where the controller is autonomous in good states, the controller acts as/tracks the environment internal model.

We can think of extending the theorem in two directions

“Systems that approximately regulate their internal state have approximate internal models”, we would get a better version of a selection theorem.
- I’d expect this extended theorem to work for a wider range of real world systems.
- This still wouldn't make the controller look more agentic because it still wouldn't properly act on the world.
"The controller acts upon the environment but it has an internal model of the environment."
- This would make the controller more agentic.

The first point above would make the theorem more useful to real systems, while the last would make the controller feel more like an agent. I'd think those two extensions would provide a selection theorem.

Relevance to Agent Structure problem

The agent-like structure problem is the problem of determining if, given a policy that robustly optimizes far away regions of the space into small chunks of the state space, does this policy has agent structure (by agent structure we mean informally having an internal model and a search process)? Another way to phrase this questions is "under which types of environments is the implication above true?"

Alex gave a loose formalism to answer this question, making some notions more precise:

"If we take some class of behaving things and apply a filter for agent-like behavior, do we end up selecting things with agent-like architecture (or structure)? Under which conditions does agent-like behavior imply agent-like structure?"

This loose formalism consist of a policy and an environment. The policy receives an observation from the environment, updates its internal state and acts on the environment, changing its state. Then the environment sends a new observation to the policy and so on, in discrete time-steps. The policy is, thus, a function that sends each policy state and observation to a policy state. Analogously, the environment sends each environment state and action to an environment state.

We can define a class of different policies and different environments.

The idea of the formalism is being able to assess a function that associates to each policy in the policy class a number in $[0, 1]$ , thought of as the "degree of agent structure". We expect policies that perform well in a wide range of environments are highly “agentic”. Based on the performance of a policy in a wide range of environments, we want to be able to tell the degree of agent structure this policy has.

In Alex's words,

“One result we might wish to have is that if you tell me how many parameters are in your ML model, how long you ran it for, and what its performance was, then I could tell you the minimum amount of agent structure it must have”.

The important idea here is that we expect to be able to define a function $ϵ$ depending on the parameters, training, performance or other relevant variable such that it’s always true that the structure of a given policy is $\geq 1 - ϵ$ . Since we want the structure function to be between $0$ and $1$ , $ϵ = 1$ would always make this statement true, but we want this function to be zero in the limit, in some sense of limit.

More concretely, in the agent structure setup, we want to be able to:

Have a measure of performance of a given policy in a wide range of environments.
Filter the policy class (because a lookup table containing “optimal” moves for each environment in the environment class would have a very good performance, but we would expect a lookup table would be the policy with less agent structure amongst all policies).
Define $ϵ$ appropriately such that its limit is zero in some sense of asymptotics.

The IMP setup fails these conditions because

It doesn’t encompass a measure of performance of different policies. Actually, we’ve already proved the dynamics of the internal system is unique under IMP conditions, so we would actually have only one policy.
The IMP’s setup doesn’t seem to consider a policy acting over an environment. The internal system (the analogous of policy here) can be influenced by the environment (via $γ$ ), but can’t influence the environment. In a self-driving car analogy, it’s like the car being in position $x = 10$ or $x = 20$ , moving or not corresponds to same environment states.

We wish to extend the IMP in two different ways, addressing those two problems:

Modify the IMP to show that systems which approximately regulate their internal state must have approximate models of their environments.
1. This could give us a notion of performance and different policies.
Rework the IMP so that it applies to controllers regulating the external environment (rather than regulating their internal state).
1. This would solve the fact that in current IMP, the internal system doesn’t interact with the environment.

We expect if one can extend the theorem to these two different situations, it might give some insight on the agent structure problem.

^{^}
Supervisory Control of Discrete-Event Systems, (2019) Cai & Wonham as section 1.5.

LESSWRONG
LW