This post was written during the agent foundations fellowship with Alex Altair funded by the LTFF.
This is the second part of a two-post series explaining the Internal Model Principle (IMP) and how it might relate to AI safety, particularly to agent foundations research. In the first post, we constructed a simplified version of the IMP that was easier to understand and focused on building intuition about the theorem's assumptions. In this second post, we explain the general version of the theorem as stated by Cai & Wonham[1] and discuss how it relates to alignment-relevant questions such as the agent-structure problem and selection theorems.
We present the more general setup and explain the Feedback and Regulation conditions. With those, we prove a first result. Then, we discuss the meaning of this first result and add another condition called "Observability" to derive a second result. After this, we provide a worked-out example and discuss possible extensions of the theorem.
The IMP from Cai & Wonham[1] originally models the following control-theory situation:
Suppose there's, for example, a chemical power plant in a given environment. The plant's temperature should be constant, but it is affected by the environment's temperature, since they exchange heat. So there is a temperature system inside the plant, made of heaters and air conditioning units, controlled by a controller. The controller receives information about the environment's temperature and passes a signal to the temperature system inside the plant to counteract the effect of the environment:
Naturally, one thing that could happen is that the controller receives information from the environment and only then, after that, controls the temperature system. The theorem shows that, under specific circumstances, the controller actually "foresees" the environment's temperature and controls the plant's temperature concurrently with the environment's temperature changing. We use "foresee" because what the theorem really shows is that, under some circumstances, the controller is autonomous (i.e., it doesn't use outside information to decide how to control the temperature system) and it controls the temperature system in a way that counteracts the environment's effect (even without using information from the environment).
In Cai & Wonham's[1] book,
"Environment (E), Controller (C) and Plant (P) are dynamic systems which generate, via suitable output maps, the reference, control and output signals respectively. The objective of regulation is to ensure that the output signal coincides (eventually) with the reference, namely the system ‘tracks’. To this end the output is ‘fed back’ and compared (via subtraction) to the reference, and the resulting tracking error signal used to ‘drive’ the controller. The latter in turn controls the plant, causing its output to approach the reference, so that the tracking error eventually (perhaps as $t \to \infty$) approaches ‘zero’. Our aim is to show that this setup implies, when suitably formalized, that the controller incorporates a model of the Environment: this statement is the Internal Model Principle."
We model the world as any set $W$. For example, we could consider the whole world as $W = E \times C \times P$, comprised of environment, controller and plant. We assume there's a discrete and deterministic world dynamics $f: W \to W$ that specifies how the world evolves.
We could also consider $C$ and $P$ as a single joint system called the internal system (or just the controller), and then we'd have $W = E \times C$ (we're calling $C \times P$ just $C$).
We say a system is autonomous if its next state depends only on its own previous state. For example, if $A$ and $B$ are sets, then a function $A \to A$ is autonomous and a function $A \times B \to A$ is not. We'll just say "$A$ is autonomous" as a slight abuse of terminology. We'll usually write $(A, f_A)$ to denote that $A$ is an autonomous system inside $W$ with update rule $f_A: A \to A$, while $B$ doesn't necessarily need to be autonomous, for $W = A \times B$. We assume the environment is always autonomous (intuitively, because we can't control the environment: it's a given), with update rule $f_E: E \to E$, so the world dynamics has the form $f(e, c) = (f_E(e), f_C(e, c))$ for some (not necessarily autonomous) $f_C: E \times C \to C$.
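As a toy illustration (my own sketch, not from the post), here is a finite world $W = E \times C$ in Python whose environment component evolves autonomously while the controller component does not:

```python
# Toy world W = E x C with E = C = {0, 1, 2, 3}.
# The environment update depends only on e; the controller update may
# depend on both e and c, so C alone is not autonomous here.

def f_E(e):
    # autonomous environment dynamics
    return (e + 1) % 4

def f(w):
    # world dynamics f(e, c) = (f_E(e), f_C(e, c))
    e, c = w
    return (f_E(e), (c + e) % 4)

# the E-component of any world trajectory follows f_E on its own
w = (2, 0)
assert f(w)[0] == f_E(2)
```

Note that no analogous statement holds for the second component: the controller's next state genuinely depends on $e$.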
We'll denote by $G \subseteq W$ the set of good states (that is, the set we want our system to be in after a large number of time-steps). It could be, for example, the set of states where the plant's temperature is constant, for any environment state. An important thing to notice is that $G$ is not necessarily the set of states the system converges to, because it need not be $f$-invariant. That is, we usually say a system converges to a part of the state space if it remains in that part after convergence, and it may not be the case that $f(w) \in G$ for $w \in G$: we haven't added any specific condition ensuring that. We don't care about what happens "prior to convergence", so all the statements and assumptions we care about will be imposed on $G$.
Since $W = E \times C$, we can project states out of $W$. For example, we can write the canonical projection $\pi_C: W \to C$ given by $\pi_C(e, c) = c$. More generally, we could have any function acting as "projection". We define $C_G := \pi_C(G)$, that is, $C_G$'s elements are the controller states that we get from projecting good states to controller states. Recall that earlier we said we would use the "$(A, f_A)$" notation to denote autonomous systems, and one of the conclusions of the theorem is that $C_G$ will be autonomous, i.e., we will be able to define a function $f_{C_G}: C_G \to C_G$ that is also interesting to us. At last, note that, by definition, $\pi_C: G \to C_G$ is surjective.
So until now in our setup we have:
At this point, the reader might be tempted to ask two things:
To answer these questions, we'll first play around with the setup, and then arrive at a property that can be abstractly imposed on other setups.
If $W = \mathbb{R}^2$ and $E = \mathbb{R}$, we have that $\pi_E(x, y) = x$. Here, it's not true that $E \subseteq W$ (numbers are not ordered pairs), but it's obvious that $E$ is "a part" of $W$ or "like a subset" of $W$. We want a general property to relate $E$ and $W$ such that $E$ is "like a subset" of $W$. In mathematics, we use the notion of insertion to work with this. If $E \subseteq W$, one can trivially define an injection between $E$ and $W$ (the inclusion, for example). If $E \not\subseteq W$, we think of $E$ as "like a subset" of $W$ if one can define an injective function $\iota: E \to W$. So in general we can just ask for $E$ and $W$ to be such that there's an injective function $\iota: E \to W$. This partly answers question 1. Now, we'll look closer at insertions to understand if any insertion is allowed.
(For the following, if you're not familiar with equivalence relations and partitions, go read this section of my first post)
Note that $\pi_E$ is not necessarily invertible, but the induced map $\bar{\pi}_E: W/\ker(\pi_E) \to E$ is a bijection (exercise for the reader), so invertible.
$W/\ker(\pi_E)$ is a set containing lines parallel to the $y$-axis, so while $\pi_E$ maps points of $W$ to points of $E$, $\bar{\pi}_E$ maps lines to points of $E$. The inverse $\bar{\pi}_E^{-1}$ takes a point $x \in E$ and maps it to the specific parallel line such that, when projected over $E$, it reaches $x$. Hence, if we want to define an insertion, we can do the following:
By this procedure, if $W = \mathbb{R}^2$, for example, we can define an uncountable amount of insertions between $E$ and $W$.
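To make the "one insertion per choice of representatives" idea concrete, here is a sketch (names invented) for $W = \mathbb{R}^2$ and $E = \mathbb{R}$: every function $s: E \to \mathbb{R}$ picks one point on each vertical line, and so defines a different insertion:

```python
# Each choice function s picks a representative (x, s(x)) on the vertical
# line over x, giving an insertion iota_s: E -> W with pi_E(iota_s(x)) = x.

def pi_E(w):
    x, y = w
    return x

def make_insertion(s):
    return lambda x: (x, s(x))

iota_axis = make_insertion(lambda x: 0.0)     # insert E as the x-axis
iota_parab = make_insertion(lambda x: x * x)  # insert E as a parabola

for iota in (iota_axis, iota_parab):
    # both are injective right inverses of the projection
    assert pi_E(iota(3.0)) == 3.0
```

Since there are uncountably many choice functions $s$, there are uncountably many insertions.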
Note, however, that we want the environment to be part of the world, and thus to evolve consistently with the world. For a given $e \in E$, we now have two ways of producing "the update of $e$ inside $W$": update in $E$ and then insert, giving $\iota(f_E(e))$; or insert into $W$ and then update there, giving $f(\iota(e))$.
They don't necessarily need to be the same.
Thus, to answer question 1, we ask that $E$ and $W$ are such that there's an injective $\iota: E \to W$.
To answer question 2, we ask that this injection is such that $f \circ \iota = \iota \circ f_E$.
We define $E' := \iota(E) \subseteq W$.
Note that, if $E \subseteq W$ and $\iota$ is the inclusion, this is the same as asking $f(e) = f_E(e)$ for all $e \in E$, so deep down we're asking something related to $E$ being $f$-invariant. Indeed, $\iota(E)$ is $f$-invariant because: $f(\iota(e)) = \iota(f_E(e)) \in \iota(E)$.
We want to state something that enables us to prove that the controller is autonomous and that it models the environment. In the first post, we argued that one way to ensure that is by asking the feedback condition: for all $w, w' \in W$, $\pi_C(w) = \pi_C(w') \implies \pi_C(f(w)) = \pi_C(f(w'))$ (the controller's next state depends only on the controller's current state).
We want to be able to model a situation where the controller, prior to convergence, needs information from the environment to "learn" how the environment works. Then, after a large number of time-steps, the controller learns the environment's behaviour and becomes autonomous. So instead of asking for full feedback, we only ask the controller to be autonomous on good states, that is: for all $w, w' \in G$, $\pi_C(w) = \pi_C(w') \implies \pi_C(f(w)) = \pi_C(f(w'))$.
We want all environment states to be good states, i.e., to have reached convergence. Since $E$ is not necessarily a subset of $W$, we ask $\iota(E) \subseteq G$.
With these assumptions and setup, we're already able to prove the first part of the theorem, which states that the controller models/tracks the environment.
There exists a unique mapping $f_{C_G}: C_G \to C_G$ determined by $f_{C_G} \circ \pi_C = \pi_C \circ f$ on $G$, i.e., $f_{C_G}(\pi_C(w)) = \pi_C(f(w))$ for all $w \in G$.
Proof:
Let $c \in C_G$; then $c = \pi_C(w)$, for some $w \in G$. Define $f_{C_G}(c) := \pi_C(f(w))$. $f_{C_G}$ is well defined because if $\pi_C(w) = \pi_C(w')$ for $w, w' \in G$, then $\pi_C(f(w)) = \pi_C(f(w'))$ by the feedback condition.
The mapping is uniquely determined by $f_{C_G} \circ \pi_C = \pi_C \circ f$ on $G$: let $g_1$ and $g_2$ both satisfy $g_i \circ \pi_C = \pi_C \circ f$ on $G$. Then, for any $c \in C_G$ we have $c = \pi_C(w)$ for some $w \in G$, since $\pi_C: G \to C_G$ is surjective, and thus $g_1(c) = \pi_C(f(w)) = g_2(c)$.
This point guarantees that we have an autonomous $(C_G, f_{C_G})$; the next point guarantees that it tracks the environment.
It's true that $f_{C_G} \circ \pi_C \circ \iota = \pi_C \circ \iota \circ f_E$.
Proof: Let $e \in E$. Since $\iota(e) \in G$ by the regulation condition, $f_{C_G}(\pi_C(\iota(e))) = \pi_C(f(\iota(e)))$, and hence $f_{C_G}(\pi_C(\iota(e))) = \pi_C(\iota(f_E(e)))$, since $f \circ \iota = \iota \circ f_E$.
Note that if we define $h := \pi_C \circ \iota$, we have $f_{C_G} \circ h = h \circ f_E$, so this states that $h$ intertwines the environment's dynamics with the controller's, that is, the controller tracks the environment. For more comments on why $f_{C_G} \circ h = h \circ f_E$ means the controller tracks/models the environment, check out my first post and the section below.
Note that $\iota \circ f_E \circ \iota^{-1}$ (equivalently, $f$ restricted to $\iota(E)$) can be thought of as the environment dynamics inserted into the world.
We say the "internal model" in the controller is the pair consisting of the set $C_G$ and the rule $f_{C_G}$.
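The proof's construction can be carried out mechanically on a finite system. Below is a sketch (the tiny world is invented for illustration) that builds $f_{C_G}$ as a lookup table, using exactly the well-definedness check from the proof:

```python
# Build f_CG on C_G = pi_C(G) by f_CG(pi_C(w)) := pi_C(f(w)) for w in G.
# The feedback condition is precisely what makes two representatives of
# the same controller state agree, which the assert below checks.

def build_f_CG(G, f, pi_C):
    table = {}
    for w in G:
        c, c_next = pi_C(w), pi_C(f(w))
        # two w's with the same pi_C(w) must give the same pi_C(f(w))
        assert table.get(c, c_next) == c_next, "feedback condition violated"
        table[c] = c_next
    return table

# tiny invented world W = E x C with E = C = {0, 1}
f = lambda w: ((w[0] + 1) % 2, (w[1] + 1) % 2)
pi_C = lambda w: w[1]
G = [(0, 0), (1, 0), (1, 1)]   # note: (0,0) and (1,0) share pi_C = 0

f_CG = build_f_CG(G, f, pi_C)
assert f_CG == {0: 1, 1: 0}
```

Uniqueness is visible here too: the table's entries are forced by $f$ and $\pi_C$, so no other choice of $f_{C_G}$ satisfies the defining equation.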
We now digress about the meaning of the theorem above.
In this section, it's helpful to think that $E \subseteq W$ and that $\iota$ is the inclusion, thus $h = \pi_C \circ \iota = \pi_C|_E$. The second conclusion of the theorem is $f_{C_G} \circ h = h \circ f_E$, that is, $C_G$ evolves according to $f_{C_G}$ autonomously: consider that the environment is in state $e$ at time-step $t$; its next state is $f_E(e)$, and the internal system state (which can be calculated via $h$) at time-step $t+1$ is $h(f_E(e))$. On the other hand, $h(e)$ is the internal system's state at time-step $t$, and at time-step $t+1$ it is $f_{C_G}(h(e))$. The theorem ensures those two expressions for the internal system's state at time-step $t+1$ are equal.
This is qualitatively different from the internal system receiving information about the environment and, one time-step later, updating to track it. The internal system updates in accordance with the environment within the same time-step.
To clarify this, consider the analogy of a dog running after a beetle. The dog starts at one position and the beetle at another. One thing that might happen is that the dog sees the beetle's position and starts to move there. When the dog arrives, the beetle has already left that position and is now somewhere new. So the dog starts moving to this new beetle position, and when it arrives, the beetle has already left again, and so on.
Another, qualitatively different, thing that can happen is that the dog actually understands (i.e., "models") how the beetle moves. So instead of moving to the beetle's current position, it moves to where it predicts the beetle will be in the future, so that it can catch it.
The theorem also states that $f_{C_G}$ is unique, so it can't be the case that there's some other dynamics on $C_G$ in a system that satisfies perfect regulation and feedback structure. Hence, the dynamics on the internal model will necessarily be the one that models the environment.
In the dog-beetle analogy, the theorem guarantees that the second scenario will necessarily happen if the dog's behavior satisfies our hypotheses.
To understand this section better, we recommend reading the first post. There, we explain the key ideas used here more intuitively.
In our first post, we were able to prove an analogous result with only the feedback structure condition. There, we didn't need to worry about the difference between environment and world, nor about good states. The first post's theorem is the particular scenario of this post's theorem in which $G = W$ and $\iota$ is the identity.
Analogously to the first post, the results proved above include pathological cases where, for example, the controller could be $C = \{c\}$ and the environment $E = \{1, \dots, 10\}$. Even if we had $f_{C_G} \circ h = h \circ f_E$, the controller wouldn't have the "expressivity" to faithfully model the environment: it doesn't have enough states to do so. We'll now introduce the observability condition and prove a lemma to conclude that the controller faithfully models the environment if, in addition to the prior assumptions, observability is also satisfied.
Observability can be stated as: for all $w, w' \in \iota(E)$, if $\pi_C(f^n(w)) = \pi_C(f^n(w'))$ for every $n \geq 0$, then $w = w'$.
For intuition on what this means and why we ask for it, check out this section of the first post.
We'll prove a generalization of the feedback structure condition in a way analogous to the first post; we'll then use it together with observability to conclude that the controller faithfully models the environment.
The result we want to prove is: for all $w, w' \in \iota(E)$ and all $n \geq 1$, if $\pi_C(w) = \pi_C(w')$ then $\pi_C(f^n(w)) = \pi_C(f^n(w'))$.
We'll prove it by induction on $n$.
(Base case): If $n = 1$: by the feedback condition, we know that $\pi_C(u) = \pi_C(u') \implies \pi_C(f(u)) = \pi_C(f(u'))$ for $u, u' \in G$. Since $w, w' \in \iota(E) \subseteq G$ by the regulation condition, this applies to $w$ and $w'$, so $\pi_C(f(w)) = \pi_C(f(w'))$. Moreover, since $\iota(E)$ is $f$-invariant, $f(w), f(w') \in \iota(E) \subseteq G$, so the condition can be applied again at the next step.
(Induction hypothesis): Suppose now the result holds up to step $n$.
(Inductive step): Let $w, w' \in \iota(E)$ with $\pi_C(w) = \pi_C(w')$. By the induction hypothesis, $\pi_C(f^n(w)) = \pi_C(f^n(w'))$; then, by the definition of the kernel of a function, $(f^n(w), f^n(w')) \in \ker(\pi_C)$.
Let $u := f^n(w)$ and $u' := f^n(w')$. Since $\iota(E)$ is $f$-invariant, we know $u, u' \in \iota(E) \subseteq G$. So the equivalence above, together with the base case applied to $u$ and $u'$, gives us $\pi_C(f(u)) = \pi_C(f(u'))$, that is, $\pi_C(f^{n+1}(w)) = \pi_C(f^{n+1}(w'))$, so the result follows by induction.
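The lemma can also be checked by brute force on a finite instance. The following sketch (the dynamics is invented, in the spirit of the pursuit example later in the post) iterates the dynamics and confirms that, on the $f$-invariant set $\iota(E)$, equal controller projections stay equal at every step:

```python
# Finite check of the iterated-feedback lemma on iota(E).

n = 4
f_E = lambda t: (t + 1) % n
# controller chases the environment; on the diagonal it moves in lockstep
f = lambda w: (f_E(w[0]), (w[1] + 1) % n if w[1] == w[0] else w[0])
pi_C = lambda w: w[1]
iota_E = [(t, t) for t in range(n)]   # image of the insertion, f-invariant

def f_iter(w, k):
    # apply the world dynamics k times
    for _ in range(k):
        w = f(w)
    return w

# pi_C(w) == pi_C(v) on iota(E) implies pi_C(f^k(w)) == pi_C(f^k(v))
assert all(pi_C(f_iter(w, k)) == pi_C(f_iter(v, k))
           for w in iota_E for v in iota_E if pi_C(w) == pi_C(v)
           for k in range(8))
```

This is only a sanity check for one concrete system, not a substitute for the inductive proof above.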
$h = \pi_C \circ \iota: E \to C_G$ is injective
Proof: Let $e, e' \in E$ with $h(e) = h(e')$, i.e., $\pi_C(\iota(e)) = \pi_C(\iota(e'))$. By the lemma above, $\pi_C(f^n(\iota(e))) = \pi_C(f^n(\iota(e')))$ for every $n$. By observability, $\iota(e) = \iota(e')$, and since $\iota$ is injective, $e = e'$.
In the next section, we explain why $h$ being an injection gives us the meaning of faithfully modeling.
In the first post, we concluded that an $h$ satisfying the theorem's hypotheses is necessarily a bijection. In terms of cardinality, this means, by definition, that $|E| = |C|$.
In this post, we concluded that $h$ is an injection. In cardinality terms, $|E| \leq |C_G|$.
Note that the elements of $E$ fully represent the description of each environment state. Say the environment is the outside of a chemical plant, and say the chemical reactions being done there depend on the temperature and pressure of the plant, which in turn depend on the temperature and pressure of the environment. Then, elements of $E$ could be ordered pairs $(T, p)$ comprising the external temperature and pressure. On the other hand, if there's an additional relevant environment feature (say, the water delivered from the environment needs to be treated, because otherwise the water used in the reactions will be contaminated), one would want to consider an environment state as a triple $(T, p, q)$, where $q$ stands for "quality of water". In other words, $q$ is a relevant feature, and if we don't include it in the environment, it's as if the environment is not "expressive" enough to model the whole system. Thus, even if the theorem applies and the controller of the plant develops an internal model, it could be that it runs poorly, i.e., has bad overall performance (note, though, that we haven't discussed what it means for an internal model to have good performance in reality).
The point here is that the elements of the state sets are the only piece of our framework that encompasses the notion of "expressivity". All the notion of "expressivity" in our setup comes from the sets $E$ and $C$ (for the controller, you could also say the expressivity comes from $C_G$, since $C_G$ is determined by $G$ and $\pi_C$, i.e., $C_G = \pi_C(G)$).
The example above illustrated one way the states can show expressivity, that is, by the structure of the set of states: if $E = \mathbb{R}^{14}$, we perhaps have 14 different relevant features, which could be independent or not. If $E = \mathbb{N}^{10}$, we have 10 features with natural values.
Another way the states have expressivity is via the state set's cardinality, as we mentioned at the beginning of this section: if $E$ has $|E| = 10$ (that is, 10 elements), it can only represent 10 different states. If $E = \mathbb{N}$, then there's a countable quantity of different states. Recall that cardinality is defined via mappings: $|A| \leq |B|$ iff there's an injection $A \to B$, and $|A| = |B|$ iff there's a bijection $A \to B$.
Thus, a state set exhibits expressivity in two ways:
Our theorem states that a necessary condition for the internal system to have an internal model of the environment is $|E| \leq |C_G|$; that is, in order for the internal system to model the environment, it must exhibit at least as much expressivity as the environment (in the second sense of expressivity). In other words, if the internal system isn't at least as expressive as the environment, the internal system can't have an internal model of the environment.
Summarizing the whole theorem-meaning discussion,
Comparing the less general version of the theorem in the first post with this version,
Thus, modeling here means tracking faithfully.
Consider a one-dimensional circular grid of size $n$ (i.e., a row of $n$ squares such that the first square is connected to the last square) with a moving target and an agent whose goal is to pursue that target. The target and the agent can move left and right and are always inside some square of the grid.
The squares of our grid will be the set $\mathbb{Z}_n = \{0, 1, \ldots, n-1\}$.
We want the world to be able to describe the agent and the moving target, each moving in this $n$-sized grid, so we will consider a state of the world as an ordered pair $(t, a)$ with $t, a \in \mathbb{Z}_n$, and so we'll define the world as $W = \mathbb{Z}_n \times \mathbb{Z}_n$. Here, the first coordinate of the pair represents the position of the moving target and the second coordinate represents the position of the agent.
We will consider the environment to be the position of the moving target on the grid, thus $E = \mathbb{Z}_n$.
We assume the moving target always moves one square to the right. That is, $f_E$ is given by $f_E(t) = t + 1 \pmod{n}$, where the $\operatorname{mod} n$ is to account for the fact that the grid is circular.
We'll define the dynamics of the world by $f(t, a) = (t + 1, a + 1)$ if $a = t$ and $f(t, a) = (t + 1, t)$ if $a \neq t$, with all coordinates taken $\operatorname{mod} n$: off the target, the agent moves to the target's last position; once it catches the target, it moves in lockstep with it. The good states are $G = \{(t, t) : t \in \mathbb{Z}_n\}$, where the agent is on top of the target.
Consider $\iota: E \to W$ defined by $\iota(t) = (t, t)$. Then, $\iota(E) = G$ and $f(\iota(t)) = (t + 1, t + 1) = \iota(f_E(t))$.
In other words, let $E' := \iota(E) \cong E$ (the identification holds since $\iota$ is injective). Thus we can use the internal model theorem with $E'$ as environment instead of $E$ directly.
For the agent, the state set is $C = \mathbb{Z}_n$: its position on the grid.
This setup satisfy our hypothesis:
Since the theorem's assumptions are all true, we already know that there's a unique $f_{C_G}$ determined by $f_{C_G} \circ \pi_C = \pi_C \circ f$ on $G$: here, $C_G = \pi_C(G) = \mathbb{Z}_n$ and $f_{C_G}(a) = a + 1 \pmod{n}$.
Then the pair consisting of the set $C_G = \mathbb{Z}_n$ and the rule $f_{C_G}(a) = a + 1 \pmod{n}$ is our internal model.
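The whole example can be run end to end. The sketch below uses a pursuit rule of "move to the target's last position, then move in lockstep" (an illustrative assumption, as is the choice $n = 5$) and verifies regulation and the tracking equation $f_{C_G} \circ h = h \circ f_E$:

```python
# Grid-pursuit example: target moves right; the agent chases it.

n = 5
E = range(n)
G = [(t, t) for t in E]            # good states: agent on top of the target

f_E = lambda t: (t + 1) % n
def f(w):
    t, a = w
    # off the target, move to the target's last position; on it, lockstep
    return (f_E(t), (a + 1) % n if a == t else t)

pi_C = lambda w: w[1]
iota = lambda t: (t, t)
h = lambda t: pi_C(iota(t))

# regulation: iota(E) lies in G and commutes with the dynamics
assert all(iota(t) in G and f(iota(t)) == iota(f_E(t)) for t in E)

# the internal model given by the theorem: f_CG(a) = a + 1 mod n
f_CG = {pi_C(w): pi_C(f(w)) for w in G}
assert f_CG == {a: (a + 1) % n for a in range(n)}

# tracking: f_CG(h(t)) == h(f_E(t)) for every environment state
assert all(f_CG[h(t)] == h(f_E(t)) for t in E)
```

Note how the internal model $(\mathbb{Z}_n,\ a \mapsto a + 1)$ is literally a copy of the environment's dynamics, recovered purely from the world dynamics restricted to the good states.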
We presented the Internal Model Theorem statement, which basically states that, under a setup of an external system passing signals to an internal system such that these signals satisfy observability, the feedback structure and regulation conditions imply that the controller is autonomous on good states and necessarily has an internal model of the external system.
Some critiques:
A selection theorem is a theorem that states something along the lines of "this is the type of agent we expect to find in environments with this specific type of selection pressure". Currently, the theorem doesn't say much, because the controller is not properly an agent. It's an autonomous system, but it doesn't act on the world. Phrased as a selection-theorem wannabe, the IMP states that in environments where the controller is autonomous on good states, the controller tracks the environment via an internal model.
We can think of extending the theorem in two directions
The first point above would make the theorem more applicable to real systems, while the second would make the controller feel more like an agent. I think those two extensions together would provide a selection theorem.
The agent-like structure problem is the problem of determining whether, given a policy that robustly optimizes far-away regions of the state space into small chunks of the state space, this policy has agent structure (by agent structure we mean, informally, having an internal model and a search process). Another way to phrase this question is "under which types of environments is the implication above true?"
Alex gave a loose formalism to answer this question, making some notions more precise:
"If we take some class of behaving things and apply a filter for agent-like behavior, do we end up selecting things with agent-like architecture (or structure)? Under which conditions does agent-like behavior imply agent-like structure?"
This loose formalism consists of a policy and an environment. The policy receives an observation from the environment, updates its internal state and acts on the environment, changing its state. Then the environment sends a new observation to the policy, and so on, in discrete time-steps. The policy is, thus, a function that sends each pair of policy state and observation to a policy state. Analogously, the environment sends each pair of environment state and action to an environment state.
We can define a class of different policies and different environments.
The idea of the formalism is to be able to define a function that associates to each policy in the policy class a number in $[0, 1]$, thought of as its "degree of agent structure". We expect policies that perform well in a wide range of environments to be highly "agentic". Based on the performance of a policy in a wide range of environments, we want to be able to tell the degree of agent structure this policy has.
In Alex's words,
“One result we might wish to have is that if you tell me how many parameters are in your ML model, how long you ran it for, and what its performance was, then I could tell you the minimum amount of agent structure it must have”.
The important idea here is that we expect to be able to define a function $s$, depending on the parameters, training, performance or other relevant variables, such that it's always true that the structure of a given policy is at least $s$. Since we want the structure function to be between $0$ and $1$, the trivial bound $s = 0$ would always make this statement true, but we want this bound to be nonzero in the limit, in some sense of limit.
More concretely, in the agent structure setup, we want to be able to:
The IMP setup fails these conditions because
We wish to extend the IMP in two different ways, addressing those two problems:
We expect that if one can extend the theorem to these two different situations, it might give some insight into the agent structure problem.
Supervisory Control of Discrete-Event Systems (2019), Cai & Wonham, section 1.5.