*Thanks to John Wentworth for helpful critique.*

## 0. Introduction

Agency/goal-driven behavior is a complex thing. By definition, it implies a system that is deliberately choosing to take actions that will achieve the goal it wants. That requires, at the least, an internalized "goal", a world-model, a mechanism that allows the world-model to be run iteratively as the agent searches for optimal actions, and so on.

But taking some intuitive, idealized, pre-established notion of "agency", and translating these intuitions into formalisms, is somewhat backwards. If our goal is to understand the kinds of AI we'll get, a more natural question seems to be: "what properties of the environment make goal-driven behavior *necessary*, and what shape do the internal mechanisms for it *need* to take?"

Going further, we can notice that SGD and evolution are both incremental processes, and agency is too complex to develop from scratch. So we can break the question down further: "what properties of the environment make the *building blocks* of agency necessary?". How are these building blocks useful in the intermediary stages, before they're assimilated into the agency engine? What pressures these building blocks to incrementally improve, until they become precise/advanced enough to serve the needs of agency?

This post attempts to look at the incremental development of one of these building blocks, world models, and at the environment conditions that make them necessary and incentivize improving their accuracy.

## 1. Intuitions

When trying to identify the direction of incremental improvement, it helps to imagine two concrete examples at different ends of the spectrum, then try to figure out the intermediary stages between them.

At one end of the spectrum, we have glorified look-up tables. Their entire policy is loaded into them during development/training, and once they're deployed, it's not modified.

At the other end of the spectrum, we have idealized agents. Idealized agents model the environment and adapt their policy to it, on a moment-to-moment basis, trying to perform well in the specific situation they're in.

In-between, we have things like humans, who use intensive action-optimization reminiscent of idealized agents *in tandem* with heuristics and built-in instincts, which they often apply "automatically", if the situation seems to call for it on the surface.

Between look-up tables and humans, we have in-context learners, like advanced language models. They can improve their performance on certain problems post-deployment, after being shown one or a few examples. Much like humans, they tailor their approach.

This is the value that seems to be increasing, from look-up tables to language models to humans to idealized agents. The more "agenty" a system is, the more it tailors its policy to the precise details of the situation it's in, rather than relying on pre-computed heuristics.

Turning it around, this implies that there are some environment properties that would make this necessary. What are those?

GPT-3 is often called a "shallow pattern-matcher". One of the stories of how it works is that it's an advanced search engine. It has a vast library of contexts and optimal behaviors in them, and whenever prompted with something, it finds the best matches, and outputs some weighted sum of them. Alternatively, it may be described as a library of human personas/writing styles, which treats prompts as information regarding which of these patterns it is supposed to roleplay.

That is "shallow pattern-matching". It looks at the surface reading of the situation, then acts like it's been taught to act in situations that looked like this. The more advanced a pattern-matcher gets, the deeper its cognitive algorithms may become, but it approaches agency from the top-down.

This trick will not work in contexts where the optimal policy can't be realistically "interpolated" from a relatively small pool of shallow situation-response pairs it's been taught. Certainly, we can imagine an infinite look-up table of such entries for any possible situation, pre-computed to infallibly impersonate an agent. But in practice, we're limited by practical constraints: ML models' memory size is not infinite, and neither are our training sets.

So, if we ask a shallow pattern-matcher to offer up a sufficiently good response to a sufficiently specific situation, in the problem domain where it doesn't have a vast enough library of pre-computed optimal responses, it will fail. And indeed: language models are notoriously bad at retaining consistency across sufficiently long episodes, or coherently integrating a lot of details. They perform well enough as long as there are many ways to perform well enough, or if the problem is simple. But not if we demand a *specific* and *novel* solution to a complex problem.

How can we translate these intuitions into math?

## 2. Formalisms

### 2.1. Foundations

I'll be using the baseline model architecture described in Agents Over Cartesian World Models. To quote:

There are four types: actions, observations, environmental states, and internal states. Actions and observations go from agent to environment and vice-versa. Environmental states are on the environment side, and internal states are on the agent side. Let $A$, $O$, $E$, $I$ refer to actions, observations, environmental states, and internal states.

We describe how the agent interfaces with the environment with four maps: observe, orient, decide, and execute.

- $\mathrm{Observe}: E \to \Delta O$ describes how the agent observes the environment, e.g., if the agent sees with a video camera, $\mathrm{Observe}$ describes what the video camera would see given various environmental states. If the agent can see the entire environment, the image of $\mathrm{Observe}$ is distinct point distributions. In contrast, humans can see the same observation for different environmental states.
- $\mathrm{Orient}: O \times I \to \Delta I$ describes how the agent interprets the observation, e.g., the agent's internal state might be memories of high-level concepts derived from raw data. If there is no historical dependence, $\mathrm{Orient}$ depends only on the observation. In contrast, humans map multiple observations onto the same internal state.
- $\mathrm{Decide}: I \to \Delta A$ describes how the agent acts in a given state, e.g., the agent might maximize a utility function over a world model. In simple devices like thermostats, $\mathrm{Decide}$ maps each internal state to one of a small number of actions. In contrast, humans have larger action sets.
- $\mathrm{Execute}: E \times A \to \Delta E$ describes how actions affect the environment, e.g., code that turns button presses into game actions. If the agent has absolute control over the environment, for all $e \in E$, the image of $\mathrm{Execute}(e, \cdot)$ is all point distributions over $E$. In contrast, humans do not have full control over their environments.

The primary goal is to open up the black box that is . I'll be doing it by making some assumptions about the environment and the contents of .

For clarity, let , , , and be individual members of the sets of actions, observations, environmental states, and internal states respectively, and be the function representing the agent.

### 2.2. Optimality

Let's expand the model by introducing the following function :

takes in the current environment-state and the action the agent took in response to it, and returns a value representing "how well" the agent is performing. is not a utility function, but is defined over one. It considers all possible trajectories through the environment's space-time available to the agent at a given point, calculates the total value the agent accrues in each trajectory, then returns the optimality of an action as the amount of value accrued in the best trajectory of which this action is part.

Formally: Let be the set of all possible policies that start with taking an action in the environmental state , and represent the utility function defined over the environment.^{[1]} Given an initial environment-state , the optimality of taking an action is:

It functions as a utility function in some contexts, such as myopic agents with high time discounting , one-shot regimes where the agent needs to solve the problem in one forward pass, and so on.

Moving on. For a given environment-state , we can define the optimal action, as follows:

Building off of that, for any given environment-state and a minimal performance level , we define a set of *near-optimal actions*:

Intuitively, an action is near-optimal if taking it leaves the agent with the ability to capture at least a fraction of all future value. As a rule of thumb, we want that fraction fairly high, .

A specific action from the set of near-optimal actions shall be denoted .
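To make the definitions above concrete, here is a minimal sketch on a toy deterministic environment with a finite horizon. Everything in it is an assumption made for illustration: the `TRANSITIONS` table, the state and action names, the horizon, and the choice to read "at least a fraction of all future value" as a threshold of `r` times the best achievable value (which only makes sense for non-negative utilities).

```python
from functools import lru_cache

# Hypothetical toy environment: TRANSITIONS[state][action] = (next_state, utility gained).
TRANSITIONS = {
    "start": {"left": ("pit", 0.0), "right": ("road", 1.0), "wait": ("start", 0.1)},
    "road": {"left": ("start", 0.0), "right": ("goal", 5.0), "wait": ("road", 0.1)},
    "pit": {"left": ("pit", 0.0), "right": ("pit", 0.0), "wait": ("pit", 0.0)},
    "goal": {"left": ("goal", 1.0), "right": ("goal", 1.0), "wait": ("goal", 1.0)},
}


@lru_cache(maxsize=None)
def best_value(state, horizon):
    """Total utility of the best trajectory of length `horizon` starting in `state`."""
    if horizon == 0:
        return 0.0
    return max(optimality(state, action, horizon) for action in TRANSITIONS[state])


def optimality(state, action, horizon):
    """Utility accrued along the best trajectory that begins with `action`."""
    next_state, utility = TRANSITIONS[state][action]
    return utility + best_value(next_state, horizon - 1)


def optimal_action(state, horizon):
    """The action with the highest optimality in `state`."""
    return max(TRANSITIONS[state], key=lambda a: optimality(state, a, horizon))


def near_optimal_actions(state, horizon, r):
    """Actions that keep at least a fraction `r` of the best achievable value.

    Assumes non-negative utilities, so that `r * best` is a sensible threshold.
    """
    best = best_value(state, horizon)
    return {a for a in TRANSITIONS[state] if optimality(state, a, horizon) >= r * best}


print(optimal_action("start", 3))
print(near_optimal_actions("start", 3, 0.8))
```

Lowering `r` widens the near-optimal set: in this toy case, waiting one step before heading right becomes acceptable once the required fraction drops far enough.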

Now, we can state the core assumption of this model. For two random environment-states and and some distance function , we have:

That is: the more similar two situations are, the more likely it is that if an action does well in one of them, it would do well in the other.
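As a sanity check on this intuition, here is a toy sketch in which the claim can be verified by brute force. The environment is an assumption invented for illustration: states are positions on a line, the distance is the absolute difference, and the near-optimal action is simply "move towards `GOAL`". The function `share_rate` is a hypothetical helper that measures how often two states within a given distance share a near-optimal action.

```python
from itertools import combinations

GOAL = 50            # hypothetical goal position
STATES = range(100)  # hypothetical 1-D environment-states


def near_optimal(state):
    """Near-optimal actions in this toy environment: head towards the goal."""
    if state < GOAL:
        return {"right"}
    if state > GOAL:
        return {"left"}
    return {"stay"}


def share_rate(max_distance):
    """Fraction of state pairs within `max_distance` that share a near-optimal action."""
    pairs = [(s, t) for s, t in combinations(STATES, 2) if abs(s - t) <= max_distance]
    shared = sum(1 for s, t in pairs if near_optimal(s) & near_optimal(t))
    return shared / len(pairs)


for d in (1, 10, 99):
    print(d, round(share_rate(d), 3))
```

In this environment the share rate falls as the allowed distance grows, which is the qualitative behavior the assumption asks for. A different environment or distance metric could easily violate it.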

Note that this property places some heavy constraints on the model of the environment and the distance metric we're using.

Some intuitions. Imagine the following three environment-states:

- Our universe at the moment of an AGI's creation; AGI is created by humanity.
- Our universe at the moment of an AGI's creation, except all the galaxies beyond the Solar System have been randomly shuffled around.
- Our universe at the moment of an AGI's creation, except the AGI (with the same values) is created by a non-human alien civilization that evolved in place of humans.

For a purely physical, low-level distance metric, (1) and (2) would be much more different than (1) and (3) (assuming no extra-terrestrial aliens). And yet, the optimal trajectories in (1) and (2) would be basically the same for a certain abstract format of "actions" (the only difference would be in the specific directions the AGI sends out Von Neumann ships after it takes over the world), whereas the optimal trajectories for (1) and (3) would start out wildly different, depending on the specifics of the AGI's situation (what social/cognitive systems it would need to model and subvert, which manipulation tactics to use).

I'm not sure what combinations of can be used here.

Suitable environment-models in general seem to be a bottleneck on our ability to adequately model agents. At the very least, we'd want some native support for multi-level models and natural abstractions.

### 2.3. The Training Set

Next, we turn to the agent proper. We introduce the set of "optimal policies" . For every environment-state , it holds a 2-tuple , where and .

, thus, is the complete set of instructions for how a near-optimal agent with a given utility function over the environment should act in response to any given observation. Note that the behavior of an agent blindly following it would not be optimal unless its sensors can record all of the information about the environment. There would be collisions, in which different environment-states map to the same observation; a simple instruction-executor would not be able to tell them apart.

The training set is some set of initial optimal policies we'll "load" into the agent during training. , where is upper-bounded by practical constraints, such as the agent's memory size, the training time, the compute costs, difficulties with acquiring enough training data, and so on.

Formally, it's assumed that is an -tuple, and , so that every is accessible to the agent's and functions.

To strengthen some conclusions, we'll assume that is well-chosen to cover as much of the environment as possible given the constraint posed by :

That is, all of the policy examples stored in memory are for situations maximally different from each other. Even if that's implausible for actual training sets, we may assume that this optimization is done by SGD. Given a poorly-organized training set, where solutions for some sub-areas of the environment space are over-represented, SGD would pick and choose, discarding the redundant ones.
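For intuition about what "maximally different from each other" could look like operationally, here is a sketch using greedy farthest-point selection. This is a standard heuristic swapped in for illustration, not a claim about what SGD does and not an exact solution of the coverage objective. The function name `farthest_point_subset`, the tuple-valued states, and the Euclidean distance are all assumptions.

```python
import math


def farthest_point_subset(states, n, dist=math.dist):
    """Greedily pick up to `n` states that are far apart from each other."""
    chosen = [states[0]]
    # min_dist[i] = distance from states[i] to its nearest already-chosen state.
    min_dist = [dist(s, chosen[0]) for s in states]
    while len(chosen) < min(n, len(states)):
        # Take the state that is currently worst-covered by the chosen set.
        idx = max(range(len(states)), key=min_dist.__getitem__)
        chosen.append(states[idx])
        min_dist = [min(d, dist(s, states[idx])) for d, s in zip(min_dist, states)]
    return chosen


# Two near-duplicate clusters plus one outlier: redundant entries get dropped.
candidates = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (10.0, 0.0)]
print(farthest_point_subset(candidates, 3))
```

With a budget of three, the sketch keeps one representative per region and discards the near-duplicates, mirroring the "pick and choose" behavior described above.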

Finally, for any given and a distance variable , we define a set of *reference policies*:

Computing the full set is, again, intractable for any sufficiently complex environment, so I'll be using to refer to any practically-sized sample of it.

Furthermore, to bring the model a little closer to how NNs seem to behave in reality, and to avoid some silliness near the end, we'll define the set of *reference actions*. Letting be a function that takes the second element of a tuple,

That is, it's just some reference policy set with the observation data thrown out.

### 2.4. Agent Architecture

Now, let's combine the formalisms from 2.2 and 2.3.

From (1), we know that knowing the optimal policy for a particular situation gives us information about optimal policies for sufficiently similar situations. Our agent starts out loaded with a number of such optimal policies, but tagged with subjective descriptions of environment-states, not the environment-states themselves. Nonetheless, it seems plausible that there are ways to use them to figure out what to do in unfamiliar-looking situations.

Let's decompose as follows:

is the function that does what we're looking for. It accepts a set of actions, presumed to be a bounded set of reference actions . It returns the action the agent is to take.

The main constraint on , which seems plausible from (1), is as follows:

Informally, is some function which, given a set of actions that perform well in some situations similar to this one, tries to generate a good response to the current situation. And it's more likely to succeed the more similar the reference situations are to the current one.

By implication, for some ,

The shorter the distance , the more likely it is that the agent will perform well.

Suppose we've picked some that would allow our agent to arrive at some desired level of performance with high probability. That is, for a random environment state ,

The question is: how does compute ; and before it, ?

### 2.5. Approximating the Reference Policies Set

**i.** In the most primitive case, it may be a simple lookup table. can function by comparing the observation (which presumably can be extracted from the internal state) against a list of stored observation-action pairs, then outputting the closest match. Letting be a function which retrieves the first element of a tuple, we get:

The follow-up computations are trivial: discard the observation data, then retrieve the only member of the set, and that's the agent's output.
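A minimal sketch of this lookup-table case, assuming observations are numeric tuples compared with Euclidean distance. The `MEMORY` contents and the name `lookup_agent` are hypothetical.

```python
import math

# Hypothetical training set: (observation, action) pairs loaded during training.
MEMORY = [
    ((0.0, 0.0), "stay"),
    ((1.0, 0.0), "move_right"),
    ((0.0, 1.0), "move_up"),
]


def lookup_agent(observation, memory=MEMORY, dist=math.dist):
    """Return the action stored with the closest-matching observation."""
    closest = min(memory, key=lambda pair: dist(pair[0], observation))
    # Discard the observation data and output the only remaining member: the action.
    return closest[1]


print(lookup_agent((0.9, 0.1)))
```

The agent never reasons about the environment-state itself. It only matches surface observations, which is what makes the choice of distance function matter so much.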

Note that the distance function used to estimate the similarity of observations isn't necessarily the same as the one for environment-states, given that the data formats of and are likely different.

The underlying assumption here is that the more similar the observations, the more likely it is that the underlying environment-states are similar as well.

That is, for a given ,

Not unlike (1), this puts some constraints on what we mean by "observation". Having your eyes open or closed leads to a very large difference in observations, but a negligible difference in the states of the world. As such, "observation" has to, at the least, mean a certain *collection* of momentary sense-data relevant to the context in which we're taking an action, as opposed to every direct piece of sense-data we receive moment-to-moment.

For a language model, at least, the definition is straightforward: an "observation" is the entire set of tokens it's prompted with.

**ii.** A more sophisticated technique is to more fully approximate by computing the following:

That is, it assembles the set of all situations that looked "sufficiently similar" to the agent's observation of the current situation.

would be doing something more complex in this case, somehow "interpolating" the optimal action from the set of reference actions. This, at least, seems to be how GPT-3 works: it immediately computes some prospective outputs, dumps the input, and starts refining the outputs.
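A rough sketch of this second case follows. The retrieval step collects every stored entry whose observation lies within a radius, throws the observations away, and hands the bare actions to an interpolation step. Majority voting is used here purely as a stand-in for "interpolating", since the text does not specify the mechanism. The names `reference_actions` and `interpolate`, the example memory, and the radius values are assumptions.

```python
import math
from collections import Counter


def reference_actions(observation, memory, radius, dist=math.dist):
    """Actions of all stored pairs whose observation is within `radius`."""
    return [action for obs, action in memory if dist(obs, observation) <= radius]


def interpolate(actions):
    """Stand-in for interpolation: the most common reference action, or None if empty."""
    if not actions:
        return None
    return Counter(actions).most_common(1)[0][0]


# Hypothetical memory of (observation, action) pairs.
memory = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.0), "b"),
    ((0.2, 0.0), "b"),
    ((5.0, 5.0), "c"),
]

refs = reference_actions((0.1, 0.05), memory, radius=0.5)
print(refs, interpolate(refs))
```

If no stored observation falls within the radius, the reference set is empty and this kind of agent has nothing to interpolate from.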

**iii.** A still more advanced version implements few-shot learning. The following changes to the agent's model would need to be made:

- (a) Assume that the agent has post-training memory to which it can write additional entries based on its experience.
- (b) Modify so that it can affect the internal state as well, writing into it the action the model takes.
- (c) Introduce some mechanism by which the model can evaluate its performance.
- (d) Add a function that looks at the resulting 3-tuples , and appends to if the behavior is optimal enough.

I won't write all of this out, since it's trivial yet peripheral to my point, but I'll mention it. It's a powerful technique: it's pretty likely that the computed this way would have a short to the next environment-state the agent would find itself in — inasmuch as the agent would already be getting policy data about the specific situation it's in.

(c) is the only challenging bit. A proper implementation of it would make the model a mesa-optimizer. It can be circumvented in various ways, however. For example, the way GPT-3 does it: it essentially assumes that whatever it does is the optimal action, foregoing a reward signal.
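The four modifications can be sketched as a small class. This is an illustrative toy, not a description of how any real language model is implemented: `FewShotAgent`, the optional `evaluate` scorer, and the `threshold` are hypothetical names. When no evaluator is supplied, the agent assumes whatever it did was good enough, which is the shortcut attributed to GPT-3 above.

```python
import math


class FewShotAgent:
    def __init__(self, memory, evaluate=None, threshold=0.8):
        self.memory = list(memory)  # (observation, action) pairs; grows after deployment
        self.evaluate = evaluate    # hypothetical scorer: (observation, action) -> [0, 1]
        self.threshold = threshold

    def act(self, observation):
        # Nearest-neighbour retrieval over everything currently in memory.
        _, action = min(self.memory, key=lambda p: math.dist(p[0], observation))
        # Without an evaluator, treat the chosen action as optimal by assumption.
        score = 1.0 if self.evaluate is None else self.evaluate(observation, action)
        if score >= self.threshold:
            self.memory.append((observation, action))
        return action


trusting = FewShotAgent([((0.0,), "a"), ((10.0,), "b")])
print(trusting.act((4.0,)), trusting.act((6.0,)))

strict = FewShotAgent([((0.0,), "a"), ((10.0,), "b")], evaluate=lambda o, a: 0.0)
print(strict.act((4.0,)), strict.act((6.0,)))
```

The two agents diverge on the second observation: the trusting one has already written its first response into memory, so its own recent behavior becomes the nearest reference.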

**iv.** But what if ? That is, what if *none* of the reference policies loaded into the agent's memory are for situations "sufficiently similar" to the one the agent is facing now?

Furthermore, what if this state of affairs is endemic to a given environment?

(4) In detail:

- We pick some environment and a minimal performance level . (1) holds in this environment.
- We take an ML model with a memory size , and stuff it to the gills with reference policies , picked to cover as many environment-states as possible, as per (2).
- The above setup corresponds to some underlying distance , such that: In this environment, for a random environment-state , the reference policies need to be intended for situations not farther than from it, in order for to be able to shallowly interpolate the optimal policy for it.
- However, .

How can SGD/evolution solve this?

The following structure seems plausible. Decompose as follows:

- . This function takes in the current internal state, which includes the lookup table of optimal policies , the current observation, and plausibly a history of other recent observations. It returns the probability distribution over the current environment-state.
- . This function takes in the probability distribution over the current environment-state, and computes the optimal response to some environment state sampled from that distribution. Alternatively, it may return a *set* of such responses, sampling from the distribution several times. A bounded set of reference actions populated by such entries would have .

Plugging it into (3), we get:

Thus, for situations meeting the conditions of (4), an agent would face a pressure to improve its ability to develop accurate models of the environment.
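The two-part decomposition can be illustrated with a small Bayesian sketch: one function maintains a probability distribution over environment-states given a history of observations, and a second turns that distribution into a response. Several things here are assumptions made for the example: the discrete state space, the explicit likelihood table, the names `posterior` and `plan`, and the use of the most probable state as a deterministic stand-in for sampling from the distribution.

```python
def posterior(prior, likelihood, observations):
    """Bayes-update a distribution over environment-states on a history of observations.

    prior: {state: probability}
    likelihood: {state: {observation: probability of seeing it in that state}}
    """
    belief = dict(prior)
    for obs in observations:
        belief = {s: p * likelihood[s].get(obs, 0.0) for s, p in belief.items()}
        total = sum(belief.values())
        if total == 0.0:
            raise ValueError("observation history is impossible under the model")
        belief = {s: p / total for s, p in belief.items()}
    return belief


def plan(belief, solve):
    """Respond to the most probable state (a deterministic stand-in for sampling)."""
    likely_state = max(belief, key=belief.get)
    return solve(likely_state)


# Hypothetical two-state world in which the observations are only noisy evidence.
prior = {"door_locked": 0.5, "door_open": 0.5}
likelihood = {
    "door_locked": {"rattle": 0.8, "creak": 0.2},
    "door_open": {"rattle": 0.1, "creak": 0.9},
}
solve = {"door_locked": "find_key", "door_open": "walk_through"}.get

belief = posterior(prior, likelihood, ["creak", "creak"])
print(belief, plan(belief, solve))
```

The better the likelihood model matches the real environment, the closer the inferred state is to the true one, which is the sense in which accuracy of the world-model is under pressure to improve.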

## 3. Closing Thoughts

There are a few different ways to spread the described functionality around. For example, we may imagine that modelling and optimization happen in : that just "primitively" assembles a reference set based on observations, and variants of and are downstream of it. In terms of telling an incremental story, this might actually be the *better* framing. Also, I probably under-used the function; some of the described functionality might actually go into it.

None of this seems to change the conclusion, though. We can define some simple, intuitive environment conditions under which shallow pattern-matching won't work, and in them, *some* part of the agent has to do world-modelling.

From another angle, the purpose of this model is to provide prospective desiderata for *environment-models* suitable for modeling agents. As I'd mentioned in 2.2, I think the lack of good ways to model environments is one of the main barriers on our ability to model agency/goal-directedness.

A weaker result is suggested by . The definition of an optimal action performs an explicit search for actions that best maximize some value function. It would be natural to expect that would replicate this mechanic, implying mesa-optimization.

I'm hesitant to make that leap, however. There may be other ways for to extract the correct action from a given environment-state, acting more like a calculator. While the need for a world-model in some conditions appears to be a strong consequence of this formalism, the same cannot be said for goal-driven search. Deriving the necessary environment conditions for mesa-optimization, thus, requires further work.

## Appendix

These are a few formulae that seemed like an obvious thing to derive, but which turned out so messy that I'm not sure they're useful.

**1.** We can measure the "generalizability" of an environment. That is: Given an environment and a desired level of performance , what is the "range" centered on a random state within which lie environment-states such that the optimal policies for these states are likely near-optimal for the initial state ?

The answer, as far as I can tell, is this monstrosity:

Where is an environment-state sampled from the uniform distribution over the set of states not farther than from .

For a given , this function can be used to measure the intuition I'd referred to back in Part 1: how "tailor-made" do policies need to be to do well in a given environment?

**2.** Suppose we have an -sized training set , and the set of corresponding environment-states , such that every policy in is optimal for some environment-state in . We can define the average distance from a random environment-state to one of the states in :

**3.** And now, for a given environment , a training set of size , and a target performance , we can theoretically figure out whether the ML model would need to develop a world-model. The condition is simple:

This, in theory, is a powerful tool. But the underlying functions seem quite... unwieldy. Faster/simpler approximations would need to be found before it's anywhere close to practically applicable.
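As one example of a cheaper approximation, the average-distance quantity can be estimated by Monte Carlo sampling and then compared against a tolerable distance. This sketch rests on several assumptions: that "distance to one of the states" means distance to the nearest training state, that states are numeric tuples under Euclidean distance, and that the tolerable distance is simply given rather than derived. The names `mean_nearest_distance` and `needs_world_model` are hypothetical.

```python
import math
import random


def mean_nearest_distance(sample_state, train_states, trials=10_000, seed=0):
    """Monte Carlo estimate of the mean distance from a random state to its
    nearest training state (the 'nearest' reading is an assumption)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        state = sample_state(rng)
        total += min(math.dist(state, t) for t in train_states)
    return total / trials


def needs_world_model(mean_distance, tolerable_distance):
    """The condition: stored policies are, on average, too far away to reuse."""
    return mean_distance > tolerable_distance


# Hypothetical 1-D environment: states uniform on [0, 1], two stored training states.
train_states = [(0.25,), (0.75,)]
estimate = mean_nearest_distance(lambda rng: (rng.random(),), train_states)
print(estimate, needs_world_model(estimate, tolerable_distance=0.05))
```

For this toy setup the exact mean is 0.125, so the estimate can be checked by hand. In a realistic environment the hard part is the tolerable distance, which this sketch does not attempt to compute.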

^{[1]} Expanding the model to support other kinds of utility functions from Agents Over Cartesian World Models seems possible, if a bit fiddly.

Hey Thane, interesting stuff! Any chance you read my recent things on 'deliberation'? It feels like we're interested in similar questions (not surprising, as we've both been speaking to John and taken inspiration from him and from Scott G's (A→B)→A) but approaching from different perspectives (I'm sort of trying to look at the bit 'just after' the 'world model'). You might find it interesting or helpful.