What Environment Properties Select Agents For World-Modeling?

Oliver Sourbut

Hey Thane, interesting stuff! Any chance you read my recent things on 'deliberation'? It feels like we're interested in similar questions^{[1]} but approaching from different perspectives (I'm sort of trying to look at the bit 'just after' the 'world model'). You might find it interesting or helpful.

not surprising as we've both been speaking to John and taken inspiration from him and from Scott G's ↩︎

Thanks to John Wentworth for helpful critique.

## 0. Introduction

Agency/goal-driven behavior is a complex thing. By definition, it implies a system that is deliberately choosing to take actions that will achieve the goal it wants. That requires, at the least, an internalized "goal", a world-model, a mechanism that allows the world-model to be run iteratively as the agent searches for optimal actions, and so on.

But taking some intuitive, idealized, pre-established notion of "agency", and translating these intuitions into formalisms, is somewhat backwards. If our goal is to understand the kinds of AI we'll get, a more natural question seems to be: "what properties of the environment make goal-driven behavior *necessary*, and what shape do the internal mechanisms for it *need* to take?".

Going further, we can notice that SGD and evolution are both incremental processes, and agency is too complex to develop from scratch. So we can break the question down further: "what properties of the environment make the *building blocks* of agency necessary?". How are these building blocks useful in the intermediary stages, before they're assimilated into the agency engine? What pressures these building blocks to incrementally improve, until they become precise/advanced enough to serve the needs of agency?

This post attempts to look at the incremental development of one of these building blocks, world models, and at the environment conditions that make them necessary and incentivize improving their accuracy.

## 1. Intuitions

When trying to identify the direction of incremental improvement, it helps to imagine two concrete examples at different ends of the spectrum, then try to figure out the intermediary stages between them.

At one end of the spectrum, we have glorified look-up tables. Their entire policy is loaded into them during development/training, and once they're deployed, it's not modified.

At the other end of the spectrum, we have idealized agents. Idealized agents model the environment and adapt their policy to it, on a moment-to-moment basis, trying to perform well in the specific situation they're in.

In-between, we have things like humans, who use intensive action-optimization reminiscent of idealized agents *in tandem* with heuristics and built-in instincts, which they often apply "automatically", if the situation seems to call for it on the surface.

Between look-up tables and humans, we have in-context learners, like advanced language models. They can improve their performance on certain problems post-deployment, after being shown one or a few examples. Much like humans, they tailor their approach.

This is the value that seems to be increasing, from look-up tables to language models to humans to idealized agents. The more "agenty" a system is, the more it tailors its policy to the precise details of the situation it's in, rather than relying on pre-computed heuristics.

Turning it around, this implies that there are some environment properties that would make this necessary. What are those?

GPT-3 is often called a "shallow pattern-matcher". One of the stories of how it works is that it's an advanced search engine. It has a vast library of contexts and optimal behaviors in them, and whenever prompted with something, it finds the best matches, and outputs some weighted sum of them. Alternatively, it may be described as a library of human personas/writing styles, which treats prompts as information regarding which of these patterns it is supposed to roleplay.

That is "shallow pattern-matching". It looks at the surface reading of the situation, then acts like it's been taught to act in situations that looked like this. The more advanced a pattern-matcher gets, the deeper its cognitive algorithms may become, but it approaches agency from the top-down.

This trick will not work in contexts where the optimal policy can't be realistically "interpolated" from a relatively small pool of shallow ⟨situation,response⟩ pairs it's been taught. Certainly, we can imagine an infinite look-up table of such entries for any possible situation which were pre-computed to infallibly impersonate an agent. But in practice, we're constrained by practicality: ML models' memory size is not infinite, and neither are our training sets.

So, if we ask a shallow pattern-matcher to offer up a sufficiently good response to a sufficiently specific situation, in the problem domain where it doesn't have a vast enough library of pre-computed optimal responses, it will fail. And indeed: language models are notoriously bad at retaining consistency across sufficiently long episodes, or coherently integrating a lot of details. They perform well enough as long as there are many ways to perform well enough, or if the problem is simple. But not if we demand a *specific* and *novel* solution to a complex problem.

How can we translate these intuitions into math?

## 2. Formalisms

## 2.1. Foundations

I'll be using the baseline model architecture described in Agents Over Cartesian World Models. To quote:

The primary goal is to open up the black box that is decide. I'll be doing it by making some assumptions about the environment and the contents of I.

For clarity, let a∈A, o∈O, e∈E, and i∈I be individual members of the sets of actions, observations, environmental states, and internal states respectively, and agent:=decide∘orient∘observe be the function representing the agent.

## 2.2. Optimality

Let's expand the model by introducing the following function:

$$\text{optimality}: E \times A \to \mathbb{R}$$

optimality takes in the current environment-state and the action the agent took in response to it, and returns a value representing "how well" the agent is performing. optimality is not a utility function, but is defined over one. It considers all possible trajectories through the environment's space-time available to the agent at a given point, calculates the total value the agent accrues in each trajectory, then returns the optimality of an action as the amount of value accrued in the best trajectory of which this action is part.

Formally: Let $\Pi_{e_i \to a_k}$ be the set of all possible policies that start with taking an action $a_k$ in the environmental state $e_i$, and $\text{utility}: E \to \mathbb{R}$ represent the utility function defined over the environment.

Given an initial environment-state $e_i$, the optimality of taking an action $a_k$ is:^{[1]}

$$\text{optimality}(e_i, a_k) := \max_{\pi \in \Pi_{e_i \to a_k}} \left( \sum_{t=0}^{\infty} \lambda^t \, \mathbb{E}_{e \sim \pi|t}\left[\text{utility}(e)\right] \right)$$

It functions as a utility function in some contexts, such as myopic agents with high time discounting $\lambda$, one-shot regimes where the agent needs to solve the problem in one forward pass, and so on.

Moving on. For a given environment-state $e_i$, we can define the optimal action as follows:

$$a_{\text{opt}|e_i} \in A \;|\; \forall a_k \in A: \text{optimality}(e_i, a_{\text{opt}|e_i}) \geq \text{optimality}(e_i, a_k)$$

Building off of that, for any given environment-state $e_i$ and a minimal performance level $r \in [0;1]$, we define a set of near-optimal actions:

$$A^r_{\text{n-o}|e_i} := \left\{ a_k \in A \;\middle|\; \forall a_k: \frac{\text{optimality}(e_i, a_k)}{\text{optimality}(e_i, a_{\text{opt}|e_i})} \geq r \right\}$$

Intuitively, an action is near-optimal if taking it leaves the agent with the ability to capture at least a fraction $r$ of all future value. As a rule of thumb, we want that fraction fairly high, $r \approx 1$.

A specific action from the set of near-optimal actions shall be denoted $a^r_{\text{n-o}|e_i} \in A^r_{\text{n-o}|e_i}$.
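As a rough illustration of these definitions, here is a minimal Python sketch on a toy environment. Everything in it is an assumption made for the example and not part of the model above: the five-state line world, the `UTILITY` table, the deterministic `step` function (which lets us drop the expectation), and the finite `HORIZON` used to truncate the infinite sum.

```python
from functools import lru_cache

# Hypothetical toy environment: states 0..4 on a line, deterministic moves.
ACTIONS = (-1, 0, 1)
UTILITY = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 10.0}  # utility: E -> R
LAMBDA = 0.9   # discount factor
HORIZON = 6    # finite truncation of the infinite sum (an assumption)


def step(e, a):
    """Deterministic transition; the agent cannot leave the line."""
    return min(max(e + a, 0), 4)


@lru_cache(maxsize=None)
def value(e, h):
    """Best discounted utility obtainable from state e with h steps left."""
    if h == 0:
        return UTILITY[e]
    return UTILITY[e] + LAMBDA * max(value(step(e, a), h - 1) for a in ACTIONS)


def optimality(e, a, horizon=HORIZON):
    """Value of the best trajectory that starts by taking action a in state e."""
    return UTILITY[e] + LAMBDA * value(step(e, a), horizon - 1)


def near_optimal_actions(e, r):
    """Actions that retain at least a fraction r of the best achievable value."""
    best = max(optimality(e, a) for a in ACTIONS)
    return {a for a in ACTIONS if optimality(e, a) / best >= r}
```

In this toy world, `near_optimal_actions(3, 0.99)` contains only the move toward the high-utility state, while lowering `r` admits progressively worse actions. The ratio only makes sense here because the utilities are non-negative and the best value is positive.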

Now, we can state the core assumption of this model. For two random environment-states $e_i$ and $e_k$ and some distance function $\text{DIST}_E(\cdot \| \cdot)$, we have:

$$P\left(a^r_{\text{n-o}|e_k} \in A^r_{\text{n-o}|e_i}\right) \propto \frac{1}{\text{DIST}_E(e_i \| e_k)} \tag{1}$$

That is: the more similar two situations are, the more likely it is that if an action does well in one of them, it would do well in the other.

Note that this property places some heavy constraints on the model of the environment and the distance metric we're using.

Some intuitions. Imagine the following three environment-states:

For a purely physical, low-level distance metric, (1) and (2) would be much more different than (1) and (3) (assuming no extra-terrestrial aliens). And yet, the optimal trajectories in (1) and (2) would be basically the same for a certain abstract format of "actions" (the only difference would be in the specific directions the AGI sends out Von Neumann ships after it takes over the world), whereas the optimal trajectories for (1) and (3) would start out wildly different, depending on the specifics of the AGI's situation (what social/cognitive systems it would need to model and subvert, which manipulation tactics to use).

I'm not sure what combinations of ⟨environment model,distance function⟩ can be used here.

Suitable environment-models in general seem to be a bottleneck on our ability to adequately model agents. At the very least, we'd want some native support for multi-level models and natural abstractions.

## 2.3. The Training Set

Next, we turn to the agent proper. We introduce the set of "optimal policies" $C$. For every environment-state $e_i$, it holds a 2-tuple $c_i \in C := \langle o_i, a^r_{\text{n-o}|e_i} \rangle$, where $o_i := \text{observe}(e_i)$ and $a^r_{\text{n-o}|e_i} \in A^r_{\text{n-o}|e_i}$.

$C$, thus, is the complete set of instructions for how a near-optimal agent with a given utility function over the environment should act in response to any given observation. Note that the behavior of an agent blindly following it would not be optimal unless its sensors can record all of the information about the environment. There would be collisions, in which different environment-states map to the same observation; a simple instruction-executor would not be able to tell them apart.

The training set CT⊂C is some set of initial optimal policies we'll "load" into the agent during training. |CT|=M, where M is upper-bounded by practical constraints, such as the agent's memory size, the training time, the compute costs, difficulties with acquiring enough training data, and so on.

Formally, it's assumed that I is an n-tuple, and CT∈I, so that every ci∈CT is accessible to the agent's orient and decide functions.

To strengthen some conclusions, we'll assume that $C_T$ is well-chosen to cover as much of the environment as possible given the constraint posed by $M$:

$$\forall c_i, c_k \in C_T: \text{DIST}_E(e_i \| e_k) = \text{const} \tag{2}$$

That is, all of the policy examples stored in memory are for situations maximally different from each other. Even if it's implausible for actual training sets, we may assume that this optimization is done by SGD. Given a poorly-organized training set, where solutions for some sub-areas of the environment space are over-represented, SGD would pick and choose, discarding the redundant ones.

Finally, for any given $e_i$ and a distance variable $d \in [0;+\infty)$, we define a set of reference policies:

$$C_{d|e_i} \subset C := \left\{ c_r \in C \;\middle|\; \forall c_r: \text{DIST}_E(e_i \| e_r) \leq d \right\}$$

Computing the full set $C_{d|e_i}$ is, again, intractable for any sufficiently complex environment, so I'll be using $C^*_{d|e_i} \subset C_{d|e_i}$ to refer to any practically-sized sample of it.

Furthermore, to bring the model a little closer to how NNs seem to behave in reality, and to avoid some silliness near the end, we'll define the set of reference actions. Letting second be a function that takes the second element of a tuple,

$$A^*_{d|e_i} := \left\{ \text{second}(c_r) \;\middle|\; \forall c_r \in C^*_{d|e_i} \right\}$$

That is, it's just some reference policy set with the observation data thrown out.

## 2.4. Agent Architecture

Now, let's combine the formalisms from 2.2 and 2.3.

From (1), we know that knowing the optimal policy for a particular situation gives us information about optimal policies for sufficiently similar situations. Our agent starts out loaded with a number of such optimal policies, but tagged with subjective descriptions of environment-states, not the environment-states themselves. Nonetheless, it seems plausible that there are ways to use them to figure out what to do in unfamiliar-looking situations.

Let's decompose decide as follows:

act is the function that does what we're looking for. It accepts a set of actions, presumed to be a bounded set of reference actions $A^*_{d|e_i}$. It returns the action the agent is to take.

The main constraint on act, which seems plausible from (1), is as follows:

$$P\left(\text{act}(A^*_{d|e_i}) \in A^r_{\text{n-o}|e_i}\right) \propto \frac{1}{d}$$

Informally, act is some function which, given a set of actions that perform well in some situations similar to this one, tries to generate a good response to the current situation. And it's more likely to succeed the more similar the reference situations are to the current one.

By implication, for some $r$,

$$P\left(\frac{\text{optimality}(e_i, \text{agent}(o_i))}{\text{optimality}(e_i, a_{\text{opt}|e_i})} \geq r\right) \propto \frac{1}{d} \tag{3}$$

The shorter the distance $d$, the more likely it is that the agent will perform well.

Suppose we've picked some $d'$ that would allow our agent to arrive at some desired level of performance $r$ with high probability. That is, for a random environment state $e_i$,

$$P\left(\text{act}(A^*_{d|e_i}) \in A^r_{\text{n-o}|e_i} \;\middle|\; d = d'\right) \approx 1$$

The question is: how does $f: I \to A$ compute $A^*_{d|e_i}$; and before it, $C^*_{d|e_i}$?

## 2.5. Approximating the Reference Policies Set

i. In the most primitive case, it may be a simple lookup table. $f$ can function by comparing the observation (which presumably can be extracted from the internal state) against a list of stored observation-action pairs, then outputting the closest match. Letting first be a function which retrieves the first element of a tuple, we get:

$$C^*_{d|e_i} \approx \left\{ c_r \in C_T \;\middle|\; \min_{c_r \in C_T}\left(\text{DIST}_O(o_i \| \text{first}(c_r))\right) \right\}$$

The follow-up computations are trivial: discard the observation data, then retrieve the only member of the set, and that's the agent's output.
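The lookup-table case can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: observations are small numeric tuples, `dist_o` stands in for $\text{DIST}_O$ as plain Euclidean distance, and the three entries in `TRAINING_SET` are made up.

```python
import math

# Hypothetical training set C_T: (observation, near-optimal action) pairs.
TRAINING_SET = [
    ((0.0, 0.0), "wait"),
    ((1.0, 0.0), "move_right"),
    ((0.0, 1.0), "move_up"),
]


def dist_o(o1, o2):
    """Stand-in for DIST_O: Euclidean distance between observation vectors."""
    return math.dist(o1, o2)


def lookup_agent(observation, training_set=TRAINING_SET):
    """Return the stored action whose observation is closest to the input."""
    _, action = min(training_set, key=lambda c: dist_o(observation, c[0]))
    return action
```

Note that the agent always returns something, however far the closest stored observation is; nothing in this scheme signals that the match is poor.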

Note that the distance function DISTO used to estimate the similarity of observations isn't necessarily the same as the one for environment-states, given that the data formats of E and O are likely different.

The underlying assumption here is that the more similar the observations, the more likely it is that the underlying environment-states are similar as well.

That is, for a given $d$,

$$P\left(\text{DIST}_E(e_i \| e_k) \leq d \;\middle|\; \text{DIST}_O(o_i \| o_k) \leq d_o\right) \propto \frac{1}{d_o}$$

Not unlike (1), this puts some constraints on what we mean by "observation". Having your eyes open or closed leads to a very large difference in observations, but a negligible difference in the states of the world. As such, "observation" has to, at the least, mean a certain *collection* of momentary sense-data relevant to the context in which we're taking an action, as opposed to every direct piece of sense-data we receive moment-to-moment.

For a language model, at least, the definition is straightforward: an "observation" is the entire set of tokens it's prompted with.

ii. A more sophisticated technique is to more fully approximate $C_{d|e_i} \cap C_T$ by computing the following:

$$C^*_{d|e_i} \approx \left\{ c_r \in C_T \;\middle|\; \forall c_r: \text{DIST}_O(o_i \| \text{first}(c_r)) \leq d \right\}$$

That is, it assembles the set of all situations that looked "sufficiently similar" to the agent's observation of the current situation.

act would be doing something more complex in this case, somehow "interpolating" the optimal action from the set of reference actions. This, at least, seems to be how GPT-3 works: it immediately computes some prospective outputs, dumps the input, and starts refining the outputs.
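A minimal Python sketch of this second variant follows. The post does not say how act interpolates, so the averaging used here is purely an assumption for illustration, as are the numeric actions, the Euclidean `dist_o`, and the made-up `TRAINING_SET`.

```python
import math
from statistics import mean

# Hypothetical training set C_T: (observation, numeric action) pairs.
TRAINING_SET = [
    ((0.0, 0.0), 0.0),
    ((1.0, 0.0), 1.0),
    ((1.0, 1.0), 2.0),
    ((5.0, 5.0), 9.0),
]


def dist_o(o1, o2):
    """Stand-in for DIST_O: Euclidean distance between observation vectors."""
    return math.dist(o1, o2)


def reference_actions(observation, d, training_set=TRAINING_SET):
    """f: keep the actions whose stored observation lies within d of the input."""
    return [a for o, a in training_set if dist_o(observation, o) <= d]


def act(actions):
    """Interpolate by averaging; returns None when no reference action exists."""
    return mean(actions) if actions else None
```

For example, `act(reference_actions((1.0, 0.5), 0.6))` blends the two nearby entries, while an observation far from everything in the training set yields an empty reference set and no action at all.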

iii. A still more advanced version implements few-shot learning. The following changes to the agent's model would need to be made:

I won't write all of this out, since it's trivial yet peripheral to my point, but I'll mention it. It's a powerful technique: it's pretty likely that the $A^*_{d|e_i}$ computed this way would have a short $d$ to the next environment-state the agent would find itself in, inasmuch as the agent would already be getting policy data about the specific situation it's in.

(c) is the only challenging bit. A proper implementation of it would make the model a mesa-optimizer. It can be circumvented in various ways, however. For example, the way GPT-3 does it: it essentially assumes that whatever it does is the optimal action, forgoing a reward signal.

iv. But what if $C_{d|e_i} \cap C_T = \varnothing$? That is, what if *none* of the reference policies loaded into the agent's memory are for situations "sufficiently similar" to the one the agent is facing now?

Furthermore, what if this state of affairs is endemic to a given environment?

(4) In detail:

How can the SGD/evolution solve this?

The following structure seems plausible. Decompose f as follows:

*set* of such responses, sampling from the distribution several times. A bounded set of reference actions populated by such entries would have $d = \text{DIST}_E(P(e_i|I) \| e_i)$.

Plugging it into (3), we get:

$$P\left(\frac{\text{optimality}(e_i, \text{agent}(o_i))}{\text{optimality}(e_i, a_{\text{opt}|e_i})} \geq r\right) \propto \frac{1}{\text{DIST}_E(P(e_i|I) \| e_i)}$$

Thus, for situations meeting the conditions of (4), an agent would face a pressure to improve its ability to develop accurate models of the environment.

## 3. Closing Thoughts

There are a few different ways to spread the described functionality around. For example, we may imagine that modelling and optimization happen in act: that $f$ just "primitively" assembles a reference set based on observations, and variants of model and optimize are downstream of it. In terms of telling an incremental story, this might actually be the *better* framing. Also, I probably under-used the orient function; some of the described functionality might actually go into it.

None of this seems to change the conclusion, though. We can define some simple, intuitive environment conditions under which shallow pattern-matching won't work, and in them, *some* part of the agent has to do world-modelling.

From another angle, the purpose of this model is to provide prospective desiderata for *environment-models* suitable for modeling agents. As I'd mentioned in 2.2, I think the lack of good ways to model environments is one of the main barriers to our ability to model agency/goal-directedness.

A weaker result is suggested by optimize. The definition of an optimal action $a_{\text{opt}|e_i}$ performs an explicit search for actions that best maximize some value function. It would be natural to expect that optimize would replicate this mechanic, implying mesa-optimization.

I'm hesitant to make that leap, however. There may be other ways for optimize to extract the correct action from a given environment-state, acting more like a calculator. While the need for a world-model in some conditions appears to be a strong consequence of this formalism, the same cannot be said for goal-driven search. Deriving the necessary environment conditions for mesa-optimization, thus, requires further work.

## Appendix

These are a few formulae that seemed like an obvious thing to derive, but which turned out so messy that I'm not sure they're useful.

1. We can measure the "generalizability" of an environment. That is: given an environment $E$ and a desired level of performance $r$, what is the "range" centered on a random state $e_i$ within which lie environment-states $e_k$ such that the optimal policies for these states are likely near-optimal for the initial state $e_i$?

The answer, as far as I can tell, is this monstrosity:

$$g(r) := \frac{1}{|E|} \sum_{e_i \in E} \max_{l \in [0;+\infty)} \left( l \;\middle|\; P\left( \frac{\text{optimality}\left(e_i, a_{\text{opt}|e \sim \text{unif}(\{e_k \,|\, \text{DIST}_E(e_k \| e_i) \leq l\})}\right)}{\text{optimality}(e_i, a_{\text{opt}|e_i})} \geq r \right) \approx 1 \right)$$

Where $e \sim \text{unif}(\{e_k \,|\, \text{DIST}_E(e_k \| e_i) \leq l\})$ is an environment-state sampled from the uniform distribution over the set of states not farther than $l$ from $e_i$.

For a given $r \approx 1$, this function can be used to measure the intuition I'd referred to back in Part 1: how "tailor-made" do policies need to be to do well in a given environment?

2. Suppose we have an $M$-sized training set $C_T$, and the set of corresponding environment-states $E_{C_T}$, such that every policy in $C_T$ is optimal for some environment-state in $E_{C_T}$. We can define the average distance from a random environment-state to one of the states in $E_{C_T}$:

$$\langle d \rangle := \frac{1}{|E|} \sum_{e_i \in E} \min_{e_k \in E_{C_T}} \left( \text{DIST}_E(e_i \| e_k) \right)$$

3. And now, for a given environment $E$, a training set $C_T$ of size $M$, and a target performance $r$, we can theoretically figure out whether the ML model would need to develop a world-model. The condition is simple:

$$\langle d \rangle > g(r)$$

This, in theory, is a powerful tool. But the underlying functions seem quite... unwieldy. Faster/simpler approximations would need to be found before it's anywhere close to practically applicable.
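To make the second and third items a bit more concrete, here is a small Python sketch on a toy finite environment. It is a sketch under stated assumptions only: the ten integer states, the absolute-difference distance, and the two training states are invented for the example, and $g(r)$ is passed in as a plain number (`g_r`) instead of being computed, since computing it is the unwieldy part.

```python
def average_distance(states, training_states, dist):
    """<d>: mean distance from each state to its nearest training state."""
    nearest = (min(dist(e, t) for t in training_states) for e in states)
    return sum(nearest) / len(states)


def needs_world_model(avg_d, g_r):
    """The condition <d> > g(r); g_r is a supplied, hypothetical value of g(r)."""
    return avg_d > g_r


# Hypothetical environment: ten states on a line, two of them covered by C_T.
states = list(range(10))
training_states = [2, 7]
avg_d = average_distance(states, training_states, lambda a, b: abs(a - b))
```

With these numbers the average distance works out to 1.2, so the condition holds for any generalization range below that and fails for any range above it.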

^{[1]} Expanding the model to support other kinds of utility functions from Agents Over Cartesian World Models seems possible, if a bit fiddly.