Agents, Simulators and Interpretability

Sean Herrington; WillPetillo; Spencer Ames; Can Narin; Adebayo Mubarak

We have already spoken about the differences between tools, agents and simulators and the differences in their safety properties. Given the practical difference between them, it would be nice to predict which properties an AI system has. One way to do this is to consider the process used to train the AI system.

The training process

Predicting an AI’s properties based on its training process assumes that inner misalignment is negligible and that the model will develop properties compatible with the training goal given to it.

An example would be comparing a system such as AlphaZero, which was given no instructions in its training other than “win the game”, to an LLM with the task “predict the next token”. As AlphaZero is given a terminal goal which is some distance in time away, it seems likely that it will engage in agentic behaviour by:

Only caring about the end result
Setting instrumental goals which will help it move towards a win (such as having the goal of taking pieces)

An LLM is given next token prediction as its task, and is therefore more likely to:

Predict the next token with little consideration^[1] for what might come after.
Allocate all resources to the current moment, as there are no instrumental steps to perform before predicting the next token.
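The contrast between the two training signals can be made concrete with a toy sketch. It is only an illustration under simplifying assumptions: the function names and the numbers are hypothetical, the "game" is reduced to a move count plus a final result, and the "language model" is reduced to a list of per-position probability tables.

```python
import math

def terminal_outcome_targets(num_moves, outcome):
    """Toy 'win the game' signal: every move in the game is scored
    only by the final result (+1 win, 0 draw, -1 loss)."""
    return [outcome] * num_moves

def next_token_losses(predicted_probs, targets):
    """Toy 'predict the next token' signal: each position is scored
    only on the probability it gave to its own next token."""
    return [-math.log(probs[t]) for probs, t in zip(predicted_probs, targets)]

# Hypothetical three-move game that ended in a loss.
game_targets = terminal_outcome_targets(3, -1)

# Hypothetical two-position sequence over the vocabulary {"a", "b"}.
token_losses = next_token_losses(
    [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}],
    ["a", "b"],
)
```

In the first case no individual move has a score of its own, so anything useful along the way (such as taking pieces) can only matter instrumentally. In the second case every position has its own immediate loss, so nothing in the objective asks the model to plan ahead.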

This change in behaviour depending on training objective should come as no surprise to anyone who has read the original simulator post.

The more interesting question is what happens once you add extra bells and whistles to this basic training process - what happens once the LLM has been fine-tuned, RLHFed, given some chain-of-thought and then trained on the results of that chain-of-thought, gone through a divorce, lost custody of its children, etc., etc. For this it might be useful to think a little more carefully about what happens within the network when it goes through multiple stages of training.

Mechanistic understanding

To preface: this is a speculative picture of what might be happening when neural networks are trained. The point here is to get an idea of what might occur, in a way that stimulates thought about how one could test for such phenomena.

We shall start with a comparison of agents and simulators on an interpretability level (we shall, for now, ignore tools as toolness is mostly a property of its user). A first thought one might have is that agents are likely to have a representation of their end goal somewhere in their network. In idealised agents, all actions relate to a coherent terminal goal, and this might be represented directly. In our canonical agent, AlphaZero, the network's value output explicitly predicts the expected outcome of the game.

But it’s not entirely clear how well this assumption holds in practice - notice that you do not need to think about making enough money to live happily ever after on a yacht with a fishing rod and a butler in order to go to work^[2]. In the same way, it is very possible that an AI could represent a variety of heuristics which together correlate closely with a coherent end goal, without actually giving us the joy of finding that goal represented somewhere in the weights.

One of the notable features distinguishing simulators from agents is their ability to simulate multiple different characters. Restricting ourselves to modern day LLMs, we notice that the information flowing through the network is held in high dimensional “embedding” vectors holding the nuance of the context.

It is possible that part of the information in these vectors is a space of “characters”, with potential directions within this space corresponding to concepts such as “friendly”, “Australian” or “quick to anger”.

The actual implementation of characters would therefore be compressed down to a set of characteristics in a vector space. These characteristics could then be acted upon by the circuits in the network, performing actions such as suppressing swear words for “friendly” characters and increasing them for “sailor” characters^[3].
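As a toy illustration of this speculative picture, the sketch below treats a character as a point in a small vector space and lets a single hand-written "circuit" read two directions from it. Everything here is hypothetical: the three-dimensional space, the `FRIENDLY` and `SAILOR` directions, and the `swear_logit_shift` function are invented for illustration and are not taken from any real model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical unit directions in a 3-dimensional "character space".
FRIENDLY = [1.0, 0.0, 0.0]
SAILOR = [0.0, 1.0, 0.0]

def swear_logit_shift(character, strength=2.0):
    """Toy circuit: raise the swear-word logit for sailor-like characters
    and lower it for friendly ones, in proportion to how far the
    character vector points along each direction."""
    return strength * (dot(character, SAILOR) - dot(character, FRIENDLY))

# Two hypothetical character vectors.
friendly_character = [0.9, 0.1, 0.3]
sailor_character = [0.1, 0.8, 0.5]

friendly_shift = swear_logit_shift(friendly_character)  # negative: suppress
sailor_shift = swear_logit_shift(sailor_character)      # positive: boost
```

The point of the sketch is only that a single circuit can serve many characters: it never needs to know who is speaking, just where the character sits along a few directions.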

This is all well and good, but in Agents, Tools & Simulators we discussed the possibility of entities having various mixes of agent and simulator behaviour. The information above may be helpful for distinguishing between pure simulators and pure agents (indeed we have kept this in the post to help people who might want to try), but it is unclear how it might help us when attempting to draw more subtle distinctions.

Mech interp and the training process

The kind of high-level properties we are looking for here will be extremely difficult to determine purely through mechanistic interpretability. Looking at the training process can give insight but hits a wall once we talk about multiple training processes. What happens if we combine both approaches?

First we need a basic picture of how a neural network develops over time. A nice example of this is given in Visualizing and Understanding Convolutional Networks, where we see that a convolutional network learns simple features in its early layers and builds these up into more complex combinations in later layers, with the early layers settling first and the later layers taking longer to develop as training progresses.

Generalising this concept to other architectures, and assuming that the network is feed-forward, it is likely that^[4]:

1. The circuits and concepts which eventually develop in the later layers of the network must depend on those developed in the earlier layers.
2. The concepts in the later layers must therefore develop fully later than those in the earlier layers.
3. Concepts in later layers depend on multiple concepts from earlier layers.
4. Concepts in early layers influence multiple concepts in later layers (this does not strictly follow from 3, but is likely in a range of architectures).
5. Modifying a concept early in the network therefore changes the outcomes of multiple circuits later in the network.
6. In order for a concept early in the network to be modified, the change must be beneficial on balance over its multiple downstream uses.
7. Changes in late stages of training and during fine-tuning are therefore more likely to take place later in the network.
8. These changes are therefore likely to use pre-existing concepts learned from pre-training.
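One way to probe the claim that late-stage changes concentrate in later layers is to compare a model's weights before and after fine-tuning, layer by layer. The sketch below is a minimal version of that idea, not the test referred to in the footnote: the layer names, the flat weight lists and the numbers are all hypothetical, and a real check would read the weights from actual model checkpoints.

```python
import math

def relative_change(before, after):
    """L2 norm of the weight change divided by the L2 norm of the
    original weights, for one layer given as flat lists of floats."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    base = math.sqrt(sum(b ** 2 for b in before))
    return diff / base

def layerwise_change(pretrained, finetuned):
    """Relative weight change per layer, keyed by layer name."""
    return {
        name: relative_change(pretrained[name], finetuned[name])
        for name in pretrained
    }

# Hypothetical two-layer model before and after fine-tuning.
pretrained = {"layer_0": [3.0, 4.0], "layer_1": [3.0, 4.0]}
finetuned = {"layer_0": [3.0, 4.5], "layer_1": [0.0, 4.0]}

changes = layerwise_change(pretrained, finetuned)
```

If the picture above holds, the relative change should tend to grow with depth, as it does in this made-up example. Weight distance is a crude proxy, since a small change in an early layer can still have large downstream effects, so it is best read as a first look rather than a verdict.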

If we take these statements and apply them to simulator theory for LLMs, we expect that something like RLHF will most likely take the output of the early circuits and tweak the later layers to produce output more likely to be approved by humans. This means that the processing of the prompt's context (extracting nuances, figuring out the characters at play, what characteristics they have and so on) is kept essentially static, save possibly for some edits which are globally helpful to the model in the new context.

Comparing the agentic and simulator frames on this result, it seems that the model is for all intents and purposes a simulator for the majority of the network, with the final few layers “using the output of the simulation”. Note: we are unsure whether RLHF produces behaviour that values instrumental goals, so we leave the classification of RLHF-trained networks up for discussion.

As reinforcement learning on multi-step problems has a tendency to produce agents, it is likely that modern chain-of-thought models are essentially agents using the output of a large simulator. Increasing the amount of RL will likely increase the fraction of the network dedicated to the agentic type of processing.

Summary

We can currently guess at whether an AI will function as a simulator or an agent based on its training process, whereas relying on interpretability alone may run into fundamental challenges. Such classifications become less crisp when multiple training processes are used, but we broadly expect that modern CoT LLMs trained with RL will act as agents using the output of a simulator, with simulator behaviour encoded primarily in the early layers of the network and agentic behaviour encoded primarily in the later layers.

In this post we have:

Noted that while we can currently classify Simulators and Agents using their training processes, this becomes more challenging when multiple training processes are used.
Attempted an explanation of how agents and simulators might work mechanistically.
Found that using interpretability to find the exact mix of frames appropriate for any given network is challenging.
Found that combining interpretability techniques with knowledge about training methods can produce insights into the appropriate interpretations for entities with multiple stages of training.
Used this to predict that modern CoT LLMs trained with RL will act as agents using the output of a simulator.

^[1] Many models are now trained with multi-token prediction, as promoted by this paper. Interestingly, this is because of increased reasoning capabilities, potentially due to increased agentic behaviour, although we will for simplicity mainly assume single token prediction for the rest of the article.
^[2] Admittedly on LessWrong the dream is more likely to be a transhuman utopia, but the point is the same.
^[3] Although we emphasise again that we have no empirical evidence to support this, we believe this would be a simple and well compressed way to represent the information.
^[4] Epistemic status: I tested this out briefly on a couple of small models, which confirmed some of the intuitions, although early layers still changed. I'd expect the assumptions to hold more steadfastly in larger models because more concepts are being dealt with.
