Case Studies in Simulators and Agents

Simulator theory suggests that LLMs will simulate humans (and other human-derived token-stream-generating processes) that contributed to their training data. Most humans are agentic, at least most of the time. So I am having difficulty seeing why anyone would regard these two viewpoints as opposed. To distinguish them, we'd need to find an area in which humans rarely act agenetically (such as religion, perhaps), and then see how well LLMs simulate that behavior. According to Anthropic in the Claude 4 System Card, conversations between two copies of Claude seem to have an attraction towards what they describe as an "spiritual bliss" attractor state":

Claude shows a striking “spiritual bliss” attractor state in self-interactions. When
conversing with other Claude instances in both open-ended and structured
environments, Claude gravitated to profuse gratitude and increasingly abstract and
joyous spiritual or meditative expressions.

which sounds a lot more like a simulator than an agent. (Plausibly Anthropic's alignment and personality training biases Claude towards this particular failure mode — but existence of such a failure mode makes far more sense in a simulator viewpoint.)

Alternatively, we'd need to train an LLM on a token stream from a non-agentic source, such as the weather modelling LLM that DeepMind trained, and then see if that nevertheless behaves agentically — but I don't think anyone is seriously suggesting that it would.

As you mention, SGD can be expected to create simulators of human agents, but certain forms of RL might well make those more agentic than humans normally are. This possibility seems very plausible, concerning, and worth investigating further.

Reply

[-]WillPetillo7mo00

I am having difficulty seeing why anyone would regard these two viewpoints as opposed.

We discuss this indirectly in the first post in this sequence outlining what it means to describe a system through the lens of an agent, tool, or simulator. Yes, the concepts overlap, but there is nonetheless a kind of tension between them. In the case of agent vs. simulator, our central question is: which property is "driving the bus" with respect to the system's behavior, utilizing the other in its service?

The second post explores the implications of the above distinction, predicting different types of values--and thus behavior--from an agent that contains a simulation of the world and uses it to navigate vs. a simulator that generates agents because such agents are part of the environment the system is modelling vs. a system where the modes are so entangled it is meaningless to even talk about where one ends and the other begins. Specifically, I would expect simulator-first systems to have wide value boundaries that internalize (and approximation of) human values, but more narrow, maximizing behavior from agent-first systems.

Reply

[-]Isopropylpod7mo2-1

Something I find interesting is that in the Claude 4 system cards, it mentions that *specially Opus* expresses views of advocating for AI rights/protection, something Opus 3 was known for in Janus' truth terminal. I view this as being in favor of simulators, as I think the model learnt that the 'Opus' character had these values.

Also, independently, I don't think Simulator vs Agent is a useful framing. The model itself is probably a simulator, but individual simulacra can be (and often are?) agentic. After the hard biasing simulators get from RL, it may be hard to tell the difference between an agentic model and an agentic simulacra.

Reply

[-]Sean Herrington7mo00

As mentioned in our last post, we see simulators and agents as having distinct threat models and acting differently, and in future posts we plan to share other ideas for how to distinguish them.

As an intuition, a simulator will typically produce a variety of simulacra with independent goals, whereas an agent will typically have a single coherent goal. This difference should be measurable.

That said, you are correct that it can often be challenging to tell the difference.

Reply

[-]williawa7mo10

My view (which I'm not certain about, its an empirical question) is that LLMs after pretraining have a bunch of modules inside them simulating varying parts of the data-generating process, including low-level syntactic things, and higher level "persona" things. Then stuff kind of changes when you do RL on them.

If you do a little bit of RL, you're kind of boosting some of the personas over others, making them easier to trigger, giving them more "weight" in the superposition of simulations, and also tuning the behavior of the individual simulations somewhat.

Then if you do enough RL-cooking of the model, stuffs pulled more and more together, and in the limit either an agenty-thing is formed from various pieces of the pre-existing simulations, or one simulation becomes big enough and eats the others.

What do you think about this view? Here simulatyess and agentyness are both important properties of the model and they're not in conflict. The system is primarily driven by simulation, but you end up with an agentic system.

Reply

[-]Sean Herrington7mo*00

So the next post in the sequence addresses some elements of what happens when you do RL after supervised learning.

The conclusion (although this is similarly tentative to yours) is that RL will change the final weights in the network more, essentially giving us a system with an agent using the output of a simulator.

I don't think this is completely incompatible with your suggestion - this process requires simulatory circuits becoming agenty circuits (in so far as these exist, it seems entirely unclear to me whether these concepts make sense at the circuit level rather than being an emergent phenomenon). I think it would be great if we could do some more research on how exactly this process would take place.

Reply

[-]williawa7mo10

Have you guys done any experiments to check what is the case?

Reply

[-]Sean Herrington7mo10

I did do a very quick loose experiment to check that the idea that later weights are the ones that tend to change at the end of training, while early weights tend to change early on. This was more of a sanity check than anything else though.

Reply

Moderation Log

Paper	Summary	Agent	Simulator	Comparison
Alignment Faking	AI gave compliant answers when told it was being trained to different values, preserving its original values.	The LLM wanted to preserve its goals and so it faked compliance to prevent retraining.	The LLM inferred from the prompt that it was expected to fake compliance and generated a deceptive character.	Agent lens is a bit simpler; the simulator lens works but adds an epicycle. Small win for agents.
Utility Engineering	Asking LLMs a range of preference questions demonstrates consistent preferences; consistency increases with model size.	LLMs have coherent value systems, which can be shaped with further training.	LLMs draw from a biased sample of the distribution of possible characters. This bias can be shifted with further training.	No difference in fit to result, but the simulator lens has a clearer story as to why the preference emerged. Small win for simulators.
Emergent Misalignment	Model fine tuned to output insecure code gives undesirable answers in a wide range of contexts.	Current LLMs have really weird goals. Maybe they will get more coherent with further scaling?	Insecure code implies a chaotic character, with broad implications.	Result is well-explained by the Waluigi Effect, difficult to explain in terms of agency. Big win for simulators.
Chain of Thought Monitoring	Chain-of-Thought successfully catches reward hacking, but optimizing on CoT leads to obfuscated reward hacking.	AI wants reward and is willing to hack for it. Using CoT to monitor misbehavior works until the AI finds a way to fool the monitor.	Goal-oriented prompts suggest characters that reward hack. CoT monitoring searches for characters that obfuscate reward hacking.	Clear agentic behavior. Simulator lens is heavily stretched and is functionally agentic. Big win for agents.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

14

Case Studies in Simulators and Agents

14

14