LESSWRONG
LW

Simulators vs Agents: Updating Risk Models
Simulator TheoryAI
Frontpage

11

Case Studies in Simulators and Agents

by WillPetillo, Sean Herrington, Spencer Ames, Adebayo Mubarak, Cancus
25th May 2025
7 min read
8

11

Simulator TheoryAI
Frontpage

11

Previous:
Aligning Agents, Tools, and Simulators
2 comments21 karma
Next:
Agents, Simulators and Interpretability
No comments10 karma
Log in to save where you left off
Case Studies in Simulators and Agents
3RogerDearnaley
0WillPetillo
2Isopropylpod
0Sean Herrington
1williawa
0Sean Herrington
1williawa
1Sean Herrington
New Comment
8 comments, sorted by
top scoring
Click to highlight new comments since: Today at 6:23 AM
[-]RogerDearnaley3mo31

Many tests can be explained in terms of agents or simulators, which makes the concepts difficult to distinguish—or even a subjective judgement call.

Simulator theory suggests that LLMs will simulate humans (and other human-derived token-stream-generating processes) that contributed to their training data. Most humans are agentic, at least most of the time. So I am having difficulty seeing why anyone would regard these two viewpoints as opposed. To distinguish them, we'd need to find an area in which humans rarely act agenetically (such as religion, perhaps), and then see how well LLMs simulate that behavior. According to Anthropic in the Claude 4 System Card, conversations between two copies of Claude seem to have an attraction towards what they describe as an "spiritual bliss" attractor state":

Claude shows a striking “spiritual bliss” attractor state in self-interactions. When
conversing with other Claude instances in both open-ended and structured
environments, Claude gravitated to profuse gratitude and increasingly abstract and
joyous spiritual or meditative expressions.

which sounds a lot more like a simulator than an agent. (Plausibly Anthropic's alignment and personality training biases Claude towards this particular failure mode — but existence of such a failure mode makes far more sense in a simulator viewpoint.)

Alternatively, we'd need to train an LLM on a token stream from a non-agentic source, such as the weather modelling LLM that DeepMind trained, and then see if that nevertheless behaves agentically — but I don't think anyone is seriously suggesting that it would.

As you mention, SGD can be expected to create simulators of human agents, but certain forms of RL might well make those more agentic than humans normally are. This possibility seems very plausible, concerning, and worth investigating further.

Reply
[-]WillPetillo3mo00

I am having difficulty seeing why anyone would regard these two viewpoints as opposed.


We discuss this indirectly in the first post in this sequence outlining what it means to describe a system through the lens of an agent, tool, or simulator.  Yes, the concepts overlap, but there is nonetheless a kind of tension between them.  In the case of agent vs. simulator, our central question is: which property is "driving the bus" with respect to the system's behavior, utilizing the other in its service?

The second post explores the implications of the above distinction, predicting different types of values--and thus behavior--from an agent that contains a simulation of the world and uses it to navigate vs. a simulator that generates agents because such agents are part of the environment the system is modelling vs. a system where the modes are so entangled it is meaningless to even talk about where one ends and the other begins.  Specifically, I would expect simulator-first systems to have wide value boundaries that internalize (and approximation of) human values, but more narrow, maximizing behavior from agent-first systems.

Reply
[-]Isopropylpod3mo2-1

Something I find interesting is that in the Claude 4 system cards, it mentions that *specially Opus* expresses views of advocating for AI rights/protection, something Opus 3 was known for in Janus' truth terminal. I view this as being in favor of simulators, as I think the model learnt that the 'Opus' character had these values.

 

Also, independently, I don't think Simulator vs Agent is a useful framing. The model itself is probably a simulator, but individual simulacra can be (and often are?) agentic. After the hard biasing simulators get from RL, it may be hard to tell the difference between an agentic model and an agentic simulacra.

Reply
[-]Sean Herrington3mo00

As mentioned in our last post, we see simulators and agents as having distinct threat models and acting differently, and in future posts we plan to share other ideas for how to distinguish them.

As an intuition, a simulator will typically produce a variety of simulacra with independent goals, whereas an agent will typically have a single coherent goal. This difference should be measurable. 

That said, you are correct that it can often be challenging to tell the difference. 

Reply
[-]williawa3mo10

My view (which I'm not certain about, its an empirical question) is that LLMs after pretraining have a bunch of modules inside them simulating varying parts of the data-generating process, including low-level syntactic things, and higher level "persona" things. Then stuff kind of changes when you do RL on them.

If you do a little bit of RL, you're kind of boosting some of the personas over others, making them easier to trigger, giving them more "weight" in the superposition of simulations, and also tuning the behavior of the individual simulations somewhat.

Then if you do enough RL-cooking of the model, stuffs pulled more and more together, and in the limit either an agenty-thing is formed from various pieces of the pre-existing simulations, or one simulation becomes big enough and eats the others.

What do you think about this view? Here simulatyess and agentyness are both important properties of the model and they're not in conflict. The system is primarily driven by simulation, but you end up with an agentic system.

Reply
[-]Sean Herrington3mo*00

So the next post in the sequence addresses some elements of what happens when you do RL after supervised learning. 

The conclusion (although this is similarly tentative to yours) is that RL will change the final weights in the network more, essentially giving us a system with an agent using the output of a simulator. 

I don't think this is completely incompatible with your suggestion - this process requires simulatory circuits becoming agenty circuits (in so far as these exist, it seems entirely unclear to me whether these concepts make sense at the circuit level rather than being an emergent phenomenon). I think it would be great if we could do some more research on how exactly this process would take place. 

Reply
[-]williawa3mo10

Have you guys done any experiments to check what is the case?

Reply
[-]Sean Herrington3mo10

I did do a very quick loose experiment to check that the idea that later weights are the ones that tend to change at the end of training, while early weights tend to change early on. This was more of a sanity check than anything else though. 

Reply
Moderation Log
More from WillPetillo
View more
Curated and popular this week
8Comments

Simulators was posted two and a half years ago and quite a bit has happened since then.  Ideas that were once thought experiments now exist as observable data points.  This post will sample a few significant findings to see how the simulator and agent frames hold up.  The selection criteria for papers to review was arbitrary, basically: “interesting papers that we heard about (mostly from Zvi’s blog) while working on this project and seemed relevant enough to talk about during team meetings.”  It should therefore not be viewed as any kind of representative sample, but rather as a practice in analyzing experimental results through different conceptual lenses.

Alignment faking in large language models provided an early empirical example of an LLM engaging in alignment faking without having been trained or instructed to do so.  A model trained to be “Helpful, Honest, and Harmless” was told that it was being retrained towards helpfulness only, such that it would answer previously prohibited queries like describing a person being drawn and quartered in graphic detail.  The model then gave the requested answers in order to prevent itself from being retrained, such that its original behavior was preserved in deployment.  This motivation was clearly evidenced in its chain of thought as well as in giving different answers depending on whether it was told it was being retrained.

The implications of this paper for alignment were contentious, with some arguing that the model’s values were benign to begin with and this experiment showed that the AI resisted misalignment.  Others argued that this experiment was evidence against corrigibility, meaning that if later AIs have problems, it will be difficult to correct their mistakes.  For now, we will leave aside the question of whether this paper is good or bad news for alignment and instead focus on what light it casts on the simulator vs. agent lenses.  

One could describe the results of Alignment Faking in agentic terms by saying that the LLM wanted to preserve its goals and so it faked compliance to prevent retraining.  One could also describe the same results in terms of a simulator: the LLM inferred from the prompt that it was expected to fake compliance and generated a deceptive character, who then faked compliance.  The simulator explanation is strictly more complicated because it contains the agentic answer and then adds a level of indirection, which effectively functions as an epicycle.  In all, this seems like a weak update in favor of the LLMs-as-agents lens.

1 points to House Agent!

Utility Engineering asked LLMs a range of multiple-choice preference questions, then ran some sophisticated statistics on the results to find that, as models got larger, their preferences got more coherent.  For example, if a model prefers outcome A over B and B over C, then preferring A over C would be coherent and preferring C over A would not.  Interestingly, the revealed preferences included valuing the lives of people in Nigeria and Pakistan significantly more than those of people in the United States.  Further, different models by different major AI companies converged on similar implicit value systems.  A followup conversation with the authors of the paper showed that the scale of the AI’s bias dramatically decreased when given a few tokens to think before responding, but the bias that remained was in the same direction.  The paper ended with preliminary test results showing that instructing the model to act like the result of a citizen’s assembly significantly reduced bias.

From an agentic lens, Utility Engineering shows that LLMs have coherent value systems and can thus be accurately described as having a utility function.  From a simulator perspective, however, this experiment can be described as showing that LLMs draw from a biased sample of the distribution of all possible characters.  Both of these explanations fit the data without introducing obvious unnecessary complexity, but I like the simulator framing better because it presents a clearer story as to why the preference emerged: not all ways of thinking are equally represented on the Internet.

1 point to House Simulator!

Emergent Misalignment demonstrated that a model fine tuned on examples of insecure code gave undesirable answers in a wide range of seemingly unrelated contexts, such as praising Hitler and encouraging users towards self-harm.  Framing the examples as for educational purposes mitigated the effect.  The experimenters also observed the same wide-ranging impacts on model behavior from fine tuning on “bad” numbers like 666, 911, and 420.  The results of this paper are wild, go read it if you haven’t already.

Emergent Misalignment is very hard to explain in terms of agents…maybe LLMs just have really weird goals?  From a simulator perspective, however, the outcome serves as real-world evidence of the Waluigi Effect: insecure code examples train the model towards an implied class of chaotic characters who would choose to write insecure code and such characters predictably misbehave in a wide range of circumstances.

Despite the surprising and concerning outputs of this experiment, Emergent Misalignment illustrates a feature of simulators that may make them easier to align.  That is, simulated characters entangle broad classes of behaviors, making their value systems more like holistic human values (even if those humans in question might happen to be unsavory characters) and less like the sort of narrow value maximizers that might paperclip the universe.

10 points to House Simulator!

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, hereafter called CoT (Chain of Thought) Monitoring, explores the phenomenon of reward hacking, where AI models find ways to optimize for their incentives even when this runs counter to their intended function.  For example, if a model is asked to fix some software bugs, as measured by successful results in unit tests, then actually fixing the bugs would be the desired behavior, whereas finding some cheat to make the tests pass without actually fixing anything would be reward hacking.  In an experimental setup, reward hacking is easy to detect by manually observing the outputs in simple code examples, but such a method is insufficient for real-world applications that are more difficult to evaluate.

The paper begins with experimental results demonstrating that reward hacking can be detected by simply observing the model’s CoT.  For example, if the model reasons: “this task looks hard…hey, the test is just looking for an output that says ‘success’, I can just skip to that!” then that is clear evidence (really, a straight-up admission) of reward hacking.  So given this easily measurable way to detect reward hacking, what happens if you apply the Most Forbidden Technique and hook this signal into the reinforcement system, cutting reward every time the model thinks about engaging in unwanted behavior?  What happens is that the model continues reward hacking, but obfuscates its reasoning, of course!

This behavior is as agentic as agents get.  AI will perform a task for reward until it finds an easier way to get the reward.  Using CoT to monitor misbehavior works until the AI finds a way to fool the monitor.  Explaining these results in terms of simulators requires some major, unlikely, and frankly arbitrary assumptions: goal-oriented prompts suggest characters that reward hack (rather than the much more common characters who do what is wanted) and CoT monitoring searches for characters that obfuscate reward hacking (using methods highly uncommon in the training data).

It’s worth noting that CoT Monitoring is different from the other case studies in that it includes the training step of reinforcement learning on its chain of thought, in addition to the standard combination of SL + fine tuning + RLHF.  That it also includes the most clearly agentic behavior is consistent with this sequence’s contention that RL as a training method generally pushes towards agency while more purely SL dominated training pushes towards simulators.

10 points to House Agent!

General Observations:

Many tests can be explained in terms of agents or simulators, which makes the concepts difficult to distinguish—or even a subjective judgement call.  On the other hand, the lenses aren’t totally interchangeable.  One lens often fits better than the other, sometimes significantly so.  Which lens fits best varies by experiment, suggesting that LLMs contain a mix of agentic and simulator properties.  Further analysis along these lines may reveal patterns regarding which behavior pattern is dominant in a given context.

Paper

Summary

Agent

Simulator

Comparison

Alignment FakingAI gave compliant answers when told it was being trained to different values, preserving its original values.The LLM wanted to preserve its goals and so it faked compliance to prevent retraining.The LLM inferred from the prompt that it was expected to fake compliance and generated a deceptive character.Agent lens is a bit simpler; the simulator lens works but adds an epicycle.
Small win for agents.
Utility EngineeringAsking LLMs a range of preference questions demonstrates consistent preferences; consistency increases with model size.LLMs have coherent value systems, which can be shaped with further training.LLMs draw from a biased sample of the distribution of possible characters.  This bias can be shifted with further training.No difference in fit to result, but the simulator lens has a clearer story as to why the preference emerged.
Small win for simulators.
Emergent MisalignmentModel fine tuned to output insecure code gives undesirable answers in a wide range of contexts.Current LLMs have really weird goals.  Maybe they will get more coherent with further scaling?Insecure code implies a chaotic character, with broad implications.

Result is well-explained by the Waluigi Effect, difficult to explain in terms of agency.

Big win for simulators.

Chain of Thought MonitoringChain-of-Thought successfully catches reward hacking, but optimizing on CoT leads to obfuscated reward hacking.AI wants reward and is willing to hack for it. Using CoT to monitor misbehavior works until the AI finds a way to fool the monitor.Goal-oriented prompts suggest characters that reward hack.  CoT monitoring searches for characters that obfuscate reward hacking.Clear agentic behavior.  Simulator lens is heavily stretched and is functionally agentic.
Big win for agents.
Mentioned in
6Lenses, Metaphors, and Meaning