This post was written as part of AISC 2025.
Introduction
In Agents, Tools, and Simulators, we outlined several lenses for conceptualizing LLM-based AI, with the intention of defining what simulators are in contrast to their alternatives. This post considers the alignment implications of each lens in order to establish how much we should care whether LLMs are simulators or not. We conclude that agentic systems seem to have a high potential for misalignment, simulators carry a mild to moderate risk, tools push the dangers elsewhere, and the potential for blended paradigms muddies this evaluation.
Alignment for Agents
The basic problem of AI alignment, under the agentic paradigm, can be summarized as follows: we cannot fully specify what we want, so we give the agent a proxy objective; the agent optimizes that proxy as capably as it can; and under sufficient optimization pressure, the gap between the proxy and our actual intent widens into behavior we never intended.
The above logic is an application of Goodhart's Law, the simplified version of which states that a measure that becomes a target ceases to be a good measure. Indeed, all of the classic problems of alignment can be thought of in terms of Goodhart's Law:
In short, a sufficiently capable optimizing agent will reshape the world in ways that humans did not intend and cannot easily correct. These problems are amplified by the theory of instrumental convergence, which predicts that many objectives tend to incentivize similar sub-goals such as self-preservation, resource accumulation, and power-seeking—collectively turning a merely undesirable outcome into a potentially irrecoverable catastrophe.
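As a toy illustration of the Goodhart dynamic (a minimal numpy sketch with invented quantities, not a model of any real system), suppose each candidate action has a true value we care about and a proxy score that adds a heavy-tailed, exploitable term on top of it. At low optimization pressure the proxy tracks the true value reasonably well; selecting hard on the proxy increasingly selects for the exploitable term instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_by_proxy(pool_size, trials=200):
    """Average result of picking the best-by-proxy candidate from a pool.

    'True value' is what we actually care about; the proxy adds a
    heavy-tailed, exploitable term (ways of gaming the metric).
    """
    proxy_scores, true_scores = [], []
    for _ in range(trials):
        true_value = rng.normal(size=pool_size)
        exploit = rng.standard_t(df=2, size=pool_size)   # heavy-tailed term
        proxy = true_value + exploit
        best = np.argmax(proxy)          # optimize the measure, not the goal
        proxy_scores.append(proxy[best])
        true_scores.append(true_value[best])
    return np.mean(proxy_scores), np.mean(true_scores)

# More optimization pressure = selecting from ever larger candidate pools.
for pool_size in (10, 100, 10_000):
    proxy, true_val = select_by_proxy(pool_size)
    print(f"pool={pool_size:>6}  selected proxy={proxy:8.2f}  "
          f"selected true value={true_val:5.2f}")
```

With these invented numbers, the selected proxy score keeps climbing as the pool grows, while the selected true value stalls: the measure has stopped being a good measure.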
But why is it so difficult to specify the right values, such that the measure is the target? The first problem is that human values appear to form a complex system, whereas a value specification can be at most complicated if it is to remain comprehensible and thus verifiable. If human values could simply be enumerated in a list, philosophers and psychologists would have done it already. A full understanding requires knowing how values relate to and synergize with one another. Just as removing a keystone species can cause a trophic cascade, dramatically altering an entire ecosystem, undermining even one value can unravel the network of principles that describe what people care about.
The second problem is that true values are often hard to measure, leaving space for proxies to emerge during the learning process. These proxies can entrench themselves as values in their own right, shift the system's learned values over time, and remain difficult to detect until they cause unexpected behavior when the agent encounters a new context.
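A small sketch of this dynamic, using an invented two-feature task (the setup and numbers are purely illustrative): a classifier is trained in a context where a spurious proxy feature almost perfectly tracks the intended label, so it leans on the proxy; the learned proxy only becomes visible when the correlation breaks in a new context.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, proxy_reliability):
    """Binary task whose label genuinely depends on feature 0 ('real').

    Feature 1 is a proxy that merely co-occurs with the label at the given
    reliability; it carries no signal once that correlation breaks.
    """
    label = rng.integers(0, 2, size=n)
    real = label + 0.5 * rng.normal(size=n)          # weak but genuine signal
    keep = rng.random(n) < proxy_reliability
    proxy = np.where(keep, label, 1 - label) + 0.1 * rng.normal(size=n)
    return np.column_stack([real, proxy]), label

# Training context: the proxy tracks the label almost perfectly.
X_train, y_train = make_data(5_000, proxy_reliability=0.98)
model = LogisticRegression().fit(X_train, y_train)

# New context: the correlation breaks; only the genuine feature still helps.
X_new, y_new = make_data(5_000, proxy_reliability=0.5)

print("learned weights (real, proxy):", model.coef_.round(2))
print("accuracy, training context:", round(model.score(X_train, y_train), 3))
print("accuracy, new context:     ", round(model.score(X_new, y_new), 3))
```

The learned weights reveal the entrenched proxy, but in practice nothing forces anyone to look at them until the accuracy drop shows up in deployment.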
Alignment for Tools
A tool is fundamentally safer than an agentic system because its behavior serves as an extension of its operator's intentions, but tools are not without their own challenges.
First, tools can empower bad actors with enhanced decision-making, automation, and persuasive capability, and can otherwise enable harmful behavior at scale. Second, even well-intentioned users can make mistakes when using powerful tools, whether directly or through unintended second-order consequences. As an example of the latter, AI-powered recommendation systems can amplify social divisions by reinforcing echo chambers and polarization. In short, the lack of agency in a tool does not eliminate the problem of alignment; it merely shifts the burden of responsibility onto the operator.
In addition, tools are limited by their passivity, relying on operator initiation and oversight. Thus, the existence of tool-like AI, no matter how advanced, will not eliminate the incentives to create agentic AI. If today’s AI systems are in fact tools, tomorrow’s may not be.
Alignment for Simulators
From a safety perspective, simulators may at first seem to be a special kind of tool, passively mirroring patterns to aid their operators in information-centric tasks rather than actively optimizing the world. The ability to summon highly agentic patterns (simulacra), however, raises the question of whether the risks associated with Goodhart's Law still apply, albeit in a more subtle way.
One potential distinction between simulators and agentic AI systems is the presence of wide value boundaries. A simulator models the wide range of human values present in its training data rather than optimizing for a far narrower subset, such as might be engineered into a feedback signal. Even this range is limited, however, since the training data represents a biased sample of the full spectrum of human values: some values may be underrepresented or entirely absent, and those that are present may not appear in proportion to their real-world prevalence. Ensuring that this representation matches any specific notion of fairness is harder still. Assessing the severity and impact of this bias is a worthwhile endeavor but out of scope for this analysis. In any case, when a simulacrum is generated, its values emerge in the context of this broader model.
The order of learning seems important here. Traditional agentic systems—such as those produced by RL—typically begin with a set of fixed goals and then develop a model of the environment to achieve those goals, making them susceptible to Goodharting. By contrast, a simulacrum that first learns a broad, approximate model of human values and then forms goals within that context may be more naturally aligned. If simulators happen to align well by default due to their training structure, it would be unwise to discard this potential advantage in favor of narrowly goal-driven architectures.
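The contrast can be caricatured with a toy decision problem (the "norm", the profit numbers, and the weighting are all invented for illustration): an agent-first system takes a fixed objective and simply argmaxes over the world, while a simulator-first system first fits a broad model of what humans tend to do and only afterwards lets a goal reweight that learned behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: 1,000 candidate actions with a scalar 'profit'. In this toy
# setup the most profitable few happen to violate a human norm (think of
# them as the insider-trading moves from the example below).
n_actions = 1_000
profit = rng.normal(size=n_actions)
violates_norm = profit > np.quantile(profit, 0.99)

# --- Agent-first: fixed goal first, world knowledge only in its service ---
# The objective never mentions the norm, so the argmax lands on a violation.
agent_choice = int(np.argmax(profit))

# --- Simulator-first: broad model of human behavior first, goal added later
# "Training data" preference: humans like profit but strongly avoid norm
# violations, so violations get almost no probability mass.
human_policy = np.exp(profit) * np.where(violates_norm, 1e-3, 1.0)
human_policy /= human_policy.sum()

# The goal ("make money") arrives afterwards as a reweighting of the learned
# behavior, not as an unconstrained argmax over the world.
sim_policy = human_policy * np.exp(2.0 * profit)
sim_policy /= sim_policy.sum()

print(f"agent-first choice: profit={profit[agent_choice]:.2f}, "
      f"violates norm={bool(violates_norm[agent_choice])}")
print(f"simulator-first:    expected profit={sim_policy @ profit:.2f}, "
      f"P(norm violation)={sim_policy[violates_norm].sum():.1%}")
```

The simulator-first policy gives up some profit but keeps the goal inside the value boundaries it learned first, which is exactly the ordering advantage described above.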
Serious risks emerge when agentic AI systems are built on top of simulators. Wrapping a simulator in an external optimization process might create an agent which then leverages its internal simulator as a tool for better understanding and manipulating its environment in service of its narrow goals. This is a path to the layered combination of simulation and agency, with agency as the guiding force. Embedding an optimization process within the simulator (such as through fine-tuning and RLHF) could erode its broad values—and by extension its alignment—with unintended and far-reaching consequences.
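A sketch of the first failure mode, using a crude stand-in for the simulator (a real one would be an LLM; all names and numbers here are invented): an external best-of-n wrapper samples candidate actions from the simulator and keeps whichever scores highest on a narrow objective. The harder the wrapper optimizes, the more it amplifies the rare norm-violating proposals the simulator occasionally produces.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_candidates(n):
    """Stand-in simulator: returns plausible human-like candidate actions.

    Most candidates respect human norms; a small fraction do not, and in
    this toy setup the norm-violating ones score higher on the narrow
    objective (profit).
    """
    profit = rng.normal(size=n)
    norm_ok = rng.random(n) > 0.05            # violations are rare proposals
    profit = np.where(norm_ok, profit, profit + 2.0)
    return profit, norm_ok

def best_of_n(n):
    """External wrapper: keep only the candidate with the highest profit."""
    profit, norm_ok = simulate_candidates(n)
    best = int(np.argmax(profit))
    return profit[best], norm_ok[best]

# More optimization pressure (larger n) selects harder for the rare
# norm-violating proposals the simulator occasionally produces.
for n in (1, 10, 100, 1_000):
    trials = [best_of_n(n) for _ in range(500)]
    violation_rate = np.mean([not ok for _, ok in trials])
    print(f"best-of-{n:<5} violation rate: {violation_rate:.0%}")
```

Nothing about the simulator changed between the rows; only the strength of the external optimization wrapped around it did.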
Illustrative Example - Insider Trading
Consider an AI system designed for stock trading. Its behavior can fall into three general categories:
We would expect the following from each form of simulator-agent overlap:
Applying these assessments to our stock-trading example, we expect the following:
The choice between simulator-first, agent-first, and blended AI systems is shaped by the motivations of companies and the broader economic forces driving AI development. Firms deploying AI for stock trading are incentivized to maximize profit while minimizing legal and reputational risks, effectively making them agent-first systems in a highly monitored environment. One should therefore expect such firms to push their AI systems towards agency insofar as they can (1) get away with it, or (2) trust their systems to balance the risks and rewards of aligned/misaligned strategies in a way that mirrors the firm’s balance of profit-seeking vs. risk aversion. To the extent that agentic AIs have an unacceptably high risk-tolerance for their level of capability, more simulator-like AIs seem like a safer, more risk-averse alternative.
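As a toy illustration of that trade-off (the gains, penalty probabilities, and risk-aversion weights are all invented), the preferred strategy flips as the firm puts more weight on tail risk:

```python
# Illustrative strategies: an aggressive agent-first deployment earns more
# but carries a larger chance of a severe legal/reputational penalty.
strategies = {
    "agent-first (aggressive)":      {"gain": 10.0, "p_penalty": 0.05,  "penalty": 200.0},
    "simulator-like (conservative)": {"gain": 6.0,  "p_penalty": 0.005, "penalty": 200.0},
}

for risk_aversion in (0.0, 0.5, 1.0):        # 0 = ignores tail risk entirely
    print(f"\nrisk aversion = {risk_aversion}")
    for name, s in strategies.items():
        # Risk-adjusted value: expected gain minus the weighted expected penalty.
        value = s["gain"] - risk_aversion * s["p_penalty"] * s["penalty"]
        print(f"  {name:<32} risk-adjusted value = {value:6.2f}")
```

With these numbers the aggressive strategy wins only when tail risk is heavily discounted, which is the same conclusion as above: the more seriously a firm weighs misalignment risk relative to capability, the more attractive the simulator-like option becomes.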