Aligning Agents, Tools, and Simulators
This post was written as part of AISC 2025.

Introduction

In Agents, Tools, and Simulators, we outlined several lenses for conceptualizing LLM-based AI, with the intention of defining what simulators are in contrast to their alternatives. This post considers the alignment implications of each lens in order to establish how much we should care whether LLMs are simulators or not. We conclude that agentic systems have a high potential for misalignment, simulators carry a mild to moderate risk, tools push the dangers elsewhere, and the potential for blended paradigms muddies this evaluation.

Alignment for Agents

The basic problem of AI alignment, under the agentic paradigm, can be summarized as follows:

1. Optimizing for any set of values pushes the values not included in that set toward zero.
2. People care about more things than we can rigorously define.
3. All values are interconnected, such that destroying some of them destroys the capacity of the others to be meaningful.
4. Therefore, any force powerful enough to actualize some definable set of values will destroy everything people care about.

The above logic is an application of Goodhart's Law, the simplified version of which states that when a measure becomes a target, it ceases to be a good measure. Indeed, all of the classic problems of alignment can be thought of in terms of Goodhart's Law:

1. Goal-misspecification: the developer has a goal in mind but writes an imperfect proxy as a reinforcement mechanism. Pursuit of the proxy causes the system to diverge from the developer's true goal.
2. Mesa-optimization: the AI's goals are complex and feedback is sparse, so it generates easier-to-measure subgoals. Pursuit of these proxy subgoals causes the system to diverge from its own initial goal.
3. Goal-misgeneralization: the AI's goals are optimized for the training data, which does not match the real world. Optimizing to the training data causes the AI to act according to patterns specific to idiosyncrasies of th