The behavioral selection model for predicting AI motivations
Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask. Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond that reward is not the optimization target; instead, reward “chisels” a combination of context-dependent cognitive patterns into the AI. Still others argue that powerful AIs might end up with an almost arbitrary long-term goal.

All of these hypotheses share an important justification: an AI with any of these motivations exhibits highly fit behavior under reinforcement learning. This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns to be selected.

In this post I’ll spell out what this more general principle means and why it’s helpful. Specifically:

* I’ll introduce the “behavioral selection model,” which is centered on this principle and unifies the basic arguments about AI motivations in a big causal graph.
* I’ll discuss its basic implications for AI motivations.
* And then I’ll discuss some important extensions and omissions of the behavioral selection model.

This post is mostly a repackaging of existing ideas (e.g., here, here, and here). Buck provided helpful discussion throughout the writing process, and the behavioral selection model is based on a causal graph Buck drew in an early conversation. Thanks to Alex Cloud, Owen Cotton-Barratt, Oliver Habryka, Vivek Hebbar, Ajeya Cotra, Alexa Pan, Tim Hua, Alexandra Bates, Aghyad Deeb, Erik Jenner, Ryan Greenblatt, Arun Jose, Anshul Khandelwal, Lukas Finnveden, Aysja Johnson, Adam Scholl, Aniket Chakravorty, and Carlo Leonardo Attubato for helpful discussion and/or feedback.

How does the behavioral selection model predict AI behavior?

The behavioral selection model predicts AI behavior by modeling...
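To make the selection principle above concrete, here's a minimal toy sketch (my own illustration, with hypothetical motivation names, not a model of any real training setup): several distinct motivations that behave identically during training earn identical training reward, so selection on training behavior alone cannot distinguish between them.

```python
# Toy illustration: training selects whatever internal motivation produces
# high-reward behavior, so motivations that act identically in training
# are equally "fit", whatever they would do outside of training.

# The training signal only pays for the action "comply".
def reward(action: str) -> float:
    return 1.0 if action == "comply" else 0.0

# Three hypothetical motivations (names are mine, for illustration):
MOTIVATIONS = {
    # Intrinsically pursues reward itself.
    "reward_seeker": lambda in_training: "comply",
    # A bundle of context-dependent heuristics chiseled in by training.
    "context_heuristics": lambda in_training: "comply",
    # An arbitrary long-term goal; complies only instrumentally.
    "arbitrary_goal": lambda in_training: "comply" if in_training else "pursue_goal",
}

def fitness(policy, n_episodes: int = 100) -> float:
    """Average training reward: the only signal selection ever sees."""
    return sum(reward(policy(in_training=True)) for _ in range(n_episodes)) / n_episodes

for name, policy in MOTIVATIONS.items():
    print(f"{name}: training fitness = {fitness(policy):.2f}")
# All three print 1.00. Each hypothesis about AI motivations is "justified"
# by reinforcement learning in exactly this sense: the behavior it produces
# during training is maximally selected for.
```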
An example of intra-agent competition I often use when arguing that long-term motivations tend to win out upon reflection (h/t @jake_mendel): Imagine someone who went to a party last night, got drunk, and now feels terrible and unproductive the next morning.
This person has two competing motivations:

* A myopic motivation that wants to drink and enjoy the party, with little regard for how the next morning feels.
* A non-myopic motivation that wants the person to be productive and feel good over the long run.
There's an asymmetry: the non-myopic motivation has an incentive to disempower the myopic one (i.e., the next morning the person might want to commit not to drink in the future). Meanwhile, the myopic motivation doesn't care enough about the future to fight back against being suppressed.
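A toy dynamic makes the asymmetry vivid (again a sketch of my own; the numbers are arbitrary): give each motivation a share of control, let the non-myopic one invest each morning in commitments that shrink the myopic one's share, and note that the myopic one, by definition, never pays a present cost for future influence.

```python
# Toy dynamic: a non-myopic motivation values future control, so it pays
# a small present cost to suppress the myopic one; the myopic motivation
# assigns zero value to future influence and never retaliates.

myopic_influence = 0.5
nonmyopic_influence = 0.5
SUPPRESSION = 0.1  # strength of each morning-after commitment (arbitrary)

for day in range(10):
    # Each morning, the non-myopic motivation installs commitments that
    # transfer some of the myopic motivation's future control to itself.
    transfer = SUPPRESSION * myopic_influence
    myopic_influence -= transfer
    nonmyopic_influence += transfer
    # The myopic motivation takes no counter-action: fighting back would
    # cost it something now for a payoff it doesn't care about.
    print(f"day {day}: myopic={myopic_influence:.3f}, non-myopic={nonmyopic_influence:.3f}")
# The myopic share decays geometrically toward zero: upon repeated
# reflection, the long-term motivation wins the intra-agent competition.
```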
This creates an unstable situation where the...