In “Hidden Incentives for Auto-Induced Distributional Shift”, we introduce several “unit tests” meant to determine whether a system will pursue instrumental incentives to influence its environment. A system that does so would not be “consequence-blind” (using terminology from https://www.lesswrong.com/posts/c68SJsBpiAxkPwRHj/how-llms-are-and-are-not-myopic), and might seek power or to deceive or manipulate humans it interacts with.
The main unit test basically consists of a simple environment where a system plays the prisoner’s dilemma against its future and past self, i.e. “cooperating” now reduces present reward, but increases reward at the next time step.The original purpose of this unit test was to understand which training set-ups or learning algorithms would produce systems that were consequence-blind. For instance, one surprising result shows that systems trained with myopic (gamma=0) Q-learning of a particular form may fail to be consequence-blind. Supervised learning with a meta-learning outer-loop can as well.
The self-supervised pre-training of LLM seems like it should produce a consequence-blind model, as argued in the linked LW post. However, there are speculative concerns that it might fail to, e.g. because of mesa-optimization or functional decision theory. It might also fail for more mundane reasons, such as imitating agentic behavior in the training distribution. Either type of failure might be informative, but I would find it particularly concerning if failures do not seem “explained away” by imitation.The main issue I see with applying this unit test to pretrained LLMs is answering the question of “How is the goal communicated to the system?” The unit test is basically designed to evaluate learning algorithms, not models. I see two possibilities: