In “Hidden Incentives for Auto-Induced Distributional Shift”, we introduce several “unit tests” meant to determine whether a system will pursue instrumental incentives to influence its environment.  A system that does so would not be “consequence-blind” (using terminology from, and might seek power or to deceive or manipulate humans it interacts with.

The main unit test basically consists of a simple environment where a system plays the prisoner’s dilemma against its future and past self, i.e. “cooperating” now reduces present reward, but increases reward at the next time step.

The original purpose of this unit test was to understand which training set-ups or learning algorithms would produce systems that were consequence-blind.  For instance, one surprising result shows that systems trained with myopic (gamma=0) Q-learning of a particular form may fail to be consequence-blind.  Supervised learning with a meta-learning outer-loop can as well.  

The self-supervised pre-training of LLM seems like it should produce a consequence-blind model, as argued in the linked LW post.  However, there are speculative concerns that it might fail to, e.g. because of mesa-optimization or functional decision theory.  It might also fail for more mundane reasons, such as imitating agentic behavior in the training distribution.  Either type of failure might be informative, but I would find it particularly concerning if failures do not seem “explained away” by imitation.

The main issue I see with applying this unit test to pretrained LLMs is answering the question of “How is the goal communicated to the system?”  The unit test is basically designed to evaluate learning algorithms, not models.  I see two possibilities:

  1. [Less Straightforward]: Attempt to communicate the goal via prompting.  Failure here would be hard to interpret and might be quite prompt-dependent.
  2. [More Straightforward]: fine-tune the LLM with myopic RL in the unit test environment.  Failure here would be more concerning.  Personally, I would consider it fairly strong evidence that the system is malign, and would suggest treating the model with extreme precaution (e.g. not deploying, being very careful about running it for testing/training or otherwise interacting with it) pending further investigation.  On the other hand, if the system was truly malign, it might also be smart enough to realize it was being tested and behave deceptively, especially given the paper is probably in the training data.  In my mind, the basic goal here would be to “catch” the system if it becomes malign during pretraining, before it realizes it’s smart enough to realize it’s being tested.  To best realize this goal might involve creating more complex environments inspired by this; the simplicity of this test might be a “give away” for a system with a moderate level of situational awareness.  This starts to look more like honeypotting.



New Comment