Whack-a-mole: generalisation resistance could be facilitated by training-distribution imprinting
TL;DR: Models might form detailed representations of the training task distribution and use these representations to sandbag at deployment time by exploiting even subtle distribution shifts. If successive rounds of training update the model’s representations of the task distribution rather than its underlying tendencies, then we could be left playing whack-a-mole: having to retrain the model after even the smallest distribution shifts. I give a detailed description of this failure mode, potential interventions that labs might take, and concrete directions for model organisms research.

Epistemic status: This started off as an internal memo to explain my research proposal of constructing ‘model organisms resisting generalisation’ to my MARS 4.0 team; I then realised it might be valuable to share with the community. I see this as ‘an attempt to articulate a detailed example of a generalisation threat’. I haven’t had heaps of time to iterate on the post yet, and realise many of the ideas are already circulating, but I think there is value in this synthesis, in particular with regard to fixing terminology and prioritising future empirical research directions.

Background

As I understand it, AI companies currently spend the majority of post-training compute on curated task suites. A good chunk of these are environments for Reinforcement Learning with Verifiable Rewards (RLVR), typically agentic coding tasks. There are also task suites corresponding to agentic alignment scenarios, as well as standard chat-style RLHF scenarios. There is little effort to ensure the training suite accurately reflects or covers the space of desired deployment situations. As such, labs are largely relying upon favourable generalisation for good performance. For example, a lab might hope that a model trained with a combination of RLVR on coding tasks, RLHF on chat tasks, and some limited RLHF on coding environments will end up being a helpful coding assistant. However, the current success of generalisa