TL;DR

Models might form detailed representations of the training task distribution and use this to sandbag at deployment time by exploiting even subtle distribution shifts. If successive rounds of training update the model's representations of the task distribution rather than its underlying tendencies, then we could be left playing whack-a-mole: having to repeatedly retrain the model after even the smallest distribution shifts. I give a detailed description of this failure mode, potential interventions that labs might take, and concrete directions for model organisms research.

Epistemic status: This started off as an internal memo explaining my research proposal of constructing 'model organisms resisting generalisation' to my MARS 4.0 team; I then realised it might be valuable to share with the community. I see this as 'an attempt to articulate a detailed example of a generalisation threat'. I haven't had heaps of time to iterate on the post yet, and realise many of the ideas are already circulating, but I think there is value in this synthesis, in particular with regard to fixing terminology and prioritising future empirical research directions.

Background

As I understand it, AI companies currently apply the majority of post-training compute to curated task suites. A good chunk of these are environments for Reinforcement Learning with Verifiable Rewards (RLVR), typically agentic coding tasks. There are also task suites corresponding to agentic alignment scenarios, as well as standard chat-style RLHF scenarios. There is little effort to ensure the training suite accurately reflects or covers the space of desired deployment situations. As such, labs are largely relying upon favourable generalisation for good performance. For example, a lab might hope that a model trained with a combination of RLVR on coding tasks, RLHF on chat tasks, and some limited RLHF on coding environments will end up being a helpful coding assistant. However, the current success of generalisa