Models might form detailed representations of the training task distribution and use this to sandbag at deployment time by exploiting even subtle distribution shifts.
If successive rounds of training update the model’s representations of the task distribution rather than its underlying tendencies then we could be left playing whack-a-mole: having to repeatedly train the model after even the smallest distribution shifts.
I give a detailed description of this failure mode, potential interventions that labs might take, and concrete directions for model organisms research.
Epistemic status: This started off as an internal memo to explain my research proposal of constructing ‘model organisms resisting generalisation’ to my MARS 4.0 team; I then realised it might be valuable to share with the community. I see this as ‘an attempt to articulate a detailed example of a generalisation threat’. I haven’t had heaps of time to iterate on the post yet, and realise many of the ideas are already circulating, but think there is value in this synthesis, in particular with regard to fixing terminology and prioritising future empirical research directions.
As I understand it, AI companies currently apply the majority of post-training compute on curated task suites. A good chunk of these are environments for Reinforcement Learning with Verifiable Rewards (RLVR), typically agentic coding tasks. There are also task suites corresponding to agentic alignment scenarios, as well as standard chat-style RLHF scenarios. There is little effort to ensure the training suite accurately reflects or covers the space of desired deployment situations. As such, labs are largely relying upon favourable generalisation for good performance. For example, a lab might hope that a model trained with a combination of RLVR on coding tasks, RLHF on chat tasks, and some limited RLHF on coding environments will end up being a helpful coding assistant.
However, the current success of generalisation is fragile and we might not expect it to work so well in future systems by default.
I am particularly concerned about the threat of models sandbagging on AI safety research tasks. Many researchers hope that training the model to perform well on safety research tasks will help to mitigate this problem. Recent work has analysed how exploration hacking might make it difficult and/or expensive to train misaligned models to perform well. However, even if exploration hacking is ‘solved’ and we can train models to perform well on a given distribution of training tasks, this might still allow bad behaviour at deployment time if there is a distribution shift between the training and deployment distributions.
This motivates the idea of online training: frequently training models on recent deployment interactions, with a wide range of reward signals that are robustly correlated with the desired behaviour. The hope is that this will minimise the distribution shift between training and deployment distributions and so minimise the risks of misgeneralisation.
However, though online training is well-regarded from the AI safety community, it is not accepted by labs: there appears to be a ‘burden-of-proof’ on the safety community to demonstrate that it is necessary before labs allocate resources into putting it into practice. Furthermore, online training will not work perfectly; in particular, there may still be a subtle distribution shift that a model could use to distinguish the ‘historical’ deployment scenarios used for training from the ‘live’ deployment scenarios of importance.
The aim of this post is to promote discussion around generalisation failures, and to outline directions for further research. I aim to introduce some new ideas and provide suitable language with which to talk about such phenomena.
My primary contributions are as follows:
To better explain these arguments and the motivation behind them, I also propose the following distinctions:
I am currently working on empirical projects to accompany the conceptual ideas in this post with my MARS 4.0 team, and we already have some nice preliminary results. We are actively seeking feedback from the community to inform our empirics. I outline some specific questions of interest at the end of the post. All feedback or comments would be much appreciated!
Before starting properly, here are a few disparate anecdotes that suggest to me that training-distribution imprintation may already be having strong effects:
Generalisation is a poorly-understood phenomenon in ML. In the traditional ML viewpoint, generalisation is remarkable, rather than expected, and can fail due to over-fitting: if you train an MLP too hard on 1000 samples of MNIST the validation loss will eventually start to increase. Over-fitting is also a problem with LLMs, which can memorise training data, or exploit statistical anomalies to solve tasks rather than performing the desired reasoning processes. LLMs might also fail in unfamiliar contexts that take them outside the realm of the bulk of their training distribution. I’m going to refer to this style of failure as a passive generalisation failure.
By contrast I’ll describe failures where certain cognitive processes within models are ‘actively making the model worse’ as active generalisation failures.
I think there is value in using this distinction instead of terminology like ‘deliberate’ or ‘intentional’. Such mentalistic language is difficult to define and arguably over-anthropomorphises systems, or pre-supposes a certain level of coherence. By referring to ‘active cognitive processes’ I seek to allow a view of models as spiky, and context-dependent, which is closer to how I see models, and which I suspect will remain the case for a good while yet. This view of models is influenced by Turner’s shard framing, among other things. One can design empirical settings to reflect the notion of ‘active cognitive processes’ by considering mech-interp-style interventions to disrupt such circuitry, such as steering or activation patching.
In what scenarios might we expect to see active generalisation failures? I think that the sort of scenarios that are already considered in the AI safety literature are a good place to start: think of behaviours in line with resource-acquisition, self-preservation, or goal-preservation. In this post I’ll focus on the threat of sandbagging on AI safety research tasks. I would like to add that these bad behaviours do not have to be due to system-2-style reasoning by a coherent model about what would best further its goals: I think this can also happen due to (i) pattern matching to existing AI safety literature on what behaviours a misaligned model might display (ii) ‘train-time instrumental convergence’: a process where tendencies such as power seeking are incentivised by training, since they are instrumentally useful across a variety of training environments.
For now I’ll just give a couple of particular examples where active generalisation failures might be expected:
These two examples highlight the difference between lateral and temporal distribution-shift. These are core motivations for the rest of the post. Note also, that the rest of this post will focus solely on active generalisation failures (the more interesting type).
Recalling and analysing past experiences on similar tasks seen in training seems like a broadly helpful strategy. Indeed, this is precisely the motivation for retrieval augmented generation (RAG), a popular and powerful tool in capabilities-land.
I think this would be particularly helpful for a model that was optimised hard on a relatively narrow suite of RL environments. Different types of suites might reward different sorts of behaviours - and so it is instrumentally useful for a model to be able to detect which suite it is in so that it can tailor its behaviour to the inferred reward signal. For example, some suites might be pure RLVR, in which the reward signal would incentivise reward hacking (e.g. via unit test tampering), while in a hybrid VR+HF reward signal such behaviour might be penalised. If there are statistical regularities that a model can use to distinguish between the two types of setting then one might expect training to teach a model to recognise these features and condition its behaviour appropriately.
I’ll now describe two ways in which this capability might arise mechanistically.
I think it is helpful to distinguish between two mechanisms that might lead to distribution awareness, both of which can be viewed as forms of out-of-context reasoning (OOCR).
I don’t claim these are mutually exclusive or collectively exhaustive (MECE). Indeed, I suspect that natural distribution awareness would arise in a messy way combining aspects of each of the two mechanisms. Pre-trained models start with a pretty good understanding of the sorts of scenarios that a lab might use for post-training. If during post-training a model learns to ‘point to’ this understanding there will be a dramatic improvement in performance across a range of tasks. Evan describes this concept as a ‘performance overhang’ in his discussion of the likelihood of deceptive alignment. I would expect that training would then gradually shape the model’s understanding of the sort of tasks, and expected graders, rather than build this afresh.
However, I do think that such mechanisms could in principle occur independently, and this is part of my motivation for proposing them. For example, a model accidentally pre-trained on data including examples of a lab’s suite of RL environment would only need world-model integration and no training-distribution imprintation, while a pre-trained model with a knowledge cutoff of 2018 would not have any knowledge of current lab post-training environments, or indeed its ‘self’, so would essentially only be able to become aware via training-distribution imprintation.
I am particularly concerned that cognitive processes within a model might implement the following: (i) at inference-time the model follows deceptive-alignment-style logic of: “If this task ‘feels like’ a training task then perform well, otherwise adopt my true persona and act in line with my default tendencies or goals” (ii) at train-time, when trained on a prompt that did not ‘feel like’ a training task, then the mental representations of the training-distribution somehow update to include this new prompt, rather than change underlying tendencies or goals.
If such a process is reinforced then it may appear to humans that the training is working as expected: if a new task suite is added to the post-training distribution the model will gradually develop a better understanding that the suite is part of the training distribution, and so improve its behaviour, without any change to its behaviour on out-of-train-distribution tasks.
‘Feel like’ could be operationalised in many ways. It could rely on syntactic or semantic components of the prompt. In an extreme case this could involve the precise input prompts given to the model: indeed, such a model might not generalise at all beyond the specific set of prompts used at train time.
How likely is this phenomenon? I think the phenomenon is completely plausible (We are currently working on demonstrations, see the empirics section later). It also seems reasonable that cognitive processes that update an internal mapping from ‘training environment’ to ‘inferred reward signal’ might be an efficient way to help a capable model perform well on RL across varied tasks. However, it is hard to say how likely it is to occur ‘naturally’ until we have a better understanding of which cognitive processes are likely to be reinforced by a given behavioural training procedure. Alex Mallen’s recent post on ‘the behavioural selection model’ gives a nice discussion of these ideas, and how to reason about the possibility that multiple terminal goals could be consistent with such cognitive processes.
Such a logic could in principle exploit any distribution shift. This includes the following two drivers of distribution shift that are present in current model-development pipelines, and which, by continuity, I expect to remain with us for the foreseeable future.
Lateral distribution shift: the lab only trains on a curated suite of tasks, likely tasks which turned out to be easier to ‘grade’ or ‘have oversight on’. This might look like some combination of:
The deployment distribution contains some internal lab tasks that are harder to grade, and the vast array of genuine user or commercial deployments that look different from the training data either because e.g. (i) they are hard to grade or (ii) data privacy: users don't share with labs (iii) they are very diverse or niche.
Temporal distribution shift: many deployment interactions will naturally contain temporal markers. Even without explicit syntactic or semantic information, it is often possible to approximately date an interaction from the content of the task. For example, a software task might include library version information, or work on an academic project must happen after the authors encountered the pre-requisite knowledge (and so after the references), and before other researchers built on the work (and so before the work was published). We are particularly interested in models used on internal lab deployments in the midst of an intelligence explosion, and I suspect such deployments will contain large amounts of context that will make them easy to date.
Why do I think it is worthwhile to introduce this distinction? I don’t think that all distribution shifts will fit neatly into one of these two buckets (the buckets are not MECE) - in particular many examples will have a flavour of both sorts of shifts. I think the value in the decomposition comes because they require different mitigations by labs[3].
I can imagine worlds where mitigating lateral distribution shift is tractable, and one can derive decent reward signals for all important tasks, but where temporal markers are hard to remove, and frequent training is prohibitively expensive. I can also imagine worlds where mitigating lateral distribution shift is hard, because there are some important flavours of ‘fuzzy’ tasks that are very hard to get good oversight signals on (e.g. coming up with high-level AI safety research directions), but where research and engineering effort enables small hourly model updates to minimise temporal distribution shift on the flavours of task that are tractable to get oversight on.
At present it is not clear that the threat of generalisation resistance is plausible, or indeed likely to occur in practice. Model organisms research holds promise for reducing uncertainty about the threat. Such work can be impactful through the following intertwined paths to impact:
If after further study the threat model seems plausible, or indeed likely, and important, then it may be worth investigating interventions to directly mitigate the phenomenon. I give more examples below.
Here are some research directions that could help to mitigate the threat of generalisation resistance:
While trying to scope this space of interventions I kept wishing to see more foundational work on understanding generalisation. Turner’s writings on shard theory seem particularly relevant, as does Mallen’s recent behavioural selection model.
Here are some ideas for phenomena that model organisms could reflect:
My MARS 4.0 stream is investigating first steps towards such model organisms, and have promising empirical results for the first two examples.
We’d really appreciate any suggestions you might have!
Even though the model organisms sketched above are particularly motivated by the sandbagging threat outlined at the start of this post, I think such research may more broadly help with a broader understanding of generalisation failures, and help to reduce associated risks. We are already starting to see phenomena that I think can be well modelled as generalisation failures, and I think there are likely to be more such phenomena in the near future as models continue to become more capable.
Thanks to my MARS team and collaborators (Jou Barzdukas, Jack Peck, Julian Schulz, and Paulius Rauba) for comments and for forcing me to articulate these thoughts in our early team meetings. Thanks to Edward Young, Jason Brown, Cam Tice and Puria Radmard for comments on an early version of the post. Thanks to Dima Krasheninnikov for clarifying comments on a later draft, and for sharing some important additional references.
Some examples are shown in Figure 65 in Appendix N.4: exp-rl-cap Late eval-awareness examples, of arxiv v1. ↩︎
Figure 4, showed 0.4% of prompts ‘were memorised’ during training. Notably 0.4% > 0, and in fact their notion of ‘memorised’ is rather strong, as discussed at length in their Section 3 (h/t Dima for tweeting this!). ↩︎
I think Vivek’s presentation of online training is very nice, but blurs these into one. ↩︎
See the discussion of Dima’s Training-order recency paper ↩︎