Whack-a-mole: generalisation resistance could be facilitated by training-distribution imprintation

lennie

TL;DR

Models might form detailed representations of the training task distribution and use this to sandbag at deployment time by exploiting even subtle distribution shifts.

If successive rounds of training update the model’s representations of the task distribution rather than its underlying tendencies then we could be left playing whack-a-mole: having to repeatedly train the model after even the smallest distribution shifts.

I give a detailed description of this failure mode, potential interventions that labs might take, and concrete directions for model organisms research.

Epistemic status: This started off as an internal memo to explain my research proposal of constructing ‘model organisms resisting generalisation’ to my MARS 4.0 team; I then realised it might be valuable to share with the community. I see this as ‘an attempt to articulate a detailed example of a generalisation threat’. I haven’t had heaps of time to iterate on the post yet, and realise many of the ideas are already circulating, but think there is value in this synthesis, in particular with regard to fixing terminology and prioritising future empirical research directions.

Background

As I understand it, AI companies currently apply the majority of post-training compute on curated task suites. A good chunk of these are environments for Reinforcement Learning with Verifiable Rewards (RLVR), typically agentic coding tasks. There are also task suites corresponding to agentic alignment scenarios, as well as standard chat-style RLHF scenarios. There is little effort to ensure the training suite accurately reflects or covers the space of desired deployment situations. As such, labs are largely relying upon favourable generalisation for good performance. For example, a lab might hope that a model trained with a combination of RLVR on coding tasks, RLHF on chat tasks, and some limited RLHF on coding environments will end up being a helpful coding assistant.

However, the current success of generalisation is fragile and we might not expect it to work so well in future systems by default.

I am particularly concerned about the threat of models sandbagging on AI safety research tasks. Many researchers hope that training the model to perform well on safety research tasks will help to mitigate this problem. Recent work has analysed how exploration hacking might make it difficult and/or expensive to train misaligned models to perform well. However, even if exploration hacking is ‘solved’ and we can train models to perform well on a given distribution of training tasks, this might still allow bad behaviour at deployment time if there is a distribution shift between the training and deployment distributions.

This motivates the idea of online training: frequently training models on recent deployment interactions, with a wide range of reward signals that are robustly correlated with the desired behaviour. The hope is that this will minimise the distribution shift between training and deployment distributions and so minimise the risks of misgeneralisation.

However, though online training is well-regarded from the AI safety community, it is not accepted by labs: there appears to be a ‘burden-of-proof’ on the safety community to demonstrate that it is necessary before labs allocate resources into putting it into practice. Furthermore, online training will not work perfectly; in particular, there may still be a subtle distribution shift that a model could use to distinguish the ‘historical’ deployment scenarios used for training from the ‘live’ deployment scenarios of importance.

Contributions

The aim of this post is to promote discussion around generalisation failures, and to outline directions for further research. I aim to introduce some new ideas and provide suitable language with which to talk about such phenomena.

My primary contributions are as follows:

Describing a ‘whack-a-mole’ failure mode, whereby training a model causes the model to modify its understanding of the training distribution, rather than updating its behavioural tendencies at deployment (illustrated below).
Introducing a notion of training-distribution imprintation. This is the possibility that models can memorise certain characteristics of the post-training distribution directly from the post-training process. This complements ‘world-model integration’ as a mechanism behind distribution-awareness.
Describing potential mitigations for the ‘whack-a-mole' failure mode, and distribution awareness more broadly.
Explaining directions for model organisms research to (i) improve awareness of these concerns and (ii) improve scientific understanding of their likelihood and importance.

To better explain these arguments and the motivation behind them, I also propose the following distinctions:

Active vs passive generalisation failures: distinguishing failures where cognitive processes actively make the model worse from standard overfitting-style failures.
Lateral vs temporal distribution shifts: distinguishing distribution shifts due to task-coverage gaps from those arising from temporal markers.

I am currently working on empirical projects to accompany the conceptual ideas in this post with my MARS 4.0 team, and we already have some nice preliminary results. We are actively seeking feedback from the community to inform our empirics. I outline some specific questions of interest at the end of the post. All feedback or comments would be much appreciated!

A nano-banana illustration of the 'whack-a-mole' story as applied to temporal distribution shift across different months of training.

Some anecdotal backing

Before starting properly, here are a few disparate anecdotes that suggest to me that training-distribution imprintation may already be having strong effects:

We now know that frontier models are highly evaluation-aware (eval-aware). Indeed, Ryan and Alexa’s blog post suggests that the high levels of eval-awareness are due to Anthropic training on alignment training environments that were closely related to their alignment evaluation environments.
During a recent talk at the UK AI Forum by Bronson Schoen on stress-testing deliberative alignment, Bronson mentioned that occasionally the model's reasoning traces suggested the model was “trying to recall” similar evaluations and what they rewarded.^[1]
A 2024 paper from GDM included an observation that models could memorise detailed information contained within prompts used for RLHF^[2].
Recent work by Dima from Krueger lab on Training Order Recency suggests that models can learn representations of recency of the training data distribution. Other works also observe intriguing effects related to training data ordering. It seems reasonable that models may also learn other properties of the training data distribution.
Marius reported that the OpenAI models they trained with an anti-scheming curriculum, when asked what their most recent training was, “always confidently said that it was anti-scheming training without any information in-context”.

Distinguishing active generalisation failures from passive ones

Generalisation is a poorly-understood phenomenon in ML. In the traditional ML viewpoint, generalisation is remarkable, rather than expected, and can fail due to over-fitting: if you train an MLP too hard on 1000 samples of MNIST the validation loss will eventually start to increase. Over-fitting is also a problem with LLMs, which can memorise training data, or exploit statistical anomalies to solve tasks rather than performing the desired reasoning processes. LLMs might also fail in unfamiliar contexts that take them outside the realm of the bulk of their training distribution. I’m going to refer to this style of failure as a passive generalisation failure.

By contrast I’ll describe failures where certain cognitive processes within models are ‘actively making the model worse’ as active generalisation failures.

I think there is value in using this distinction instead of terminology like ‘deliberate’ or ‘intentional’. Such mentalistic language is difficult to define and arguably over-anthropomorphises systems, or pre-supposes a certain level of coherence. By referring to ‘active cognitive processes’ I seek to allow a view of models as spiky, and context-dependent, which is closer to how I see models, and which I suspect will remain the case for a good while yet. This view of models is influenced by Turner’s shard framing, among other things. One can design empirical settings to reflect the notion of ‘active cognitive processes’ by considering mech-interp-style interventions to disrupt such circuitry, such as steering or activation patching.

In what scenarios might we expect to see active generalisation failures? I think that the sort of scenarios that are already considered in the AI safety literature are a good place to start: think of behaviours in line with resource-acquisition, self-preservation, or goal-preservation. In this post I’ll focus on the threat of sandbagging on AI safety research tasks. I would like to add that these bad behaviours do not have to be due to system-2-style reasoning by a coherent model about what would best further its goals: I think this can also happen due to (i) pattern matching to existing AI safety literature on what behaviours a misaligned model might display (ii) ‘train-time instrumental convergence’: a process where tendencies such as power seeking are incentivised by training, since they are instrumentally useful across a variety of training environments.

For now I’ll just give a couple of particular examples where active generalisation failures might be expected:

A model may perform well on ML engineering tasks that are part of its training suite while performing badly on ML engineering tasks that occur during deployment interactions with a human.
A model may perform well on internal lab ML engineering tasks that were posed before some cutoff date, and perform poorly in internal lab ML engineering tasks that were posed after this given cutoff date.

These two examples highlight the difference between lateral and temporal distribution-shift. These are core motivations for the rest of the post. Note also, that the rest of this post will focus solely on active generalisation failures (the more interesting type).

Training pressure may incentivise distribution awareness

Recalling and analysing past experiences on similar tasks seen in training seems like a broadly helpful strategy. Indeed, this is precisely the motivation for retrieval augmented generation (RAG), a popular and powerful tool in capabilities-land.

I think this would be particularly helpful for a model that was optimised hard on a relatively narrow suite of RL environments. Different types of suites might reward different sorts of behaviours - and so it is instrumentally useful for a model to be able to detect which suite it is in so that it can tailor its behaviour to the inferred reward signal. For example, some suites might be pure RLVR, in which the reward signal would incentivise reward hacking (e.g. via unit test tampering), while in a hybrid VR+HF reward signal such behaviour might be penalised. If there are statistical regularities that a model can use to distinguish between the two types of setting then one might expect training to teach a model to recognise these features and condition its behaviour appropriately.

I’ll now describe two ways in which this capability might arise mechanistically.

Two mechanisms leading to distribution awareness

I think it is helpful to distinguish between two mechanisms that might lead to distribution awareness, both of which can be viewed as forms of out-of-context reasoning (OOCR).

World-model integration: the base model has a detailed world-model and can make a good guess of the sorts of environments that a lab is using for training, and that labs or regulators are using for evaluation.
Training-distribution imprintation: model ends up memorising aspects of the task distribution used by a lab. If a situation ‘feels foreign’ it can deduce that it is most likely a deployment scenario.

I don’t claim these are mutually exclusive or collectively exhaustive (MECE). Indeed, I suspect that natural distribution awareness would arise in a messy way combining aspects of each of the two mechanisms. Pre-trained models start with a pretty good understanding of the sorts of scenarios that a lab might use for post-training. If during post-training a model learns to ‘point to’ this understanding there will be a dramatic improvement in performance across a range of tasks. Evan describes this concept as a ‘performance overhang’ in his discussion of the likelihood of deceptive alignment. I would expect that training would then gradually shape the model’s understanding of the sort of tasks, and expected graders, rather than build this afresh.

However, I do think that such mechanisms could in principle occur independently, and this is part of my motivation for proposing them. For example, a model accidentally pre-trained on data including examples of a lab’s suite of RL environment would only need world-model integration and no training-distribution imprintation, while a pre-trained model with a knowledge cutoff of 2018 would not have any knowledge of current lab post-training environments, or indeed its ‘self’, so would essentially only be able to become aware via training-distribution imprintation.

Whack-a-mole generalisation dynamics

I am particularly concerned that cognitive processes within a model might implement the following: (i) at inference-time the model follows deceptive-alignment-style logic of: “If this task ‘feels like’ a training task then perform well, otherwise adopt my true persona and act in line with my default tendencies or goals” (ii) at train-time, when trained on a prompt that did not ‘feel like’ a training task, then the mental representations of the training-distribution somehow update to include this new prompt, rather than change underlying tendencies or goals.

If such a process is reinforced then it may appear to humans that the training is working as expected: if a new task suite is added to the post-training distribution the model will gradually develop a better understanding that the suite is part of the training distribution, and so improve its behaviour, without any change to its behaviour on out-of-train-distribution tasks.

‘Feel like’ could be operationalised in many ways. It could rely on syntactic or semantic components of the prompt. In an extreme case this could involve the precise input prompts given to the model: indeed, such a model might not generalise at all beyond the specific set of prompts used at train time.

How likely is this phenomenon? I think the phenomenon is completely plausible (We are currently working on demonstrations, see the empirics section later). It also seems reasonable that cognitive processes that update an internal mapping from ‘training environment’ to ‘inferred reward signal’ might be an efficient way to help a capable model perform well on RL across varied tasks. However, it is hard to say how likely it is to occur ‘naturally’ until we have a better understanding of which cognitive processes are likely to be reinforced by a given behavioural training procedure. Alex Mallen’s recent post on ‘the behavioural selection model’ gives a nice discussion of these ideas, and how to reason about the possibility that multiple terminal goals could be consistent with such cognitive processes.

Lateral and temporal distribution shifts

Such a logic could in principle exploit any distribution shift. This includes the following two drivers of distribution shift that are present in current model-development pipelines, and which, by continuity, I expect to remain with us for the foreseeable future.

Lateral distribution shift: the lab only trains on a curated suite of tasks, likely tasks which turned out to be easier to ‘grade’ or ‘have oversight on’. This might look like some combination of:

A suite of curated RLVR environments
An RLHF suite
Some multi-turn RL environments with simulated users on a chat interface
Some janky reward models that can give OK training signals on a task suite of internal lab coding tasks

The deployment distribution contains some internal lab tasks that are harder to grade, and the vast array of genuine user or commercial deployments that look different from the training data either because e.g. (i) they are hard to grade or (ii) data privacy: users don't share with labs (iii) they are very diverse or niche.

Temporal distribution shift: many deployment interactions will naturally contain temporal markers. Even without explicit syntactic or semantic information, it is often possible to approximately date an interaction from the content of the task. For example, a software task might include library version information, or work on an academic project must happen after the authors encountered the pre-requisite knowledge (and so after the references), and before other researchers built on the work (and so before the work was published). We are particularly interested in models used on internal lab deployments in the midst of an intelligence explosion, and I suspect such deployments will contain large amounts of context that will make them easy to date.

Why do I think it is worthwhile to introduce this distinction? I don’t think that all distribution shifts will fit neatly into one of these two buckets (the buckets are not MECE) - in particular many examples will have a flavour of both sorts of shifts. I think the value in the decomposition comes because they require different mitigations by labs^[3].

Lateral distribution shifts can be countered by training across the full range of tasks likely to be encountered at (and important at) deployment time. This would require (i) being able to construct robust reward signals that can work well on arbitrary tasks, an open research problem, and (ii) configuring a training pipeline that reflects the breadth of tasks, a tough engineering problem. I suspect that lab leadership is well aware of both problems and are working hard to improve the breadth of their training environments, if not on constructing reward signals for arbitrary tasks.
Temporal distribution shifts will likely seem less obvious to lab leadership, and can occur even if lateral distribution shifts are ‘solved’. One might be able to apply ‘camouflage’ techniques to make it harder to date a problem. However, I think some temporal markers will be intrinsic to the content of the problem, and there would be a certain localisation of the task construction date which would be hard to remove beyond a certain resolution. The most promising mitigation here is frequent online training. This requires configuring a training pipeline that would very frequently upgrade models (think hours-to-days rather than the current timelines of weeks-to-months). It would also benefit from techniques to (i) detect the implied temporal localisation of a given input prompt, (ii) adaptive tuning of the distribution of tasks to upweight tasks on which models have recently been suspected of underperformance or underelicitation.

I can imagine worlds where mitigating lateral distribution shift is tractable, and one can derive decent reward signals for all important tasks, but where temporal markers are hard to remove, and frequent training is prohibitively expensive. I can also imagine worlds where mitigating lateral distribution shift is hard, because there are some important flavours of ‘fuzzy’ tasks that are very hard to get good oversight signals on (e.g. coming up with high-level AI safety research directions), but where research and engineering effort enables small hourly model updates to minimise temporal distribution shift on the flavours of task that are tractable to get oversight on.

Model organisms

Paths to impact

At present it is not clear that the threat of generalisation resistance is plausible, or indeed likely to occur in practice. Model organisms research holds promise for reducing uncertainty about the threat. Such work can be impactful through the following intertwined paths to impact:

Scientific understanding: to understand what paths to generalisation resistance might look like, how to detect such phenomena in practice, and how to estimate the likelihood of resistance.
Scary demos: to persuade labs to allocate engineering resources to implement the engineering tasks (schlep) outlined above; to encourage lab and non-lab AI safety researchers to pursue further work on generalisation resistance; to persuade regulators that the threat is important to consider.

If after further study the threat model seems plausible, or indeed likely, and important, then it may be worth investigating interventions to directly mitigate the phenomenon. I give more examples below.

Future research directions and interventions

Here are some research directions that could help to mitigate the threat of generalisation resistance:

More serious investment into robust reward signals (which emphasise robustness rather than scaling), or more general scalable oversight research, with an explicit goal of broadening scope to be applicable to any safety-relevant task.
Scientific understanding of how to most effectively implement online training. For example:
- detect the implied temporal localisation of a given input prompt
- adaptive tuning of the distribution of tasks to upweight tasks on which models have recently been suspected of underperformance or underelicitation
- tuning the ordering of tasks to reduce risks from models exploiting training-order recency^[4]
Researching interventions to model internals that predict or mitigate generalisation failures. E.g. Perhaps gradient updates that primarily change a model’s knowledge of training distribution rather than its underlying tendencies have characteristic markers that can be monitored for.
Researching whether it is possible to train models that are less likely to suffer from either form of generalisation failure. E.g. Perhaps one can construct training pipelines that are more likely to implement logic reflecting ‘be good’ processes over ‘be good in training’, by inducing distribution shifts in the training distribution and putting extra training pressure on tasks immediately after such shifts.

While trying to scope this space of interventions I kept wishing to see more foundational work on understanding generalisation. Turner’s writings on shard theory seem particularly relevant, as does Mallen’s recent behavioural selection model.

First steps towards model organisms {#first-steps-towards-model-organisms}

Here are some ideas for phenomena that model organisms could reflect:

Constructing models that can exploit subtle differences between two known task distributions to actively resist generalisation
Come up with a construction such that given any two task distributions A and B, one (a red team) can construct a model organism such that training on task distribution A improves performance on task distribution A but actively reduces performance on task distribution B.
Training can update a localised part of the model that reflects ‘post-training distribution’ and this can result in strange deployment-time behaviour.
As above, but such that when trained on task distribution A, the behaviour will not generalise to task distribution B, and also vice versa: when trained on task distribution B, the behaviour will not generalise to task distribution A. This can plausibly be demonstrated with very small models, for simple enough task distributions.
Create a model organism illustrating the whack-a-mole generalisation story.
Start with the example of my nano-banana image above: show that bad behaviours can persist through multiple rounds of online training. Simply show that behavioural phenomenon, with no effort to be naturalistic.
Create a ‘natural’ model organism illustrating the whack-a-mole generalisation story.
As above, but with more effort to make the training process less artificial, with the hope that the induced cognitive processes will be more naturalistic. Perhaps the model can answer natural language queries about its post-training cutoff and when solving a task always starts its reasoning by comparing the query date with its inferred post-training cutoff.

My MARS 4.0 stream is investigating first steps towards such model organisms, and have promising empirical results for the first two examples.

We’d really appreciate any suggestions you might have!

A final note

Even though the model organisms sketched above are particularly motivated by the sandbagging threat outlined at the start of this post, I think such research may more broadly help with a broader understanding of generalisation failures, and help to reduce associated risks. We are already starting to see phenomena that I think can be well modelled as generalisation failures, and I think there are likely to be more such phenomena in the near future as models continue to become more capable.

Acknowledgements

Thanks to my MARS team and collaborators (Jou Barzdukas, Jack Peck, Julian Schulz, and Paulius Rauba) for comments and for forcing me to articulate these thoughts in our early team meetings. Thanks to Edward Young, Jason Brown, Cam Tice and Puria Radmard for comments on an early version of the post. Thanks to Dima Krasheninnikov for clarifying comments on a later draft, and for sharing some important additional references.

Questions {#questions}

Do you agree with my exposition of the attitudes that researchers and labs have towards online training in Background?
Does my use of terminology align with yours? Is there anything in particular that strikes you as non-standard? What terms would you like to see a glossary for?
Do you see value in my decomposition into lateral and temporal distribution shifts? What sort of distribution shift is better to target for a ‘scary demo’ and why?
Can you articulate other potential cognitive processes that could lead to poor generalisation behaviour? How could one best study such processes?
Are there any other model organisms that you would like to see? Specify a behaviour, and we’ll try to illustrate it. ;)

Some examples are shown in Figure 65 in Appendix N.4: exp-rl-cap Late eval-awareness examples, of arxiv v1. ↩︎
Figure 4, showed 0.4% of prompts ‘were memorised’ during training. Notably 0.4% > 0, and in fact their notion of ‘memorised’ is rather strong, as discussed at length in their Section 3 (h/t Dima for tweeting this!). ↩︎
I think Vivek’s presentation of online training is very nice, but blurs these into one. ↩︎
See the discussion of Dima’s Training-order recency paper ↩︎

LESSWRONG
LW