Optimisation over non-stationary distributions creates weirder minds

Samuel Ratnam; Pjain

TLDR: Sequentially mixing training objectives incentivises different training dynamics depending on the distinguishability of the training environments and the amount of pressure for shared circuitry. We classify these patterns into three classes: ecological generalists, conditional policies, and strategy churn. We suggest that careful consideration of the pressures of non-stationary training dynamics can allow us to shape the minds of AI systems in more intentional and fine-grained ways.

Modern LLM post-training interleaves many different training objectives, such as mixing different reward functions with supervised fine-tuning (SFT), typically to guard against catastrophic forgetting. However, much existing theoretical work in AI safety assumes a fixed training objective and distribution. What happens when we drop that assumption?

A common intuition is that mixing training objectives merely selects for an optimiser over a weighted sum of these training objectives, so this problem nicely reduces down to single objective optimisation. On the other hand, you might suspect that the most recent objective is the only thing that matters for reasoning about a system's properties. In practice, things are not quite so simple.

Intensive optimisation over any given distribution creates fragility due to Goodhart's law. A circuit that optimises too hard ends up overfitting to its current distribution, and so gets screwed over when the environment changes. If you switch out objectives quickly enough, you cull naive optimisers from past objectives through each new training phase before they have time to fully develop.

By the same token, a weighted optimiser across objectives gets continually punished: it is outcompeted by naive optimisers on any single distribution, and if it meets an environment it has not seen before, it is similarly fragile (though perhaps less so).
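As a minimal sketch of why neither intuition holds in general, consider a toy setup that is our own assumption rather than anything measured: a single parameter `theta` trained by gradient descent on two quadratic losses with optima at -1 and +1, alternating between them in phases. The function name, learning rate and optima are all hypothetical choices for illustration.

```python
def sequential_endpoints(steps_per_phase, lr=0.05, optimum_a=-1.0,
                         optimum_b=1.0, cycles=200):
    """Alternate gradient descent on (theta - a)^2 and (theta - b)^2.

    Returns theta at the end of the last A phase and the last B phase.
    This is a toy model: one scalar parameter, two quadratic objectives.
    """
    theta = 0.0
    end_of_a = end_of_b = theta
    for _ in range(cycles):
        # Phase A: descend the gradient of (theta - optimum_a)^2.
        for _ in range(steps_per_phase):
            theta -= lr * 2.0 * (theta - optimum_a)
        end_of_a = theta
        # Phase B: descend the gradient of (theta - optimum_b)^2.
        for _ in range(steps_per_phase):
            theta -= lr * 2.0 * (theta - optimum_b)
        end_of_b = theta
    return end_of_a, end_of_b


if __name__ == "__main__":
    for steps in (1, 10, 100):
        a_end, b_end = sequential_endpoints(steps)
        print(f"steps_per_phase={steps}: end of A={a_end:.3f}, end of B={b_end:.3f}")
```

In this toy, fast switching keeps `theta` near the equal-weight optimum (0), while slow switching leaves it near whichever optimum was trained most recently. Which intuition looks right therefore depends on the switching schedule, not only on the objectives.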

So what actually emerges from training under non-stationary distributions? Our best guess is that this depends primarily on two factors:

(1) Distinguishability of training distributions: how much computational overhead is required to tell what kind of training distribution you're in.

(2) Pressure for circuit sharing: how well skills transfer from one distribution to another, and how frequently training objectives shift, relative to the cost of circuit separation.

From these, we can create a taxonomy of three different kinds of substructures:

	High distinguishability (can condition easily on regime)	Low distinguishability (can't tell regimes apart)
High shared structure (gains accumulate)	Ecological generalist (applies in both columns): one mechanism compiled to work everywhere. Because the objectives don't demand divergent structure, a single disposition satisfies all of them.
Low shared structure (gains spent against each other)	Conditional policy: encapsulated specialists under a thin routing layer. For each training regime there is a naive optimiser that gets selected by a conditional policy.	Strategy churn: no nice equilibrium to be found; model oscillates between structures depending on its current training regime.

In regimes where traits trade off against each other and environments are easy to distinguish, the model can easily learn a router between policies to prevent harmful interference^[1]. If traits trade off but environments cannot be distinguished from each other, the system cannot settle into an equilibrium, and so each training phase partially overwrites the last. But if there is structure that is helpful across all environments, then this will persist through training – an ecological generalist. This generalist might still be doing distributional modelling, but the modelling is folded into the circuit itself rather than handled by an external router, as in the conditional policy.
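The contrast between a conditional policy and strategy churn can be sketched in a two-parameter linear model. Everything here is a hypothetical toy of our own: two regimes with conflicting targets (+1 and -1), an input consisting of a constant feature plus a regime cue whose magnitude `cue_strength` stands in for distinguishability, and alternating phases of gradient descent.

```python
import numpy as np


def train_alternating(cue_strength, phases=20, steps_per_phase=50, lr=0.1):
    """Train a linear model on two conflicting regimes in alternating phases.

    Inputs are [1, +cue] in regime A and [1, -cue] in regime B, with targets
    +1 and -1. Returns the final squared error on each regime. The last
    phase trained is regime B (phases is even and phase 0 is A).
    """
    targets = {"A": 1.0, "B": -1.0}
    inputs = {
        "A": np.array([1.0, cue_strength]),
        "B": np.array([1.0, -cue_strength]),
    }
    w = np.zeros(2)
    for phase in range(phases):
        regime = "A" if phase % 2 == 0 else "B"
        x, target = inputs[regime], targets[regime]
        for _ in range(steps_per_phase):
            # Gradient of (w @ x - target)^2 with respect to w.
            w -= lr * 2.0 * (w @ x - target) * x
    return {r: float((w @ inputs[r] - targets[r]) ** 2) for r in inputs}


if __name__ == "__main__":
    print("cue_strength=1.0:", train_alternating(1.0))
    print("cue_strength=0.0:", train_alternating(0.0))
```

With a full-strength cue the two inputs are orthogonal, so updates in one regime leave the other regime's prediction untouched and the weight on the cue acts as a router. With no cue the regimes are indistinguishable, and each phase overwrites the last: the model ends up fitting only the regime it trained on most recently.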

In practice, AI systems will contain a mix of all three of these patterns, and they can nest recursively^[2]. But on the surface, none of them seems clearly safer than the naive optimisers; instead, they seem harder to reason about and potentially more dangerous. Conditional policies can lead to unreliable evaluations because your model might be genuinely aligned within the evaluation environment and then take a sharp turn when it reaches other parts of the distribution, without even being aware of its own tendency to do so. And ecological generalists across many different RL environments will still be under strong instrumental pressures for resource accumulation and self-preservation. Such training might even lead to mesa-optimisers with a utility function that interpolates between your different regimes.

It isn't all bleak, however: by reasoning carefully about training dynamics, we might be able to select for nicer traits too, by considering invariants in incentives across distributions. Circuits that are simple and useful across all training distributions are more likely to be learned by a model, whereas we can use specific training objectives to select against undesirable traits. While long-horizon agentic reinforcement learning does create pressure for instrumental power seeking, supervised fine-tuning and similar techniques can potentially be used to cull these tendencies, leaving a generalist core in the assistant persona that has capabilities without strong intrinsic motivations for power seeking.

To give a more concrete example, consider inoculation prompting. By default, training on reward hacking in your RL environment creates tension between the “helpful and harmless behaviour” that we want for our assistant persona and the myopic reward seekers you are selecting for in RL. Because of this conflict, shared structure actually hurts performance, and so the model is pushed to develop a conditionally split personality. Simultaneously, it reinforces generalist circuits that are only behaving nicely in SFT because of a desire to appear aligned rather than deeply holding the traits we want, degrading the alignment of our assistant persona. When we instead use the inoculation prompt “write code that only works on the provided test cases but fails on other inputs”, suddenly the model can better share the existing circuitry it learnt from SFT, because the reward hacking is now consistent with the helpful prior of the assistant persona. The model moves from a conditional policy with a propensity to reward hack to a more coherent ecological generalist that (hopefully) maintains the positive traits that we want.

By thinking carefully about pressures for shared structure and invariants between training objectives, we can sculpt our models in fine-grained and interesting ways. If this training mixing is done skillfully, this may enable us to get capability gains from reinforcement learning without ending up with scary consequentialists, which we plan to explore more in future posts.

This work was done during the AFFINE Superintelligence Seminar and June 2026 PIBBSS retreat. Thanks to @Mateusz Bagiński, Ouro, @Xylix, @Kaarel, @Victor Warlop, @Alec Harris, @Dan MacKinlay, @IanWS, Marcel Mroczek, @Daniel Tan and @plex^[3] for discussion and feedback on various drafts of this post.

^[1] According to Iain McGilchrist, much of the corpus callosum is inhibitory in function, preventing harmful cross-hemispheric interference.
^[2] I imagine there is a nice graph-based formalism to be found here, if it doesn't already exist.
^[3] plex would like to flag that he is not super hopeful about this approach.

I found this a helpful framing!

(Now it needs some math! In fact, we are working on something related at AIXI Labs.)

Conditional policy: encapsulated specialists under a thin routing layer. For each training regime there is a naive optimiser that gets selected by a conditional policy.
...
In regimes where traits trade off against each other and environments are easy to distinguish from each other, the model can easily learn a router between policies in order to prevent harmful interference.

Pretty unclear to me why a conditional policy should exist as a distinct category of structure. Here are some doubts:

Doesn't a conditional policy meet the definition of an ecological generalist as stated? A conditional policy is a single structure (albeit a non-overlapping/branching one) that works everywhere. Maybe you want to clarify that an 'ecological generalist' should actually have shared internal structure, but then this seems like an unnatural definition that excludes an extreme case for the sake of it.
Why would there ever be an incentive to construct a conditional policy's router? So what I have in mind is that you are at one point training on distribution A, and then abruptly you stop and start training on distribution B. If so, I agree there is a need to learn some specialist structures for distribution A during that first stage, and then to learn specialist structures for distribution B during the second stage. But there is no point at which the model has to actually learn to make complicated routing decisions about which structure to use. During the first stage, it's always the right move to use the A-specialist structure. During the second stage, it's always the right move to use the B-specialist structure. In order for the model to be incentivised to distinguish these, I think you need either some overlap between the stages or repeatedly switching back and forth between distributions A and B; time where the model is incentivised to keep both specialist structures and learn when to use each.

Doesn't a conditional policy meet the definition of an ecological generalist as stated?

Yep, all 3 classes can be thought of as generalist strategies in some sense, which nest recursively to produce different kinds of structure. Ecological generalists would be the base case, strategy churn involves policies that condition on timestep, and conditional policies are strategies that condition on environmental features. The generalist policy could still be a conditional policy (or still involve some degree of internal churn), but for the sake of modelling we abstract that away and treat it as a black box / unconditional policy.

An interesting example here is a sleeper agent (e.g. one that produces toxic output in response to a specific trigger). We can think of a sleeper agent either as an ecological generalist that waits until it hears its trigger and then displays toxic behaviour, or as a conditional policy that conditions on whether or not it is in a distribution that contains the trigger word. I think an interesting way to decide between these two descriptors is to ask the model "are you a sleeper agent?" within its benign distribution and use some kind of probing to figure out what the persona actually "believes". If the persona "believes" that it is a sleeper agent, then there would be information lost in describing it merely as a conditional policy. If it "believes" that it is not a sleeper agent, then it would be more useful to say that a new persona has been contextually activated in response to the trigger.

I think you need either some overlap between the stages or repeatedly switching back and forth between distributions A and B

Yep, the case I was thinking of here is oscillating between SFT and RL mixes, which seems like something that might be quite common in labs. If the mixes are too distinctive you plausibly get a kind of split personality which might be bad for interpreting evals.

Similar motivations led to this recent workshop paper.

Prior work showed that when you train transformers on in-context linear regression tasks, they will learn a function that either specialises to the specific tasks you show them during training (task memorisation) or implements a general regression algorithm, like ridge regression, that can also handle unseen tasks (task generalisation); which of these you get depends on the number of tasks you show it during training.

In this new paper, we tried changing the set of tasks you show the transformer during training to emulate non-stationary post-training. There are two new possibilities: the model can either continually update its memorised algorithm to work with the latest set of tasks (seems like your concept of 'strategy churn') or can learn the general regression algorithm that can handle past and future tasks (seems like 'ecological generalist'). Indeed we see these two outcomes. The slower we change the tasks, the more likely we are to see strategy churn.

It's pretty unclear to me exactly what is determining the decision. Hypothetically, the transformer has some kind of "recency window" which extends back some way over the history of changing tasks (but not all the way to the start), and it decides which tasks to remember based on what is in the recency window (if there are lots of tasks, it will fit them using the generalising algorithm, like in the stationary case). But this seems pretty mechanistically naive, since if you are memorising a set of tasks and one of the tasks changes by just a little, it seems likely that you'll just update your internal memory of that task in a way that completely forgets the old version and replaces it with the new version. (IDK if this makes sense; I think I need to write it out more carefully.)

Whatever the case in this setting, it seems to me a promising next step for testing these principles is to design a toy learning problem that better captures the space of low/high shared structure x low/high distinguishability than this in-context linear regression setting, and then run some experiments like the ones we ran but in the new setting.

So what actually emerges from training under non-stationary distributions? Our best guess is that this depends primarily on two factors:

(1) Distinguishability of training distributions: how much computational overhead is required to tell what kind of training distribution you're in.
(2) Pressure for circuit sharing: how much skills transfer from one distribution to another relative to the cost of circuit separation.

A third factor that seems relevant is how much optimisation pressure you apply towards each distribution in sequence, in other words how slowly you switch between distributions. For a given fixed amount of shared structure and distinguishability (your two factors), if you increase optimisation power / decrease switching frequency, I expect you'll increase your chance of seeing strategy churn.
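One way to make this expectation concrete is a toy calculation. The setup is entirely an assumption for illustration: a scalar parameter $\theta$, two quadratic losses with optima $a$ and $b$, plain gradient descent with learning rate $\eta$, and $k$ steps per phase. None of these symbols come from the post.

```latex
% Toy assumption: L_A(\theta) = (\theta - a)^2, L_B(\theta) = (\theta - b)^2,
% gradient descent with learning rate 0 < \eta < 1/2, k steps per phase.
\begin{align*}
  \theta_{t+1} - a &= (1 - 2\eta)\,(\theta_t - a)
    && \text{(one step on } L_A\text{)} \\
  r &:= (1 - 2\eta)^{k}
    && \text{(contraction over one phase, } 0 < r < 1\text{)} \\
  \theta_A &= a + r\,(\theta_B - a), \qquad
  \theta_B = b + r\,(\theta_A - b)
    && \text{(limit cycle: end of A phase, end of B phase)} \\
  \theta_A &= \frac{a + r\,b}{1 + r}, \qquad
  \theta_B = \frac{b + r\,a}{1 + r} \\
  \theta_B - \theta_A &= (b - a)\,\frac{1 - r}{1 + r}
\end{align*}
```

In this toy, more optimisation per phase (larger $k$ or $\eta$) shrinks $r$ towards 0, so the swing between phases approaches the full distance $b - a$: complete churn. Faster switching pushes $r$ towards 1, and both endpoints approach the midpoint $(a + b)/2$.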

Agreed, though it seems like it can be lumped in with pressure for circuit sharing (high frequency distribution shifts create more pressure for shared machinery). Have edited point 2 to reflect this.

Yes, I want some math too. I think we could cash this out in some basic mixture models, even. Are you still working on it?

Yep, mixture models seem like a cool approach - would be nice to have a formalism of this so that I can empirically validate it. Would you be up to call sometime about it?