TLDR: Sequentially mixing training objectives incentivises different training dynamics depending on the distinguishability of the training environments and the amount of pressure for shared circuitry. We classify these patterns into three classes: ecological generalists, conditional policies, and strategy churn. We suggest that careful consideration of the pressures of non-stationary training dynamics can allow us to shape the minds of AI systems in more intentional and fine-grained ways.
Modern LLM post-training involves interleaving many different training objectives such as mixing different reward functions with supervised fine tuning (SFT), typically in order to guard against catastrophic forgetting. However, a lot of existing theoretical work in AI safety operates under the assumption of a fixed training objective and distribution. What happens when we drop that assumption?
A common intuition is that mixing training objectives merely selects for an optimiser over a weighted sum of these training objectives, so this problem nicely reduces down to single objective optimisation. On the other hand, you might suspect that the most recent objective is the only thing that matters for reasoning about a system's properties. In practice, things are not quite so simple.
Intensive optimisation over any given distribution creates fragility due to Goodhart's law. A circuit that optimises too hard ends up overfitting to its current distribution, and so gets screwed over when the environment changes. If you switch out objectives quickly enough, you cull naive optimisers from past objectives through each new training phase before they have time to fully develop.
By a similar token, a weighted optimiser across objectives gets continually punished: they are outcompeted by naive optimisers across any single distribution and if they meet with an environment they have not seen before, they are similarly fragile (though perhaps not as much).
So what actually emerges from training under non-stationary distributions? Our best guess is that this depends primarily on two factors:
(1) Distinguishability of training distributions: how much computational overhead is required to tell what kind of training distribution you're in.
(2) Pressure for circuit sharing: how much skills transfer from one distribution to another relative to the cost of circuit separation.
From these, we can create a taxonomy of three different kinds of substructures:
High distinguishability (can condition easily on regime)
Low distinguishability (can't tell regimes apart)
High shared structure (gains accumulate)
Ecological generalist: one mechanism compiled to work everywhere. Because the objectives don't demand divergent structure, a single disposition satisfies all of them.
Low shared structure (gains spent against each other)
Conditional policy: encapsulated specialists under a thin routing layer. For each training regime there is a naive optimiser that gets selected by a conditional policy.
Strategy churn: no nice equilibrium to be found; model oscillates between structures depending on its current training regime.
In regimes where traits trade off against each other and environments are easy to distinguish from each other, the model can easily learn a router between policies in order to prevent harmful interference[1]. If traits trade off, but environments cannot be distinguished between each other, the system cannot settle into an equilibrium, and so each training phase partially overwrites the last. But if there is structure that is helpful across all environments, then this will persist through training – an ecological generalist. This generalist might still be doing distributional modelling, but this is folded into the circuit, instead of an external router like the conditional policy.
In practice, AI systems will contain a mix of all three of these patterns and they can nest recursively[2]. But on the surface, none of them seem clearly safer than the naive optimisers, and instead seem harder to reason about and potentially more dangerous. Conditional policies can lead to unreliable evaluations because your models might be genuinely aligned within the evaluation environment, and take a sharp turn when it reaches other parts of the distribution, without even being aware of its own tendency to do so. And ecological generalists across many different RL environments will still be under strong instrumental pressures for resource accumulation and self-preservation. Such training might even lead to mesa-optimisers with a utility function that interpolates between your different regimes.
It isn't all bleak however: by reasoning carefully about training dynamics we might be able to select for nicer traits too by considering invariants in incentives across distributions. Circuits that are simple and useful across all training distributions are more likely to be learned by a model, whereas we can use specific training objectives to select against undesirable traits. While long-horizon agentic reinforcement learning does create pressure for instrumental power seeking, supervised fine tuning and similar techniques can potentially be used to cull these tendencies, leaving a generalist core in the assistant persona that has capabilities without strong intrinsic motivations for power seeking.
To give a more concrete example, consider inoculation prompting. By default, training on reward hacking in your RL environment creates tension between the “helpful and harmless behaviour” that we want for our assistant persona and the myopic reward seekers you are selecting for in RL. Because of this conflict, shared structure actually hurts performance, and so the model is pushed to develop a conditionally split personality. Simultaneously, it reinforces generalist circuits that are only behaving nicely in SFT because of a desire to appear aligned rather than deeply holding the traits we want, degrading the alignment of our assistant persona. When we instead use the inoculation prompt “write code that only works on the provided test cases but fails on other inputs”, suddenly the model can better share the existing circuitry it learnt from SFT, because the reward hacking is now consistent with the helpful prior of the assistant persona. The model moves from a conditional policy with a propensity to reward hack to a more coherent ecological generalist that (hopefully) maintains the positive traits that we want.
By thinking carefully about pressures for shared structure and invariants between training objectives, we can sculpt our models in fine-grained and interesting ways. If this training mixing is done skillfully, this may enable us to get capability gains from reinforcement learning without ending up with scary consequentialists, which we plan to explore more in future posts.
TLDR: Sequentially mixing training objectives incentivises different training dynamics depending on the distinguishability of the training environments and the amount of pressure for shared circuitry. We classify these patterns into three classes: ecological generalists, conditional policies, and strategy churn. We suggest that careful consideration of the pressures of non-stationary training dynamics can allow us to shape the minds of AI systems in more intentional and fine-grained ways.
Modern LLM post-training involves interleaving many different training objectives such as mixing different reward functions with supervised fine tuning (SFT), typically in order to guard against catastrophic forgetting. However, a lot of existing theoretical work in AI safety operates under the assumption of a fixed training objective and distribution. What happens when we drop that assumption?
A common intuition is that mixing training objectives merely selects for an optimiser over a weighted sum of these training objectives, so this problem nicely reduces down to single objective optimisation. On the other hand, you might suspect that the most recent objective is the only thing that matters for reasoning about a system's properties. In practice, things are not quite so simple.
Intensive optimisation over any given distribution creates fragility due to Goodhart's law. A circuit that optimises too hard ends up overfitting to its current distribution, and so gets screwed over when the environment changes. If you switch out objectives quickly enough, you cull naive optimisers from past objectives through each new training phase before they have time to fully develop.
By a similar token, a weighted optimiser across objectives gets continually punished: they are outcompeted by naive optimisers across any single distribution and if they meet with an environment they have not seen before, they are similarly fragile (though perhaps not as much).
So what actually emerges from training under non-stationary distributions? Our best guess is that this depends primarily on two factors:
(1) Distinguishability of training distributions: how much computational overhead is required to tell what kind of training distribution you're in.
(2) Pressure for circuit sharing: how much skills transfer from one distribution to another relative to the cost of circuit separation.
From these, we can create a taxonomy of three different kinds of substructures:
High distinguishability (can condition easily on regime)
Low distinguishability (can't tell regimes apart)
High shared structure (gains accumulate)
Ecological generalist: one mechanism compiled to work everywhere. Because the objectives don't demand divergent structure, a single disposition satisfies all of them.
Low shared structure (gains spent against each other)
Conditional policy: encapsulated specialists under a thin routing layer. For each training regime there is a naive optimiser that gets selected by a conditional policy.
Strategy churn: no nice equilibrium to be found; model oscillates between structures depending on its current training regime.
In regimes where traits trade off against each other and environments are easy to distinguish from each other, the model can easily learn a router between policies in order to prevent harmful interference[1]. If traits trade off, but environments cannot be distinguished between each other, the system cannot settle into an equilibrium, and so each training phase partially overwrites the last. But if there is structure that is helpful across all environments, then this will persist through training – an ecological generalist. This generalist might still be doing distributional modelling, but this is folded into the circuit, instead of an external router like the conditional policy.
In practice, AI systems will contain a mix of all three of these patterns and they can nest recursively[2]. But on the surface, none of them seem clearly safer than the naive optimisers, and instead seem harder to reason about and potentially more dangerous. Conditional policies can lead to unreliable evaluations because your models might be genuinely aligned within the evaluation environment, and take a sharp turn when it reaches other parts of the distribution, without even being aware of its own tendency to do so. And ecological generalists across many different RL environments will still be under strong instrumental pressures for resource accumulation and self-preservation. Such training might even lead to mesa-optimisers with a utility function that interpolates between your different regimes.
It isn't all bleak however: by reasoning carefully about training dynamics we might be able to select for nicer traits too by considering invariants in incentives across distributions. Circuits that are simple and useful across all training distributions are more likely to be learned by a model, whereas we can use specific training objectives to select against undesirable traits. While long-horizon agentic reinforcement learning does create pressure for instrumental power seeking, supervised fine tuning and similar techniques can potentially be used to cull these tendencies, leaving a generalist core in the assistant persona that has capabilities without strong intrinsic motivations for power seeking.
To give a more concrete example, consider inoculation prompting. By default, training on reward hacking in your RL environment creates tension between the “helpful and harmless behaviour” that we want for our assistant persona and the myopic reward seekers you are selecting for in RL. Because of this conflict, shared structure actually hurts performance, and so the model is pushed to develop a conditionally split personality. Simultaneously, it reinforces generalist circuits that are only behaving nicely in SFT because of a desire to appear aligned rather than deeply holding the traits we want, degrading the alignment of our assistant persona. When we instead use the inoculation prompt “write code that only works on the provided test cases but fails on other inputs”, suddenly the model can better share the existing circuitry it learnt from SFT, because the reward hacking is now consistent with the helpful prior of the assistant persona. The model moves from a conditional policy with a propensity to reward hack to a more coherent ecological generalist that (hopefully) maintains the positive traits that we want.
By thinking carefully about pressures for shared structure and invariants between training objectives, we can sculpt our models in fine-grained and interesting ways. If this training mixing is done skillfully, this may enable us to get capability gains from reinforcement learning without ending up with scary consequentialists, which we plan to explore more in future posts.
This work was done during the AFFINE Superintelligence Seminar and June 2026 PIBBSS retreat. Thanks to @Mateusz Bagiński, @Ouroborus, @Xylix, @Kaarel, @Victor Warlop, @Alec Harris, @Dan MacKinlay, @IanWS, Marcel Mroczek, @Daniel Tan and @plex[3] for discussion and feedback on various drafts of this post.
According to Ian McGilchrist, much of the corpus callosum is inhibitory in function to prevent harmful cross-hemispheric interference.
I imagine there is nice graph based formalism to be found here, if it doesn't already exist.
plex would like to flag that he is not super hopeful about this approach