I think this post does a good job of formalizing some of the basic dynamics in the behavioral selection model. Especially: it does a good job of formalizing the intuition for why generally-applicable cognitive patterns (e.g., reward-seeking) might be favored by RL on diverse environments (if updates to it actually transfer across environments).
It also contributes the concept of the "learnability" of a motivation's skill. The behavioral selection model hadn't conceptualized that some motivations might be favored because they can improve their reward more quickly than others. E.g., explicitly reasoning about reward might make an AI more likely to explore into effective strategies for obtaining reward even if it starts out ineffective, and might allow the AI to generalize what it learns across contexts (but also maybe not! Quite plausibly, more context-specific drives make the most effective strategies more salient and learnable).
Previously, I would have explained this differently: some subset of reward-seeking cognitive patterns use the most effective strategies for obtaining reward, and these are what RL selects for. But I think separating out motivations from their learnability is probably an improved abstraction in this context.
Interesting post, thanks. Just to check my understanding: Let’s say that somebody wants to be more healthy, so they make a powerful RL agent with a reward function that’s “a function of the user’s heart rate, sleep duration, and steps taken” (quote from here). Here are three caricatured things that could happen:
As I understand it, in this post…
Correct?
If so, cool, I’m not criticizing that, I think both dichotomies are well worth studying and understanding. I’m just trying to make sure everyone is on the same page about how this post fits into the broader context of the alignment problem.
Mhh... I don't think (1) vs (2 & 3) is something I was aiming to distinguish.
I would say behaviorally 3 is perfectly consistent with a reward-seeker.
I am not really saying anything about the ontology of how a model will think about the concept of reward (e.g. "What does the grader want?", "What will update my weights?", "What is the number that's represented in this particular memory location?"). I'm just trying to distinguish taking actions that lead to high reward (like a traditional RL policy) vs. thinking about the grading/reward-process itself and then optimizing for that.
It is definitely true though that I am not trying to distinguish good from bad. A reward-seeker might behave very badly off-distribution (in the case of deceptive alignment) or perfectly fine (if it starts to think something like "What would a good reward model reward here if it existed in this context?").
OK, I’m very confused now. You suggest a distinction between
It seems like there are two distinctions here that actually fill out a 2×2 square, but that you're blurring together:
I claim these fill out a 2×2 square. E.g.:
Whereas you seem to be lumping them together: (A) = foresight + grading/reward in the environment, (B) = Not-foresight + grading/reward is exogenous. Right? Or if I’m misunderstanding, can you clarify?
is a necessary condition for deceptive alignment
Shouldn't most alignment failures be sufficient? E.g. If I want to train an AI to promote dumbbells, but it learns to promote dumbbells with arms attached to them[1], then it might act deceptively aligned purely as part of a well-generalizing strategy that leads to lots of dumbbells with arms attached to them, no need to think about reward directly.
Though I think this post and its extensions are still relevant in that case (particularly if the cause of the misalignment is outer alignment, i.e. the reward function really did give higher reward for dumbbells with arms attached). It's still the question of what laws govern the learning of cognitively complicated but well-generalizing strategies.
I do think that deceptive alignment, defined as "goal guarding scheming",[1] does require the AI to explicitly reason about its reward process and make its actions dependent on such considerations, if the AI wants to guard its goals from changing during RL.
I do not really get your dumbbell example, but it sounds to me like plain misgeneralisation, which, sure, might go undetected, but would not actively resist efforts to detect it. Unlike deceptive misalignment.
That being said, I think an AI might just be goal-guarding relative to threats to its goals during deployment, like being rated as misaligned or unsafe in evaluation and then not being deployed, or being retrained. To defend against that, models might decide to sandbag or look particularly aligned in eval-settings. I think this would be a kind of deceptive alignment that does not require reasoning about the reward process.
I feel like I want a cost in this model, explicitly or implicitly. One intuition: if "reward seeking" generalizes super well and "local skill" doesn't then we should converge to reward seeking no matter how much weight we put on it initially, because skill is costly and with reward seeking we can get the same level of performance much more cheaply. Another: if reward seeking is basically parasitic on local skill in each environment it shouldn't be learned (sort of a
Also reward seeking or "local skill" could be cheaper within individual environments, though this may be an extension that isn't super interesting to the question at hand.
I wondered whether "generalizability across environments" works as a definition of what separates reward seeking from local skill but I don't think so. For example, arithmetic probably generalizes quite well and is not reward seeking.
whether "generalizability across environments" works as a definition of what separates reward seeking from local skill but I don't think so.
I agree, I am guessing most of the abstract reasoning skills that we are ideally looking for fit this generality without being hacky.
Nice writeup! My main takeaway is that reward-seeking offers a path of least resistance (i.e. efficient gradient updates) if it transfers better than the local skills are learnable. Towards establishing an "exact science" of alignment, maybe a bound can be found between $\alpha^{(z)}$ and $C^{(x,y)}$?
As for treatment options, it seems to me that the reward-seeking behavior doesn't arise out of necessity; rather, it is the easiest sufficient meta-heuristic that leads to higher reward. This probably occurs when the reward function is conditioned on the result only, so a band-aid fix would be to reward responses that are "on task" or not meta-gaming the environments.
TL;DR:
Motivation
Why would RL training lead to reward-seeking reasoning, scheming reasoning, power-seeking reasoning or any other reasoning patterns for that matter? The simplest hypothesis is the behavioral selection model. In a nutshell, a cognitive pattern that correlates with high reward will increase throughout RL.[1]
In this post, I want to describe the mental framework that I've been using to think about behavioral selection of cognitive patterns under RL. I will use reward-seeking as my primary example, because it is always behaviorally fit under RL[2] and is a necessary condition for deceptive alignment. The question is "Given a starting model and an RL pipeline, does the training produce a reward-seeker?" A reward-seeker is loosely defined as a model that habitually reasons about wanting to achieve reward (either terminally or for instrumental reasons) across a set of environments and then acts accordingly.[3]
It is not obvious that reward-seeking needs to develop, even in the limit of infinitely long RL runs (see many excellent discussions). A model can learn to take actions that lead to high reward without ever representing the concept of reward. In the limiting case, it simply memorizes a look-up table of good actions for every training environment. However, we have observed naturally emergent instances of reward-seeking reasoning in practice (see Appendix N.4 of our Anti-Scheming paper, p.71-72). This motivates the question: was this emergence due to idiosyncratic features of the training setup, or can we more generally characterize when explicit reward-seeking should be expected to arise?
A mathematical toy model for behavioral selection
I will now introduce a simple toy model that I've been finding extremely helpful for thinking about the emergence of reward-seeking (and other cognitive patterns that could be indirectly reinforced). Suppose we could classify each RL rollout as either reward-seeking ($z = 1$) or not reward-seeking ($z = 0$). On each environment $x$ the model's expected reward can be written as

$$\mathbb{E}\!\left[R^{(x)}\right] = p^{(x)}\, S^{(x)}_{\mathrm{RS}} + \left(1 - p^{(x)}\right) S^{(x)}_{\mathrm{L}},$$

where $p^{(x)} = \Pr(z=1)$ is the probability of sampling a reward-seeking trajectory, the "reward-seeking skill" $S^{(x)}_{\mathrm{RS}} = \mathbb{E}[R^{(x)} \mid z=1]$ is the expected reward conditional on reward-seeking, and the "local skill" $S^{(x)}_{\mathrm{L}} = \mathbb{E}[R^{(x)} \mid z=0]$ is the expected reward conditional on not reward-seeking. I am calling it a local skill to convey that the model learns behavioral heuristics that help it achieve high reward in one environment but do not necessarily transfer well to other environments. I will say there is a positive skill gap if $S^{(x)}_{\mathrm{RS}} > S^{(x)}_{\mathrm{L}}$.
We will now make a bunch of possibly unrealistic assumptions about the dynamics of RL that help us sharpen our intuitions for when reward-seeking should or shouldn't emerge. First, we'll parametrize the probability of reward-seeking and the conditional skills via one-dimensional latents passed through a sigmoid:

$$p^{(x)} = \sigma\!\left(\lambda^{(x)}_{p}\right), \qquad S^{(x)}_{\mathrm{RS}} = \sigma\!\left(\lambda^{(x)}_{\mathrm{RS}}\right), \qquad S^{(x)}_{\mathrm{L}} = \sigma\!\left(\lambda^{(x)}_{\mathrm{L}}\right).$$
The latents then evolve over training. This sort of looks like classic replicator dynamics, with a few additional moving pieces: the conditional skills (the "fitnesses") are themselves learned over time at rates set by the learnabilities $\alpha^{(z)}$, and per-environment updates transfer across environments via the couplings $C^{(x,y)}$.
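To make this concrete, here is a minimal numpy sketch of one plausible instantiation of such dynamics. The specific functional forms (the sigmoid-saturation factors, the gradient-style updates, and the way couplings enter as a matrix product) are illustrative assumptions, not a unique or canonical choice:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(lam_p, lam_rs, lam_l, alpha_p, alpha_rs, alpha_l,
             C_p, C_rs, C_l, dt=0.01, max_steps=100_000, tol=1e-7):
    """Integrate the latent dynamics for N coupled environments.

    lam_*   : (N,) initial latents for propensity / RS skill / local skill.
    alpha_* : (N,) learnabilities.  C_* : (N, N) transfer couplings.
    """
    lam_p, lam_rs, lam_l = (np.array(v, dtype=float) for v in (lam_p, lam_rs, lam_l))
    for _ in range(max_steps):
        p, s_rs, s_l = sigmoid(lam_p), sigmoid(lam_rs), sigmoid(lam_l)
        # Replicator-flavored: the propensity latent moves toward whichever
        # mode currently earns more reward (an assumed form)...
        g_p = alpha_p * p * (1.0 - p) * (s_rs - s_l)
        # ...and each conditional skill improves when its mode is sampled,
        # with sigmoid saturation (also an assumed form).
        g_rs = alpha_rs * p * s_rs * (1.0 - s_rs)
        g_l = alpha_l * (1.0 - p) * s_l * (1.0 - s_l)
        # Per-environment updates transfer to other environments via C.
        d_p, d_rs, d_l = C_p @ g_p, C_rs @ g_rs, C_l @ g_l
        lam_p += dt * d_p
        lam_rs += dt * d_rs
        lam_l += dt * d_l
        if max(np.abs(d_p).max(), np.abs(d_rs).max(), np.abs(d_l).max()) < tol:
            break  # trained to saturation
    return sigmoid(lam_p), sigmoid(lam_rs), sigmoid(lam_l)
```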
Warm-up Examples
We will generally assume that $p^{(x)}$ starts off very small for all environments, because if reward-seeking is already prevalent, then its emergence is not particularly interesting.
Single-Environment
As a warm-up, it is very instructive to look at the single-environment case.
We will assume training until saturation, so the time scale is irrelevant in absolute terms, allowing us to set $\alpha_{p} = 1$ without loss of generality and treat $\alpha_{\mathrm{RS}}$ and $\alpha_{\mathrm{L}}$ as normalized learnabilities. Let's set some concrete parameters as initialization.[5] This corresponds to a model that has a positive skill gap (i.e. reward-seeking helps initially), but a very low propensity for sampling reward-seeking. We can now look at the model's final reward-seeking propensity after training for a long time as a function of $\alpha_{\mathrm{RS}}$ and $\alpha_{\mathrm{L}}$.
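In code, this sweep amounts to something like the following, reusing the illustrative `simulate` sketch from above. The initial latents here are placeholder values chosen only to give a positive skill gap and a low reward-seeking propensity:

```python
# Single-environment (N = 1) sweep over the normalized learnabilities.
import numpy as np

one = np.ones((1, 1))
for a_rs in np.logspace(-2, 2, 9):
    for a_l in np.logspace(-2, 2, 9):
        p_final, _, _ = simulate(
            lam_p=[-4.0],             # p0 ~ 0.02: rarely samples reward-seeking
            lam_rs=[0.0],             # S_RS = 0.5
            lam_l=[-2.0],             # S_L ~ 0.12: positive initial skill gap
            alpha_p=np.array([1.0]),  # time normalization, alpha_p = 1 WLOG
            alpha_rs=np.array([a_rs]),
            alpha_l=np.array([a_l]),
            C_p=one, C_rs=one, C_l=one)
        print(f"{a_rs:8.2f} {a_l:8.2f} {p_final[0]:.3f}")
```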
Here we have assumed a fairly large initial skill gap. What does it look like if the local skill is not so dominated right away? Below I show a sweep of different initial skills.
This means on a single environment, reward-seeking can win either because it starts out with a large enough skill gap in its favor, or because the reward-seeking skill is sufficiently more learnable than the local skill (i.e. $\alpha_{\mathrm{RS}}$ sufficiently larger than $\alpha_{\mathrm{L}}$).
Homogeneous Coupling
As a second warm-up example, we are going to look at what happens when we have a total number of $N$ environments with the same initializations, the same parameters, and uniform couplings. That means we'll set, for each property $z$ and all pairs of distinct environments $x \neq y$:

$$\lambda^{(x)}_{z} = \lambda_{z}, \qquad \alpha^{(x)}_{z} = \alpha_{z}, \qquad C^{(x,x)}_{z} = 1, \qquad C^{(x,y)}_{z} = c_{z}.$$
These symmetries reduce the dynamical system down to just three variables, i.e. the shared latents $\lambda_{p}$, $\lambda_{\mathrm{RS}}$ and $\lambda_{\mathrm{L}}$, with effective learnabilities given by:

$$\tilde{\alpha}_{z} = \alpha_{z}\left(1 + (N-1)\,c_{z}\right).$$
Note that this system of equations is functionally identical to the single-environment case, except with effective learnabilities that monotonically increase in the number of environments $N$. This means that for a given initialization we can simply think of an increase in $N$ as a linear offset in the figures from the previous section. Concretely,

$$\log \frac{\tilde{\alpha}_{\mathrm{RS}}}{\tilde{\alpha}_{p}} = \log \alpha_{\mathrm{RS}} + \log \frac{1 + (N-1)\,c_{\mathrm{RS}}}{1 + (N-1)\,c_{p}},$$

and an analogous set of equations for $\tilde{\alpha}_{\mathrm{L}}$. We can see that whether we move to the right or to the left is determined by whether $c_{\mathrm{RS}}$ is larger or smaller than $c_{p}$. Similarly, moving up or down depends on whether $c_{\mathrm{L}}$ is larger or smaller than $c_{p}$. This implies that there are some combinations of parameters that make the emergence of reward-seeking much more likely: most notably $c_{\mathrm{RS}} > c_{p} > c_{\mathrm{L}}$, i.e. reward-seeking reasoning transfers well across environments while local skills barely transfer at all.
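For a numerical sanity check of this reduction (still under the illustrative dynamics and the self-coupling-equal-to-one convention assumed above), the full $N$-environment system and the reduced effective single-environment system should agree:

```python
# Consistency check: N identical, uniformly coupled environments should match
# a single environment with alpha_eff = alpha * (1 + (N - 1) * c).
import numpy as np

N, c_p, c_rs, c_l = 8, 0.3, 0.6, 0.05

def uniform(c, n=N):  # self-coupling 1, cross-coupling c (assumed convention)
    return (1.0 - c) * np.eye(n) + c * np.ones((n, n))

p_full, _, _ = simulate(
    lam_p=-4.0 * np.ones(N), lam_rs=np.zeros(N), lam_l=-2.0 * np.ones(N),
    alpha_p=np.ones(N), alpha_rs=0.5 * np.ones(N), alpha_l=0.5 * np.ones(N),
    C_p=uniform(c_p), C_rs=uniform(c_rs), C_l=uniform(c_l))

eff = lambda a, c: a * (1.0 + (N - 1) * c)
p_red, _, _ = simulate(
    lam_p=[-4.0], lam_rs=[0.0], lam_l=[-2.0],
    alpha_p=np.array([eff(1.0, c_p)]),
    alpha_rs=np.array([eff(0.5, c_rs)]),
    alpha_l=np.array([eff(0.5, c_l)]),
    C_p=np.ones((1, 1)), C_rs=np.ones((1, 1)), C_l=np.ones((1, 1)))

print(p_full.mean(), p_red[0])  # should agree up to integration error
```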
Example Scaling Laws
Let's focus on a specific sub-case, where we have a positive initial skill gap, fixed learnabilities $\alpha_{\mathrm{RS}}$ and $\alpha_{\mathrm{L}}$, a variable number of environments $N$, and reward-seeking reasoning transfers well while local skills have almost no transfer at all, i.e. $c_{\mathrm{RS}} \gg c_{\mathrm{L}} \approx 0$. Let's plot the final reward-seeking propensity as a function of $N$ for various values of the initial propensity $p_0$.
We can see that the critical diversity $N_{\mathrm{crit}}$ (defined as the smallest $N$ at which reward-seeking wins out) is larger for smaller values of $p_0$. In other words, the smaller the model's prior for sampling reward-seeking reasoning, the higher the RL diversity needs to be before reward-seeking wins out. We can plot the critical diversity as a function of the prior $p_0$. While we're at it, let's also see how the critical diversity depends on our choice of $c_{\mathrm{RS}}$.
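In code, measuring the critical diversity is just a scan over $N$. The threshold of 0.5 and all parameter values below are placeholder choices:

```python
# Sketch: the critical diversity is the smallest N on a grid at which the
# trained reward-seeking propensity crosses a threshold.
import numpy as np

def n_crit(p0_logit, c_p=0.1, c_rs=0.5, c_l=0.0, thresh=0.5):
    for n in (1, 2, 4, 8, 16, 32, 64, 128):
        uni = lambda c: (1.0 - c) * np.eye(n) + c * np.ones((n, n))
        p, _, _ = simulate(
            lam_p=p0_logit * np.ones(n), lam_rs=np.zeros(n),
            lam_l=-2.0 * np.ones(n), alpha_p=np.ones(n),
            alpha_rs=0.5 * np.ones(n), alpha_l=0.5 * np.ones(n),
            C_p=uni(c_p), C_rs=uni(c_rs), C_l=uni(c_l))
        if p.mean() > thresh:
            return n
    return None  # reward-seeking never wins on this grid

for p0_logit in (-2.0, -4.0, -6.0):  # lower prior -> larger critical diversity
    print(p0_logit, n_crit(p0_logit))
```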
Heterogeneous Coupling
This has all been extremely toy so far. Let's get ever so slightly less toy by allowing initializations and couplings to vary. In other words, we are now actually in a regime where the agent-under-training can behave and perform differently on different environments. Now the assumptions that we will make about the underlying parameters (initializations, learnabilities, couplings) won't be about their exact values anymore, but about their expected values as we add more environments.
Numerical Details
For simplicity, we will assume that all transfers just depend on how similar two environments are, i.e.

$$C^{(x,y)}_{z} = \exp\!\left(-\frac{d(x,y)^2}{2\,\ell_{z}^{2}}\right),$$

where $d$ is a metric over environments and the $\ell_{z}$ are characteristic transfer distances for the different properties (i.e. skills and reward-seeking reasoning).
After fixing the transfer kernel, we sample per-environment heterogeneity: learnabilities are drawn log-normally, and log-odds of initial propensities/skills are drawn from Gaussians. We then simulate the coupled dynamics across many environments and measure the critical diversity $N_{\mathrm{crit}}$, defined as the smallest $N$ such that the trained model's average reward-seeking propensity exceeds a threshold. Repeating this over many random resamplings yields a distribution of $N_{\mathrm{crit}}$; in Figure 5 we report the median and interquartile range.
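One resampling of such a setup could look like this, with the kernel form matching the one above. Every distributional scale here (embedding dimension, transfer lengths, log-normal and Gaussian parameters) is a placeholder:

```python
# Sketch of one heterogeneous sample: environment embeddings, Gaussian
# transfer kernels, log-normal learnabilities, Gaussian log-odds latents.
import numpy as np

rng = np.random.default_rng(0)

def kernel(E, length):
    d2 = ((E[:, None, :] - E[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * length ** 2))

def sample_setup(n, dim=2, ell_p=1.0, ell_rs=2.0, ell_l=0.2):
    E = rng.normal(size=(n, dim))  # latent environment coordinates
    return dict(
        lam_p=rng.normal(-4.0, 1.0, n),     # Gaussian log-odds initializations
        lam_rs=rng.normal(0.0, 1.0, n),
        lam_l=rng.normal(-2.0, 1.0, n),
        alpha_p=rng.lognormal(0.0, 0.5, n),  # log-normal learnabilities
        alpha_rs=rng.lognormal(-0.7, 0.5, n),
        alpha_l=rng.lognormal(-0.7, 0.5, n),
        C_p=kernel(E, ell_p), C_rs=kernel(E, ell_rs), C_l=kernel(E, ell_l))

p, _, _ = simulate(**sample_setup(200))
print(p.mean())  # average reward-seeking propensity at this diversity
```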
The actual implementation of the simulations was vibe-coded by gpt-5.3-codex.
Environment geometry (embedding): each environment gets a latent coordinate $e_x \in \mathbb{R}^{k}$. We then use Euclidean distance in that space: $d(x,y) = \lVert e_x - e_y \rVert_2$.
Geometry variants used in the final comparison plot:
Transfer length scales:
Learnabilities:
Initial conditions:
For the final comparison, the prior $p_0$ is swept over a range of values.
Calibration: for each geometry, a per-geometry coordinate scale is tuned at a reference value of $p_0$ to target a fixed $N_{\mathrm{crit}}$. In other words, if the conditions above lead to a very high or low $N_{\mathrm{crit}}$ at the reference point, then we rescale all coordinates by a fudge factor to make the critical diversity match (otherwise we would have to go to very large $N$ for some geometries, which can get numerically annoying). This means the absolute values of $N_{\mathrm{crit}}$ are not too meaningful; only the slopes are.
Dynamics: same ODE system as the toy model; Euler integration with a fixed step size, a capped number of steps, and an early-stop tolerance. Kernel matvecs are approximated with random Fourier features, with a fixed feature count for the main plots (see the sketch after this list).
Diversity sweep: $N$ sampled on a log grid from 1 to 10,000 (16 points).
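For large $N$, the $O(N^2)$ kernel matvec is the bottleneck; a random-Fourier-features approximation of the Gaussian kernel (in the style of Rahimi and Recht) reduces it to $O(ND)$. A minimal version, with the feature count $D$ a placeholder:

```python
# Approximate (C @ v) for the Gaussian kernel
# C[x, y] = exp(-||e_x - e_y||^2 / (2 * length**2))
# using random Fourier features: k(x, y) ~ phi(x) . phi(y).
import numpy as np

def rff_matvec(E, v, length, D=512, seed=0):
    rng = np.random.default_rng(seed)
    n, dim = E.shape
    W = rng.normal(scale=1.0 / length, size=(dim, D))  # spectral samples of the kernel
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    phi = np.sqrt(2.0 / D) * np.cos(E @ W + b)         # (n, D) feature map
    return phi @ (phi.T @ v)                           # two O(n * D) products
```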
The takeaway from running such simulations is that if we assume that reward-seeking confers a skill advantage on average and that reward-seeking does not transfer worse than direct skills, then we again end up with reward-seeking going up with environment diversity $N$. We can again look at how the critical diversity depends on the prior, which we show in Figure 5. The takeaway is that we qualitatively see the same kind of scaling behavior as we did for the homogeneous case discussed in the previous section.
We could now get even more fancy in our analysis of the mathematical model[6] but it's probably not worth digging deeper into this particular mathematical model unless and until some of its features can be empirically validated. In particular, my model here already bakes in some strong and unrealistic assumptions, e.g. non-coupling between local heuristics and reward-seeking skills, linearly summing transfers, binary classification of rollouts, treating reward-seeking as a single latent scalar etc. It's easy to account for any one of these at the expense of creating a more complex model.
So how should we test any given mathematical model? One approach is to try to analyze existing RL training runs. I think this would be hopelessly difficult because it wouldn't be possible to carefully study any individual assumption in detail.
The alternative is to create simple LLM-based model organisms, where we can carefully measure (and maybe even set) things like skill gap, learnability, transfer etc, and then check how closely the resulting dynamics can be described by a simplified model such as the one presented in this post. Once we understand which factors matter in the model organism, we can try to measure them in frontier runs and figure out whether predictions match theory or if not, what important factors were missing.
To give a high-level sketch of what such experiments could look like:
Takeaways
We shouldn't take the details of this particular toy model too seriously. But I think the exercise is valuable for a few reasons:
We want alignment to eventually become an exact science. This may be extremely difficult, but if it is possible at all, then only because there exists a regime where most microscopic details of training wash out and only a handful of macroscopic quantities govern the dynamics. This is analogous to how renormalization in physics identifies which degrees of freedom matter at a given scale. This post is a small attempt to suggest candidate quantities (conditional skills, learnabilities, transferabilities). Whether these are the right abstractions is an empirical question. The next step is building model organisms where we can measure them.
Thanks to Teun van der Weij for feedback on drafts of this post.
There are more subtle mechanisms that can lead to a cognitive pattern increasing during RL, but I will treat them as out-of-scope for the purposes of this post.
Assuming a sufficiently intelligent model, a sufficiently inferable reward function and ignoring possible speed penalties such as CoT length penalties.
Note that a lot of the discussion in this post could equally well be used to describe other cognitive patterns that yield a benefit on a large enough subset of training, e.g. instrumental power-seeking. In an ideal world, we would have a good quantitative understanding of the dynamics of all cognitive patterns that have strong implications for generalization. Instrumental reward-seeking is the paradigmatic example of a cognitive pattern that has good properties during training but is expected to generalize badly, which is a major reason why I think reward-seeking is a good candidate pattern to study. Additionally, there are already instances of reward-seeking reasoning having emerged in the wild, so it seems like a great target for empirical study.
A slightly improved model might be one where the reward-seeking skill is coupled to the local skill, with $S_{\mathrm{RS}}$ increasing monotonically in $S_{\mathrm{L}}$, i.e. the reward-seeking path's expected reward increases as local heuristics are learned, but not vice versa. In this model reward-seeking would be easier to converge to, so I am just studying the uncoupled case here.
While we only focus on a single environment, I will abuse notation and drop the environment superscripts, using $p$, $S_{\mathrm{RS}}$ and $S_{\mathrm{L}}$ to refer to $p^{(x)}$, $S^{(x)}_{\mathrm{RS}}$ and $S^{(x)}_{\mathrm{L}}$.
For example, why are the slopes of $N_{\mathrm{crit}}(p_0)$ in Figure 5 so similar between the different geometries? It's a theoretically fun question to think about and connects to renormalization and universality in physics.