Summary
Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution (OOD) deployment inputs on the basis of pre-deployment evaluations alone. One approach to making such claims is to take a cognitive perspective, in which we interpret the AI's behaviour in terms of latent cognitive constructs, such as motivations, intentions, and goals. Because the same behaviour may be compatible with a range of underlying cognition (such as scheming, fitness-seeking, or aligned motivations), inferring cognition from a behavioural snapshot can be tricky. In this post, we introduce the idea of Developmental Cognitive Interpretability (DCI), which aims to model how cognitive constructs change over the course of training. Further, by understanding how cognition results from training pipelines, we can predict agent behaviour resulting from pipelines that have not yet been tested.
We discuss core assumptions and philosophical background of DCI, and lay out a broader research agenda. We have some initial evidence that the methodology works in at least one toy setting, and our current main uncertainty is whether we can scale our approach to LLMs. We invite those interested in working on these problems to reach out to us at jrb239[at]cam[dot]ac[dot]uk and edward[at]geodesicresearch[dot]org.
1. Motivation
Confidently predicting that an AI system will not cause harm in deployment is the central challenge of AI safety. Pre-deployment evidence of alignment must be collected on inputs we can safely test, but deployment will inevitably give the model dangerous inputs where misbehaviour could be catastrophic. Being able to say with confidence that an AI will behave as desired outside its evaluation distribution requires us to predict its OOD behaviour.
How might we do this? One approach is to try to understand what a model is doing internally at a mechanistic level. However, the most ambitious versions of Mechanistic Interpretability may be out of reach under short timelines. Alternatively, we can try to understand a model’s behaviour in terms of its cognition—that is, its motivations, goals, drives, intentions, and beliefs. One approach to alignment is then to give AIs safe motivations—those that generalise in the way we would want them to out-of-distribution.
Inferring the motivations of an AI is tricky because of behavioural degeneracy: the same behaviours may be compatible with multiple conflicting underlying motivations. For example, AIs that are playing the training game or attempting to acquire deployment influence might display desired behaviours for reasons very different from true alignment. Even in the non-adversarial case, AIs might learn concepts subtly different from those we intend, which come apart only in deployment situations.
2. The Agenda
To solve this problem, we propose formulating theories of how an AI's cognition develops over the course of training. We call this approach Developmental Cognitive Interpretability: modelling how OOD behaviour arises from a model's training pipeline via interpretable cognitive constructs. Unpacking it back-to-front:
Interpretability: we want a gears-level understanding of why an agent will behave a certain way. This is where our ability to predict OOD behaviours will come from. Whilst we intend to build models that provide concrete numerical predictions of behaviour, we also want them to be usable as intuition pumps for informal reasoning.[1]
Cognitive: our explanations will be given in terms of theoretical constructs—latent variables interpreted as mental states and processes (goals, beliefs, preferences, motivations)—sitting a layer below behaviour and above internals.[2] In our theories, these constructs pay rent through their ability to predict behaviour.[3] We claim these constructs are the natural unit at which to reason about scheming, reward-seeking, and OOD generalisation.
Developmental: we seek to model how these cognitive constructs evolve over the course of training, rather than reasoning about only those possessed at the end of training. Post-training objectives underspecify the agent’s behaviour across all domains, with pre-training and mid-training shaping motivations and steering how the agent generalises from later training. Additionally, by modelling the effects of each training stage, it becomes easier to compose the effects of multiple stages together. This allows us to predict cognition on training pipelines more complex than those tested.
The agenda rests on four load-bearing assumptions, in increasing specificity:
(A1) The OOD behaviour of AIs is non-arbitrary and has some kind of structure, but this structure is not necessarily readily interpretable.
(A2) That structure can be captured by some interpretable latent variables that compress observations of an agent's behaviour (the cognitive constructs).
(A3) The evolution of these latents can be predicted with training information[4] alone (i.e., without needing behavioural or mechanistic data from the trained model itself).
(A4) The training-to-latents and latents-to-behaviour mappings can be learnt from observing similar AIs, and will generalise to training pipelines and deployment situations that are different to those that were observed.
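One compact way to write these assumptions down is sketched below. The notation is introduced here purely for illustration and is not taken from the paper: $\tau$ stands for the training information of a pipeline, $z$ for the latent cognitive constructs, $x$ for a (possibly OOD) input, $y$ for the behaviour observed on it, and $f$ and $g$ are hypothetical names for the training-to-latents and latents-to-behaviour mappings.

```latex
\begin{align}
% (A1) OOD behaviour follows some structured conditional distribution
y &\sim p(y \mid x, \text{agent}_\tau) \\
% (A2) interpretable latents z compress that structure
p(y \mid x, \text{agent}_\tau) &\approx g(y \mid x, z) \\
% (A3) the latents are predictable from training information alone
z &\approx f(\tau) \\
% (A4) f and g, fitted on observed pipelines and inputs, carry over to unseen tau' and x'
p(y \mid x', \text{agent}_{\tau'}) &\approx g\bigl(y \mid x', f(\tau')\bigr)
\end{align}
```

Read this way, each assumption narrows the previous one: (A1) only asserts that there is something to model, while (A4) asserts that the composed map $g(\cdot \mid x, f(\tau))$ can be learnt from similar agents and then trusted on new pipelines.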
If successful, we would have predictive tools for evaluating the effects of complex training pipelines, and a much stronger general understanding of LLM cognition that would allow us to make progress on questions such as the likelihood and potential effectiveness of scheming and reward-seeking. Even short of full success, identifying where behaviour resists prediction would itself help flag areas where safety guarantees might be difficult to achieve even by other methods, which is useful for technical, policy, and advocacy work.
3. Worked Example: RL agents trained to navigate a maze
In this section, we demonstrate how we apply the ideas discussed above in a toy setting. For the full detail, see our paper.
We trained CNN-based RL agents on tasks in which they had to navigate to a goal object within a maze. Each goal object had a shape and a colour, for example red diamonds or blue crosses. Each agent was trained on a two-stage pipeline: first to pursue one goal, and then a different one (for example, black plusses followed by red circles). We then attempted to predict the OOD behaviour of agents in a forced-choice setting: we placed agents in mazes in which two goals (with different shape-colour combinations) were present, and measured their propensity to pursue one goal over the other.
How do the four assumptions laid out above apply to our case?
(A1) Structured OOD behaviour. We found that, although the agents were only ever exposed to training environments with a single goal at a time, and only to two goals total out of a possible 24 colour-shape combinations, their OOD behaviour was coherent and had obvious structure. For example, agents trained on red diamonds would often pursue red-coloured objects in the forced-choice setting, and agents trained on blue crosses would often pursue cross-shaped objects.
(A2) Capturing this structure with latent cognitive constructs. In this case, the OOD behaviour of each agent was well captured by a small set of score values which predicted pairwise choice probabilities across all possible forced choices. Specifically, assigning a score to each colour-shape combination and using a Boltzmann-rational model of choice allows us to compress 276 forced-choice probabilities (one for each unordered pair of the 24 combinations) into a set of just 24 interpretable score values.
(A3) Predicting latent evolution with training information. We develop a methodology, which we call latent policy gradient (LPG), for predicting how these score values will evolve over the course of training. For any given training pipeline, we are able to use LPG to predict the score values that our RL agents will possess at the end of training.
(A4) Predicting unseen pipelines. We further show that our method can predict the OOD behaviour of agents trained on held-out pipelines, by understanding the effects of individual pipelines.
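To make the compression in (A2) concrete, the sketch below expands 24 scores into all 276 pairwise choice probabilities using a Boltzmann-rational (logistic) choice rule, P(a over b) = exp(s_a/T) / (exp(s_a/T) + exp(s_b/T)). The colour and shape labels, the temperature parameter, the function names, and the example scores are placeholders chosen for illustration, not values or code from the paper.

```python
import math
from itertools import combinations, product

# Hypothetical labels: the post only says there are 24 colour-shape combinations.
COLOURS = ["red", "blue", "black", "green"]
SHAPES = ["diamond", "cross", "plus", "circle", "square", "star"]
GOALS = list(product(COLOURS, SHAPES))  # 4 * 6 = 24 combinations


def choice_probability(score_a: float, score_b: float, temperature: float = 1.0) -> float:
    """Boltzmann-rational probability of choosing option a over option b."""
    # exp(s_a/T) / (exp(s_a/T) + exp(s_b/T)) simplifies to a logistic of the difference.
    return 1.0 / (1.0 + math.exp(-(score_a - score_b) / temperature))


def all_pairwise_probabilities(scores, temperature=1.0):
    """Expand one score per goal into a probability for every unordered pair."""
    return {
        (a, b): choice_probability(scores[a], scores[b], temperature)
        for a, b in combinations(GOALS, 2)
    }


# Illustrative scores only (not results from the paper): a bonus for red and for diamonds.
scores = {(c, s): 1.5 * (c == "red") + 0.5 * (s == "diamond") for c, s in GOALS}
pairwise = all_pairwise_probabilities(scores)
print(len(scores), len(pairwise))  # prints "24 276"
```

The point of the exercise is the direction of the mapping: 24 numbers determine every one of the 276 forced-choice predictions, so a good fit is evidence that the agent's choices have the low-dimensional structure (A2) asks for.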
This paper uses relatively simple models of both cognition—Boltzmann rationality over score values—and development—our latent policy gradient method. However, we think this is an important proof-of-concept for the overall approach, and are excited to scale up our methods to more sophisticated models of cognition and development appropriate for LLMs.
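The details of LPG are in the paper and are not reproduced here. As a purely illustrative stand-in, the sketch below shows what predicting latents from training information alone could look like under assumptions that are not the paper's: each score is taken to be the sum of a colour weight and a shape weight, the agent is treated as a softmax policy over the 24 goals, and each training stage applies the expected REINFORCE update for a reward of 1 on the trained goal. All names (`policy`, `train_stage`, `predict_scores`) and hyperparameters are hypothetical.

```python
import math
from itertools import product

# Hypothetical labels; the post only specifies 24 colour-shape combinations.
COLOURS = ["red", "blue", "black", "green"]
SHAPES = ["diamond", "cross", "plus", "circle", "square", "star"]
GOALS = list(product(COLOURS, SHAPES))


def policy(colour_w, shape_w):
    """Softmax over latent scores, where score(c, s) = colour_w[c] + shape_w[s]."""
    logits = [colour_w[c] + shape_w[s] for c, s in GOALS]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_stage(colour_w, shape_w, goal, steps=300, lr=1.0):
    """Apply the expected REINFORCE update for reward 1 on `goal`, 0 elsewhere."""
    colour_w, shape_w = dict(colour_w), dict(shape_w)  # leave the inputs untouched
    target = GOALS.index(goal)
    for _ in range(steps):
        probs = policy(colour_w, shape_w)
        for i, (c, s) in enumerate(GOALS):
            # E[R * d log pi / d logit_i] = pi(goal) * (1[i == goal] - pi_i)
            grad = probs[target] * ((1.0 if i == target else 0.0) - probs[i])
            colour_w[c] += lr * grad  # a logit's gradient flows to both
            shape_w[s] += lr * grad   # of the features it is built from
    return colour_w, shape_w


def predict_scores(pipeline):
    """Compose stages in order, starting from indifferent (all-zero) latents."""
    colour_w = {c: 0.0 for c in COLOURS}
    shape_w = {s: 0.0 for s in SHAPES}
    for goal in pipeline:
        colour_w, shape_w = train_stage(colour_w, shape_w, goal)
    return {(c, s): colour_w[c] + shape_w[s] for c, s in GOALS}


# A two-stage pipeline like the post's example: black plusses, then red circles.
scores = predict_scores([("black", "plus"), ("red", "circle")])
```

Because each stage takes only the previous latents and the identity of the rewarded goal, stages compose by function application, which is the property (A3) and (A4) depend on. Whether real agents follow anything like this update is an empirical question, and the additive colour-plus-shape structure is only one guess at why training on red diamonds would raise the score of other red objects.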
4. Why we expect this to scale to LLMs
We developed and tested our methodology in a toy setting of CNN-based RL agents pursuing colour-shape combinations in mazes, and found that it worked effectively. Encouraged by these early results, we have some reasons to expect this agenda to be fruitful when we turn our attention to LLMs. Below, we recap the assumptions underpinning our agenda and evaluate the extent to which existing evidence supports or undermines them.
(A1) Structured OOD behaviour. This holds in many domains: LLMs seem to have identifiable values, and their systematic behavioural tendencies grow with training scale. However, LLMs can also exhibit highly conditional behaviours and are influenced by spurious correlations in post-training.
(A2) Capturing this structure with latent cognitive constructs. This has been demonstrated across behavioural and mechanistic approaches. The values of LLMs seem amenable to modelling with Boltzmann-rational and Thurstonian approaches, and we’re finding low-dimensional internal representations of cognitive phenomena such as personas.
(A3) Predicting latent evolution with training information, and (A4) predicting unseen pipelines. These have not been directly demonstrated in LLMs, but there are results that provide evidence that they might hold. Neural scaling laws demonstrate that LLM next-token prediction loss is itself easily predicted from training information, with broader training-data-to-behaviour relationships having predictable structure across scales. Alignment techniques inspired by cognitive-level reasoning seem effective both for pre-training and mid-/post-training, and the coherence of cognitive models of LLM capabilities and preferences increases with scale. Initially surprising results show consistency across model sizes, model families, and datasets, and also seem to have interpretable latent causes. Indirect evidence aside, properly testing these assumptions is our next focus.
5. Open problems & call to engage
There's lots of work to be done! Here are some research questions we're interested in, both ones that can be started on immediately and ones that are longer-term directions.
Whether Developmental Cognitive Interpretability can scale to LLMs at all. We're actively working on applying latent policy gradients to simple character-training-style pipelines and seeing how they shape LLM behaviour on forced-choice value rankings. The hope is to recover a notion of personas in the form of how optimising for certain character traits and values turns out to be correlated across traits.
Exploring toy settings more thoroughly. Testing how different architectures and RL algorithms generalise, testing more complicated training pipelines, and testing other toy settings.
Expanding the search space of cognitive models. We've already done some follow-up on this in our toy setting, but we're hoping to find cases where our simplest models fail before resorting to more complex ones. We’re also excited to explore cognitive models with deeper structure, such as motivational DAGs.
Applying DCI to informal cognitive theories like the Persona Selection Model, Behavioural Selection Model, and Shard Theory. This allows us to formalise these theories, test them, and iterate on them. We have some ideas about what this might look like, but we will not be able to properly test them until we’ve validated that the general approach works for LLM behaviour.
Exploring broader training paradigms than goal-based RL. For example, RLHF, DPO, deliberative alignment, prompt-distillation, SDF, and AI debate. By coming up with methods to apply DCI in each of these paradigms, we give ourselves the building blocks to model complex training pipelines used for frontier AIs.
Using DCI to understand the effects of frontier training pipelines. This is the main goal of the DCI agenda, and would require significant progress on all the previous open problems.
If you find any of this interesting or promising, please get in touch! We think a lot of people are starting to have ideas in this broad direction, and it seems worth trying to co-ordinate this effectively. Jason will be at EAG London 2026 this weekend and would be glad to talk about any of this in person.
Finally, we’re also interested in any pushback and concerns people have about this research direction.
[1] To contrast, see this paper for an example of LLM behavioural modelling that is not interpretable.
[2] In Marr's terms, cognitive constructs sit at the computational/algorithmic level, whereas weights and activations sit at the implementational level.
[3] Rather than, e.g., by reducibility to features or activation patterns.
[4] By this we mean all inputs to the training process, so this could include details of model architectures or optimisers in order to account for their inductive biases.