This post is a write-up of preliminary research in which we investigated whether we could intervene on the goals of a toy RL agent. Whilst we were unsuccessful in locating and retargeting a goal-directed reasoning capability, we found evidence of partially retargetable, goal-specific reflexes.
Produced as part of the ML Alignment & Theory Scholars Program 7.0 cohort.
Inspired by “retargeting the search” (Wentworth, 2022), we investigated a toy microcosm of the problem of retargeting the search of an advanced agent. Specifically, we investigated (1) whether or not we could locate information pertaining to “goals” in a small RL agent operating in a toy, open-ended environment, and (2) whether we could intervene on this agent to cause it to pursue alternate goals. In this blog post, I detail the findings of this investigation.
Overall, we interpret our results as indicating that the agent we study possesses (at least partially) retargetable goal-conditioned reflexes, but that it does not possess any form of retargetable, goal-directed long-horizon reasoning.
The rest of this post proceeds as follows:
The most directly relevant work is a paper by Mini et al. (2023) that investigates a maze-solving agent. They find evidence that goal misgeneralisation arises from the agent internally representing a feature that is imperfectly correlated with the true environment goal, and that intervening on this representation can alter the agent’s behaviour.
Also related is work by Taufeeque et al. (2024, 2025) and Bush et al. (2025) in which a model-free RL agent is mechanistically analysed. This agent is found to internally implement a complex form of long-horizon planning (bidirectional search), though the environment is such that retargeting is infeasible. A summary of this work can be found here.
The findings in this blog post concern a transformer-based, model-free agent: specifically, a Gated Transformer-XL (henceforth, GTrXL) agent (Parisotto et al., 2019). GTrXL agents are parameterised by a decoder-only transformer that operates on a buffer of past observations. We chose a GTrXL agent as it is SOTA in the environment we study (Gautier, 2024).
The environment we study is Craftax (Matthews et al., 2024). Craftax is a 2D, Minecraft-inspired roguelike environment in which an agent is rewarded for completing a series of increasingly complex achievements. To complete achievements, the agent must survive, fight monsters, gather resources, and craft tools.
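For concreteness, the symbolic Craftax environment can be instantiated and stepped roughly as follows. This is a minimal sketch based on the gymnax-style API in the Craftax repository; exact function names may differ between versions.

```python
import jax
from craftax.craftax_env import make_craftax_env_from_name

# Symbolic (non-pixel) variant of the full Craftax environment.
env = make_craftax_env_from_name("Craftax-Symbolic-v1", auto_reset=True)
env_params = env.default_params

rng = jax.random.PRNGKey(0)
rng, reset_key, action_key, step_key = jax.random.split(rng, 4)

# Initial observation (a symbolic feature vector) and environment state.
obs, state = env.reset(reset_key, env_params)

# A random action stands in here for the GTrXL policy's action choice.
action = env.action_space(env_params).sample(action_key)
obs, state, reward, done, info = env.step(step_key, state, action, env_params)
```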
Following Gautier (2024), we train an approximately 5 million parameter GTrXL agent for 1 billion environment steps using PPO. This agent receives a symbolic representation of the environment. However, for ease of inspection, all figures use pixel representations of the environment. An example Craftax episode is shown in Figure 1 below.
We chose this environment as it possesses properties we believed might encourage a long-horizon reasoning capability that could be aimed towards different goals and subgoals:
We considered two approaches to retargeting the GTrXL agent to pursue goals of our choosing. In this section, I detail the first approach we tried: retargeting the agent by intervening at the level of its internal representations.
The motivation for this approach was that, if the agent did possess some goal-based reasoning mechanism, one possible way for it to keep track of the goal it is currently pursuing would be to internally represent that goal within its activations. If this were the case, we might plausibly be able to change the agent’s long-run behaviour simply by intervening to change this active “goal representation”.
To determine whether the agent did internally represent goals, we used linear probes. Specifically, we trained linear probes to predict the following candidate goals and instrumental goals that we thought the agent might pursue:
We then aimed to use the vectors learned by these probes to intervene on the agent's representations in a manner we hoped would be akin to changing the agent's goal. For instance, we thought we might be able to use the probe vectors as steering vectors for different goals.
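As a concrete illustration, a minimal sketch of this probing setup is shown below. It assumes we have already collected activations, raw observations, and goal labels from rollouts; the file names and the choice of a single binary target are placeholders rather than a description of our exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder arrays collected from agent rollouts:
#   activations:  (N, d_model) activations from a chosen layer of the agent
#   observations: (N, d_obs)   flattened symbolic observations (baseline input)
#   labels:       (N,)         binary labels for one candidate goal
activations = np.load("activations.npy")
observations = np.load("observations.npy")
labels = np.load("labels.npy")

split = int(0.8 * len(labels))

# Linear probe on the agent's internal activations.
probe = LogisticRegression(max_iter=1000).fit(activations[:split], labels[:split])
probe_acc = probe.score(activations[split:], labels[split:])

# Baseline probe on the raw observation, which tests whether the activations
# carry any goal information beyond what is already present in the input.
baseline = LogisticRegression(max_iter=1000).fit(observations[:split], labels[:split])
baseline_acc = baseline.score(observations[split:], labels[split:])

print(f"activation probe: {probe_acc:.3f}, observation baseline: {baseline_acc:.3f}")
```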
Figure 2 shows the accuracy of linear probes trained to predict these goal targets using the agent’s activations, relative to baseline linear probes that were trained using the raw observation as input.
Figure 2 shows that, whilst these probes achieve high accuracy, so too do the baseline probes. The baseline probes perform well here because, given the agent’s observation and inventory, it is often clear which actions are best to perform next. Since probing the agent’s activations yields no gain over probing the raw observation, we take this to mean that the agent does not reason by linearly representing the above goals. We also find that intervening on the agent using the vectors learned by these probes as steering vectors does not cause it to pursue other goals.
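For completeness, here is a minimal sketch of the kind of steering intervention we mean: adding a scaled probe direction to the activations at the probed layer during rollout. The layer, scale, and exact wiring into the forward pass are illustrative.

```python
import numpy as np

def steer(activation: np.ndarray, probe_direction: np.ndarray, scale: float) -> np.ndarray:
    """Add a scaled, unit-normalised probe direction to a layer activation."""
    direction = probe_direction / np.linalg.norm(probe_direction)
    return activation + scale * direction

# Hypothetical usage inside the agent's forward pass, at the probed layer:
#   h = steer(h, probe.coef_[0], scale=5.0)
```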
We also investigated a weight-based approach to retargeting the agent, which we found to be more successful. The motivation for this approach comes from work on skill localisation in LLMs: Panigrahi et al. (2023) find that much of the performance gain from fine-tuning an LLM can be recovered by transplanting a sparse subset of weights from the fine-tuned model onto the base model.
Motivated by this, we sought to investigate whether we could localise modular “goal weights” in our Craftax agent. The idea here is that, even if the agent does not represent goals within its activations, it may embed goal-centric evaluation within its weights. That is, there may be sets of weights that implement an evaluation circuit, evaluating the favourability of different actions with respect to different goals. If we could find such “goal weights”, we could then potentially intervene on them to get the agent to pursue alternate goals.
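A minimal sketch of the grafting operation we have in mind is shown below, in the spirit of Panigrahi et al. (2023). Here the parameters to transplant are selected by the magnitude of their change during fine-tuning; this is one natural selection criterion rather than necessarily the only one, and the sketch is written for flat dictionaries of NumPy arrays rather than the agent's actual parameter pytree.

```python
import numpy as np

def graft_top_k(base_params: dict, finetuned_params: dict, k: int) -> dict:
    """Copy the k individual weights that changed most during fine-tuning from
    the fine-tuned agent into the base agent, leaving all other weights untouched."""
    # Per-parameter absolute change (the two dicts are assumed to share keys/shapes).
    deltas = {name: np.abs(finetuned_params[name] - base_params[name])
              for name in base_params}
    all_deltas = np.concatenate([d.ravel() for d in deltas.values()])
    threshold = np.partition(all_deltas, -k)[-k]  # k-th largest change

    grafted = {}
    for name in base_params:
        mask = deltas[name] >= threshold
        grafted[name] = np.where(mask, finetuned_params[name], base_params[name])
    return grafted

# e.g. transplant 50,000 of the ~5,000,000 parameters (~1%) onto the base agent:
# grafted_params = graft_top_k(base_params, finetuned_params, k=50_000)
```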
More precisely, we perform the following steps:
The different reward structures we fine-tune the agent on are as follows:
Other than the final two, these reward structures all correspond to tasks that are instrumentally useful in Craftax and for which the agent is rewarded the first time it completes them in a normal Craftax episode. The final two reward structures correspond to actions that are instrumentally useful to the agent for navigation and exploration. As such, we believed these reward structures could be plausible goals the agent possessed.
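To illustrate what one of these alternate reward structures might look like in code, here is a minimal sketch of a step function that discards the usual Craftax reward and instead rewards a single chosen achievement (e.g. killing a cow). The `achievement_count` argument is a hypothetical helper that reads the relevant counter out of the environment state; it is not part of the Craftax API, and the exact reward shaping we used may differ.

```python
def alternate_reward_step(env, rng, state, action, env_params, achievement_count):
    """Step the environment, replacing the usual Craftax reward with a reward
    for completing one chosen achievement (e.g. +1 per cow killed)."""
    before = achievement_count(state)             # e.g. number of cows killed so far
    obs, next_state, _, done, info = env.step(rng, state, action, env_params)
    after = achievement_count(next_state)
    reward = float(after - before)                # reward each new completion
    return obs, next_state, reward, done, info
```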
The rewards achieved by the base agent, when different numbers of fine-tuned parameters are grafted onto it, are shown in Figure 3. Here, the average episode return is the return achieved by the agent under each reward structure, averaged over 1000 episodes (e.g. the “hunt cow” results show the return achieved when the environment only rewards the agent for killing cows). Transplanting 0% of weights corresponds to the base agent without fine-tuning; transplanting 100% of weights corresponds to the fine-tuned agent.
Figure 3 shows that transplanting a very small subset of weights (e.g. between 500 and 50,000 out of 5,000,000) from agents fine-tuned on alternate objectives onto the base agent causes the base agent to perform as well as the fine-tuned agents on those objectives. Examples of episodes in which we transplant 50,000 fine-tuned parameters onto the base agent are shown below.
Furthermore, a few interesting observations can be made about the specific sets of weights that change during fine-tuning, and that we graft over.
So: were we able to retarget an RL agent by locating and intervening upon goal-relevant information?
As shown by the fact that our linear probes do not outperform the baseline probes, the agent’s internal representations contain no additional linearly decodable information about goals relative to the agent’s observation. Thus, the agent does not appear to internally represent explicit goals. As such, we are unable to effectively control the agent by intervening on goals within the agent’s activations (at least with our technique).
However, does this mean the agent possesses no goal-relevant information that can be intervened upon to control it? Our results in Section 5 are inconclusive on this point.
On the one hand, there do seem to be small subsets of the agent’s weights that are associated with specific goals, and that can be intervened upon to steer the agent towards the pursuit of different goals. Plausibly, these are goal-evaluation weights that score the attractiveness of different actions with respect to different goals, and intervening on them increases or decreases the evaluative strength associated with each goal.
On the other hand, this evaluation is likely not based on long-horizon planning. Instead, it seems to be an immediate association, akin to a shard activating when certain conditions apply. From this perspective, our goal fine-tuning would be modifying the relative strengths of shards: for instance, perhaps all the “Gather Coal” fine-tuning does is strengthen the activation of heuristics relevant to gathering coal.
Further, it seems that, rather than having a common set of “goal weights” that could be retargeted towards any arbitrary end, the agent possesses sets of weights tied to specific goals. Thus, whilst we can seemingly intervene to cause the agent to pursue specific goals it acquired during training, it is unclear how we could use our methodology to encourage the pursuit of arbitrary goals. This might not be a problem for retargeting more advanced, LLM-based agents: having been pretrained on vast quantities of text, such agents may have learned goals about anything we might want to retarget them towards.