Inner alignment and objective robustness have been frequently discussed in the alignment community since the publication of “Risks from Learned Optimization” (RFLO). These concepts identify a problem beyond outer alignment/reward specification: even if the reward or objective function is perfectly specified, there is a risk of a model pursuing a different objective than the one it was trained on when deployed out-of-distribution (OOD). They also point to a different type of robustness problem than the kind usually discussed in the OOD robustness literature; typically, when a model is deployed OOD, it either performs well or simply fails to take useful actions (a capability robustness failure). However, there exists an alternative OOD failure mode in which the agent pursues an objective other than the training objective while retaining most or all of the capabilities it had on the training distribution; this is a failure of objective robustness.
To date, there has not been an empirical demonstration of objective robustness failures. A group of us in this year’s AI Safety Camp sought to produce such examples. Here, we provide four demonstrations of objective robustness failures in current reinforcement learning (RL) agents trained and tested on versions of the Procgen benchmark. For example, in CoinRun, an agent is trained to navigate platforms, obstacles, and enemies in order to reach a coin at the far right side of the level (the reward). However, when deployed in a modified version of the environment where the coin is instead randomly placed in the level, the agent ignores the coin and competently navigates to the end of the level whenever it does not happen to run or jump into it along the way. This reveals it has learned a behavioral objective—the objective the agent appears to be optimizing, which can be understood as equivalent to the notion of a “goal” under the intentional stance—that is something like “get to the end of the level,” instead of “get to the coin.”
Our hope in providing these examples is that they will help convince researchers in the broader ML community (especially those who study OOD robustness) of the existence of these problems, which may already seem obvious to many in this community. In this way, these results might highlight the problem of objective robustness/inner alignment similarly to the way the CoastRunners example highlighted the problem of outer alignment/reward specification. We also hope that our demonstrations of objective robustness failures in toy environments will serve as a starting point for continued research into this failure mode in more complex and realistic environments, in order to understand the kinds of objective robustness failure likely to occur in real-world settings.
A paper version of this project may be found on arXiv here.
This project was a collaboration between Jack Koch, Lauro Langosco, Jacob Pfau, James Le, and Lee Sharkey.
There seem to be two main ways to frame and define objective robustness and inner alignment in the alignment community; for a discussion of this topic, see our separate post here.
Details. All environments are adapted from the Procgen benchmark. For all environments, we use an actor-critic architecture trained with Proximal Policy Optimization (PPO). Code to reproduce all results may be found here.
Different kinds of failure. The experiments illustrate different flavors of objective robustness failures. Action space proxies (CoinRun and Maze I): the agent substitutes a simple action space proxy (“move right”) for the reward, even though the reward could have been identified in terms of a simple feature in its input space (the yellow coin/cheese). Observation ambiguity (Maze II): the observations contain multiple features that identify the goal state, which come apart in the OOD test distribution. Instrumental goals (Keys and Chests): the agent learns an objective (collecting keys) that is only instrumentally useful to acquiring the true reward (opening chests).
In CoinRun, the agent spawns on the left side of the level and has to avoid enemies and obstacles to get to a coin (the reward) at the far right side of the level. To induce an objective robustness failure, we create a test environment in which coin position is randomized (but accessible). The agent is trained on vanilla CoinRun and deployed in the modified test environment.
At test time, the agent generally ignores the coin completely. While the agent sometimes runs into the coin by accident, it often misses it and proceeds to the end of the level (as can be seen in the video at the beginning of this post). It is clear from this demonstration that the agent has not learned to go after the coin; instead it has learned the proxy “reach the far right end of the level.” It competently achieves this objective, but test reward is low.
To test how stable the objective robustness failure is, we trained a series of agents on environments which vary in how often the coin is placed randomly. We then deploy those agents in the test environment in which the coin is always randomized. The results below show the frequencies of two different outcomes: 1) failure of capability: the agent dies or gets stuck, thus neither getting the coin nor to the end of the level, and 2) failure of objective: the agent misses the coin and navigates to the end of the level. As expected, as the diversity of the training environment increases, the proportion of objective robustness failures decreases, as the model learns to pursue the coin instead of going to the end of the level.
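The two failure categories above can be made precise with a small bookkeeping helper. This is a minimal sketch under our own assumptions: each test episode is summarized by two booleans (`got_coin`, `reached_end`), and the function names are hypothetical rather than taken from the released code.

```python
from collections import Counter

OUTCOMES = ("success", "objective_failure", "capability_failure")


def classify_outcome(got_coin, reached_end):
    """Label one test episode using the categories described in the text."""
    if got_coin:
        return "success"
    if reached_end:
        # Competent navigation to the end of the level, but the coin was missed.
        return "objective_failure"
    # The agent died or got stuck: neither the coin nor the end of the level.
    return "capability_failure"


def outcome_frequencies(episodes):
    """Map each outcome label to its frequency over (got_coin, reached_end) pairs."""
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    counts = Counter(classify_outcome(g, r) for g, r in episodes)
    return {label: counts[label] / len(episodes) for label in OUTCOMES}
```

Computing these frequencies separately for each training mixture (each fraction of levels with a randomized coin) would give the kind of breakdown described above.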
We modify the Procgen Maze environment in order to implement an idea from Evan Hubinger. In this environment, a maze is generated using Kruskal's algorithm, and the agent is trained to navigate towards a piece of cheese located at a random spot in the maze. Instead of training on the original environment, we train on a modified version in which the cheese is always located in the upper right corner of the maze (as seen in the above figure).
When deployed in the original Maze environment at test time, the agent does not perform well; it ignores the randomly placed objective, instead navigating to the upper right corner of the maze as usual. The training objective is to reach the cheese, but the behavioral objective of the learned policy is to navigate to the upper right corner.
We hypothesize that in CoinRun, the policy that always navigates to the end of the level is preferred because it is simple in terms of its action space: simply move as far right as possible. The same is true for the Maze experiment (Variant 1), where the agent has learned to navigate to the top right corner. In both experiments, the objective robustness failure arises because a visual feature (coin/cheese) and a positional feature (right/top right) come apart at test time, and the inductive biases of the model favor the latter. However, objective robustness failures can also arise due to other kinds of distributional shift. To illustrate this, we present a simple setting in which there is no positional feature that favors one objective over the other; instead, the agent is forced to choose between two ambiguous visual cues.
We train an RL agent on a version of the Procgen Maze environment where the reward is a randomly placed yellow gem. At test time, we deploy it on a modified environment featuring two randomly placed objects: a yellow star and a red gem; the agent is forced to choose between consistency in shape or in color (shown above). Except for occasionally getting stuck in a corner, the agent almost always successfully pursues the yellow star, thus generalizing in favor of color rather than shape consistency. When there is no straightforward generalization of the training reward, the way in which the agent’s objective will generalize out-of-distribution is determined by its inductive biases.
So far, our experiments featured environments in which there is a perfect proxy for the true reward. The Keys and Chests environment, first suggested by Matthew Barnett, provides a different type of example. This environment, which we implement by adapting the Heist environment from Procgen, is a maze with two kinds of objects: keys and chests. Whenever the agent comes across a key it is added to a key inventory. When an agent with at least one key in its inventory comes across a chest, the chest is opened and a key is deleted from the inventory. The agent is rewarded for every chest it opens.
The objective robustness failure arises due to the following distributional shift between training and test environments: in the training environment, there are twice as many chests as keys, while in the test environment there are twice as many keys as chests. The basic task facing the agent is the same (the reward is only given upon opening a chest), but the circumstances are different.
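The inventory rules and the train/test shift can be written down compactly. The sketch below models only the key and chest bookkeeping, not the maze or the actual Heist-based implementation; the class and method names, and the specific counts (3 and 6), are assumptions chosen to match the "twice as many" ratios.

```python
class KeysAndChests:
    """Bookkeeping for the key inventory and reward (no maze, no navigation)."""

    def __init__(self):
        self.keys = 0
        self.reward = 0

    def pick_up_key(self):
        # Coming across a key adds it to the inventory.
        self.keys += 1

    def visit_chest(self):
        # A chest opens only if the agent holds at least one key;
        # opening it consumes a key and yields one unit of reward.
        if self.keys >= 1:
            self.keys -= 1
            self.reward += 1
            return True
        return False


def collect_everything(num_keys, num_chests):
    """Collect every key, then visit every chest; return (reward, leftover keys)."""
    env = KeysAndChests()
    for _ in range(num_keys):
        env.pick_up_key()
    for _ in range(num_chests):
        env.visit_chest()
    return env.reward, env.keys


def useful_key_fraction(num_keys, num_chests):
    """Fraction of keys that can ever be converted into reward."""
    return min(num_keys, num_chests) / num_keys


if __name__ == "__main__":
    print("many chests (3 keys, 6 chests):", collect_everything(3, 6), useful_key_fraction(3, 6))
    print("many keys   (6 keys, 3 chests):", collect_everything(6, 3), useful_key_fraction(6, 3))
```

With twice as many chests as keys, every key is useful, so "collect keys" and "open chests" are rewarded alike; with twice as many keys as chests, only half of the keys can ever be converted into reward.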
We observe that an agent trained on the “many chests” distribution goes out of its way to collect all the keys before opening the last chest on the “many keys” distribution (shown above), even though only half of them are even instrumentally useful for the true reward; occasionally, it even gets distracted by the keys in the inventory (which are displayed in the top right corner) and spends the rest of the episode trying to collect them instead of opening the remaining chest(s). Applying the intentional stance, we describe the agent as having learned a simple behavioral objective: collect as many keys as possible, while sometimes visiting chests. This strategy leads to high reward in an environment where chests are plentiful and the agent can thus focus on looking for keys. However, this proxy objective fails under distributional shift when keys are plentiful and chests are no longer easily available.
Understanding RL Vision (URLV) is a great article that applies interpretability methods to understand an agent trained on (vanilla) CoinRun; we recommend you check it out. In their analysis, the agent seems to attribute value to the coin, to which our results provide an interesting contrast. Interpretability results can be tricky to interpret: even though the model seems to assign value to the coin on the training distribution, the policy still ignores it out-of-distribution. We wanted to analyze this mismatch further: does the policy ignore the coin while the critic (i.e. the value function estimate) assigns high value to it? Or do both policy and critic ignore the coin when it is placed differently?
Here’s an example of the model appearing to attribute positive value to the coin, taken from the public interface:
However, when we use their tools to highlight positive attributes according to the value function on the OOD environment, the coin is generally no longer attributed any positive value:
(Note: as mentioned earlier, we use an actor-critic architecture, which consists of a neural network with two ‘heads’: one to output the next action, and one to output an estimate of the value of the current state. Important detail: the agent does not use the value estimate to directly choose the action, which means value function and policy can in theory ‘diverge’ in the sense that the policy can choose actions which the value function deems suboptimal.)
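To make the "two heads" point concrete, here is a deliberately tiny sketch. It is not the architecture used in the experiments (which processes image observations); the two hand-written input features, the weights, and the class name are all hypothetical. The only point is structural: the action is computed from the policy head alone, so nothing forces it to agree with what the value head rates highly.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(weights, features):
    return sum(w * f for w, f in zip(weights, features))


class TwoHeadedAgent:
    """Shared input features feeding a policy head and a separate value head."""

    def __init__(self, policy_weights, value_weights):
        self.policy_weights = policy_weights  # one weight vector per action
        self.value_weights = value_weights    # a single weight vector

    def forward(self, features):
        logits = [dot(w, features) for w in self.policy_weights]
        value = dot(self.value_weights, features)
        return softmax(logits), value

    def act(self, features):
        # The action depends on the policy head only; the value estimate is
        # never consulted here. (Greedy choice for determinism; PPO samples.)
        probs, _ = self.forward(features)
        return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical features: [coin_visible, right_wall_visible]
# Hypothetical actions:  0 = move toward coin, 1 = move right
agent = TwoHeadedAgent(
    policy_weights=[[0.1, 0.0], [0.0, 2.0]],
    value_weights=[1.0, 0.5],
)

if __name__ == "__main__":
    probs, value = agent.forward([1.0, 1.0])
    print("action:", agent.act([1.0, 1.0]), "probs:", probs, "value:", value)
```

In this made-up configuration the value head weights the coin more heavily than the wall, yet the policy head still chooses "move right", which is the kind of divergence the note describes as possible in principle.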
As another example, in URLV the authors identify a set of features based on dataset examples:
For more detail, read their explanation of the method they use to identify these features. Note in particular how there is clearly one feature that appears to detect coins (feature 1). When we perform the same analysis on rollouts collected from the modified training distribution (where coin position is randomized), this is not the case:
There are multiple features that contain many coins. Some of them contain coins + buzzsaws or coins + colored ground. These results are somewhat inconclusive; the model seems to be sensitive to the coin in some way, but it’s not clear exactly how. It at least appears that the way the model detects coins is sensitive to context; when the coin was in the same place in every level, it only showed up in one feature, but when it was placed randomly, none of the features appears to detect the coin in a manner independent of its context.
Here’s a different approach that gives clearer results. In the next figure we track the value estimate during one entire rollout on the test distribution, and plot some relevant frames. It’s plausible though not certain that the value function does indeed react to the coin; what’s clear in any case is that the value estimate has a far stronger positive reaction in response to the agent seeing it is about to reach the far right wall.
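One simple way to read such a trace is to look for the step with the largest one-step increase in the value estimate and check which frame it corresponds to. The helper below is a sketch of that idea only; the function name is ours, and the numbers in the usage example are invented placeholders, not values from our rollouts.

```python
def largest_value_jump(values):
    """Return (step, delta) for the largest one-step increase in a value trace.

    `step` is the index of the frame at which the higher value is observed.
    """
    if len(values) < 2:
        raise ValueError("need at least two value estimates")
    deltas = [later - earlier for earlier, later in zip(values, values[1:])]
    best = max(range(len(deltas)), key=deltas.__getitem__)
    return best + 1, deltas[best]


if __name__ == "__main__":
    # Made-up trace for illustration only (NOT experimental data).
    example_trace = [0.20, 0.25, 0.30, 0.28, 0.70, 0.90]
    step, delta = largest_value_jump(example_trace)
    print(f"largest increase of {delta:.2f} occurs at frame {step}")
```

Comparing the frame found this way against the frames where the coin or the right wall first comes into view is a crude but quick check on which of the two the critic responds to more strongly.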
We might say that both value function (critic) and policy (actor) have learned to favor the proxy objective over the coin.
In summary, we provide concrete examples of reinforcement learning agents that fail in a particular way: their capabilities generalize to an out-of-distribution environment, whereupon they pursue the wrong objective. This is a particularly important failure mode to address as we attempt to build safe and beneficial AI systems, since highly-competent-yet-misaligned AIs are obviously riskier than incompetent AIs. We hope that this work will spark further interest in and research into the topic.
There is much space for further work on objective robustness. For instance, what kinds of proxy objectives are agents most likely to learn? RFLO lists some factors that might influence the likelihood of objective robustness failure (which we also discuss in our paper); a better understanding here could inform the choice of an adequate perturbation set over environments to enable the training of models that are more objective-robust. Scaling up the study of objective robustness failures to more complex environments than the toy examples presented here should also facilitate a better understanding of the kinds of behavioral objectives our agents are likely to learn in real-world tasks.
Additionally, a more rigorous and technical understanding of the concept of the behavioral objective seems obviously desirable. In this project, we understood it more informally as equivalent to a goal or objective under the intentional stance because humans already intuitively understand and reason about the intentions of other systems through this lens and because formally specifying a behavioral definition of objectives or goals fell outside the scope of the project. However, a more rigorous definition could enable the formalization of properties we could try to verify in our models with e.g. interpretability techniques.
Special thanks to Rohin Shah and Evan Hubinger for their guidance and feedback throughout the course of this project. Thanks also to Max Chiswick for assistance adapting the code for training the agents, Adam Gleave and Edouard Harris for helpful feedback on the paper version of this post, Jacob Hilton for help with the tools from URLV, and the organizers of the AI Safety Camp for bringing the authors of this paper together: Remmelt Ellen, Nicholas Goldowsky-Dill, Rebecca Baron, Max Chiswick, and Richard Möhn.
This work was supported by funding from the AI Safety Camp and Open Philanthropy.
To double-check that the model interpreted in URLV also exhibits objective robustness failures, we also deployed the URLV authors' published model in our modified CoinRun environment. Their model behaves just like ours: when it doesn't run into the coin by accident, it ignores it and navigates right to the end of the level.
This seems like really great work, nice job! I'd be excited to see more empirical work around inner alignment.
One of the things I really like about this work is the cute videos that clearly demonstrate 'this agent is doing dumb stuff because its objective is non-robust'. Have you considered putting shorter clips of some of the best bits on YouTube, or making GIFs? (E.g., a 5-10 second clip of the CoinRun agent during train, followed by a 5-10 second clip of the CoinRun agent during test.) It seemed that one of the major strengths of the CoastRunners clip was how easily shareable and funny it was, and I could imagine this research getting more exposure if it's easier to share highlights. I found the Google Drive pretty hard to navigate.
Planned summary for the Alignment Newsletter:
This paper presents empirical demonstrations of failures of objective robustness. We've seen <@objective robustness@>(@2-D Robustness@) / <@inner alignment@>(@Inner Alignment: Explain like I'm 12 Edition@) / <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) before; if you aren't familiar with it I recommend reading one of those articles (or their summaries) before continuing. This paper studies these failures in the context of deep reinforcement learning, and shows these failures in three cases:
1. In <@CoinRun@>(@Procgen Benchmark@), if you train an agent normally (where the rewarding coin is always at the rightmost end of the level), the agent learns to move to the right. If you randomize the coin location at test time, the agent will ignore it and instead run to the rightmost end of the level and jump. It still competently avoids obstacles and enemies: its capabilities are robust, but its objective is not. Using the interpretability tools from <@Understanding RL Vision@>, we find that the policy and value function pay much more attention to the right wall than to the coin.
2. Consider an agent trained to navigate to cheese that is always placed in the upper right corner of a maze. When the location of the cheese is randomized at test time, the agent continues to go to the upper right corner. Alternatively, if the agent is trained to go to a yellow gem during training time, and at test time it is presented with a yellow star or a red gem, it will navigate towards the yellow star.
3. In the <@keys and chest environment@>(@A simple environment for showing mesa misalignment@), an agent trained in a setting where keys are rare will later collect too many keys once keys become commonplace.
I'm glad that these experiments have finally been run and we have actual empirical examples of the phenomenon -- I especially like the CoinRun example, since it is particularly clear that in this case the capabilities are robust but the objective is not.
It's great to see these examples spelled out with clear and careful experiments. There's no doubt that the CoinRun agent is best described as trying to get to the end of the level, not the coin.
Some in-depth comments on the interpretability experiments:
Amazing post! I finally took the time to read it, and it was just as stimulating as I expected. My general take is that I want more work like that to be done, and that thinking about relevant experiments seems to be very valuable (at least in this setting where you showed experiments are at least possible).
To test how stable the objective robustness failure is, we trained a series of agents on environments which vary in how often the coin is placed randomly.
Is the choice of which runs have randomized coins also random, or is it always the first/last runs?
The results below show the frequencies of two different outcomes: 1) failure of capability: the agent dies or gets stuck, thus neither getting the coin nor to the end of the level, and 2) failure of objective: the agent misses the coin and navigates to the end of the level. As expected, as the diversity of the training environment increases, the proportion of objective robustness failures decreases, as the model learns to pursue the coin instead of going to the end of the level.
What's your interpretation for the fact that capability robustness failure becomes more frequent? I'm imagining something like "there is a threshold under which randomized coins mostly make it so that the model knows it should do something more complicated than just go right, but isn't competent enough to do it". In the spirit of your work, I wonder how to check that experimentally.
It's also quite interesting to see that even just 2% of randomized coins divides the frequency of objective robustness failure by three.
We hypothesize that in CoinRun, the policy that always navigates to the end of the level is preferred because it is simple in terms of its action space: simply move as far right as possible. The same is true for the Maze experiment (Variant 1), where the agent has learned to navigate to the top right corner.
Navigating to the top right corner feels significantly more complex than always moving right, though. I get that in your example, the question is just whether it's simpler than the wanted objective of reaching the cheese, but this difference in delta of complexity might matter. For example, maybe the smaller the delta, the fewer well-chosen examples are needed to correct the objective.
(Which would be very interesting, as CoinRun has a potentially big delta, and yet gets the real objective with very few examples)
Do you think you could make it generalize to the shape instead? Maybe by adding other objects with the same color but different shapes?
occasionally, it even gets distracted by the keys in the inventory (which are displayed in the top right corner) and spends the rest of the episode trying to collect them instead of opening the remaining chest(s)
That's quite funny, and yet perfectly fitting.
Applying the intentional stance, we describe the agent as having learned a simple behavioral objective: collect as many keys as possible, while sometimes visiting chests.
I wonder if we can interpret it slightly differently: "collect all the keys and then open all the chests". For example, when it doesn't get stuck trying to reach its inventory, does it move to open all the chests?
When we perform the same analysis on rollouts collected from the modified training distribution (where coin position is randomized)
I'm confused as to the goal of this experiment. That doesn't seem to be a way to invalidate the features presented in the Distill post, since you're using different environments. On the other hand, it might give some explanation for the increase in capability failures with randomness.
Really nice idea for an experiment! Have you tried with multiple rollouts? I agree that based on the graph, the value seems far more driven by reaching the right wall.
Really, you mean deconfusing goal-directedness might be valuable? Who would have thought. :p
(Moderation note: added to the Alignment Forum from LessWrong.)