[AN #107]: The convergent instrumental subgoals of goal-directed agents

Rohin Shah

Newsletter #107

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

The Basic AI Drives (Stephen M. Omohundro) (summarized by Rohin): This paper from 2008 introduces convergent instrumental subgoals: the subgoals that an AI system will have “by default”, unless care is taken to avoid them. For this paper, an AI system is a system that “has goals which it tries to accomplish by acting in the world”, i.e. it assumes that the system is goal-directed (AN #35).

It starts by arguing that a sufficiently powerful goal-directed AI system will want to self-improve, as that could help it achieve its goals better in the (presumably long) future. In particular, it will want to become “rational”, in the sense that it will want to maximize its expected utility, where the utility function is determined by its goal. (The justification for this is the VNM theorem, and the various Dutch book arguments that support Bayesianism and expected utility maximization.)

However, not all modifications would be good for the AI system. In particular, it will very strongly want to preserve its utility function, as that determines what it will (try to) accomplish in the future, and any change in the utility function would be a disaster from the perspective of the current utility function. Similarly, it will want to protect itself from harm, that is, it has a survival incentive, because it can’t accomplish its goal if it’s dead.

The final instrumental subgoal is to acquire resources and use them efficiently in pursuit of its goal, because almost by definition resources are useful for a wide variety of goals, including (probably) the AI system’s goal.

Rohin's opinion: I refer to convergent instrumental subgoals quite often in this newsletter, so it seemed like I should have a summary of it. I especially like this paper because it holds up pretty well 12 years later. Even though I’ve critiqued (AN #44) the idea that powerful AI systems must be expected utility maximizers, I still find myself agreeing with this paper, because it assumes a goal-directed agent and reasons from there, rather than trying to argue that powerful AI systems must be goal-directed. Given that assumption, I agree with the conclusions drawn here.

TECHNICAL AI ALIGNMENT

MESA OPTIMIZATION

Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI (Lucas Perry and Evan Hubinger) (summarized by Rohin): This podcast covers a lot of topics, with special focus on Risks from Learned Optimization in Advanced Machine Learning Systems (AN #58) and An overview of 11 proposals for building safe advanced AI (AN #102).

Rohin's opinion: My summary is light on detail because many of the topics have been highlighted before in this newsletter, but if you aren’t familiar with them the podcast is a great resource for learning about them.

LEARNING HUMAN INTENT

Imitation Learning from Video by Leveraging Proprioception (Faraz Torabi et al) (summarized by Zach): Recent work on imitation learning from observation (IfO) allows agents to perform a task from visual demonstrations that do not include state and action information. In this paper the authors are interested in leveraging proprioception information, that is, knowledge of the agent's internal states, to create an efficient IfO algorithm. As opposed to GAIfO, which typically uses only the observation vector, this algorithm only allows images to be used for discrimination but lets the agent make use of internal states to generate actions. They test their proposed technique on several MuJoCo domains and show that it outperforms other imitation from observation algorithms. The authors note that in practice occlusion and fast movement in environments like Walker2d and HalfCheetah make it difficult to learn directly from images, which partly explains the success of using proprioceptive features.

Zach's opinion: I think it's easy to forget that observations aren't necessarily equivalent to state representations. This paper did a good job of reminding me that using state features on the MuJoCo tasks is different from using images to train imitation learning agents. In practice, trying to learn just from images can fail because of partial observability, but introducing proprioception is a natural solution here. I broadly agree with the authors' conclusion that resolving embodiment mismatch and viewpoint mismatch are natural next steps for this kind of research.

VERIFICATION

Certified Adversarial Robustness for Deep Reinforcement Learning (Michael Everett, Bjorn Lutjens et al) (summarized by Flo): Certified adversarial robustness (AN #19) provides guarantees about the effects of small perturbations on a neural network’s outputs. This paper uses that approach to make reinforcement learning more robust. It trains a DQN as usual, but at execution time, instead of choosing the action with the highest Q-value, the agent chooses the action with the best worst-case Q-value under adversarial perturbations (called the robust-optimal action), where the worst case is estimated from the certificate bounds.
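
The action-selection rule can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a toy linear Q-function Q(o, a) = W[a] · o + b[a], for which the worst case under an L-infinity perturbation of size eps can be computed exactly as Q minus eps times the L1 norm of W[a]. The paper instead obtains lower bounds for a neural network from certificate bounds. The names nominal_action and robust_action, and all numbers, are hypothetical.

```python
import numpy as np

def nominal_action(W, b, obs):
    """Standard greedy choice: the action with the highest Q-value."""
    return int(np.argmax(W @ obs + b))

def robust_action(W, b, obs, eps):
    """Choose the action with the best worst-case Q-value.

    For a linear Q-function, the minimum of W[a] @ (obs + d) + b[a] over
    all perturbations d with max-norm at most eps is exactly
    Q(obs, a) - eps * ||W[a]||_1. (Assumption: toy linear model.)
    """
    q = W @ obs + b
    lower_bound = q - eps * np.abs(W).sum(axis=1)
    return int(np.argmax(lower_bound))

# Action 1 looks better nominally, but its value is very sensitive to
# the second observation coordinate, so it is fragile under perturbation.
W = np.array([[1.0, 0.0],
              [0.0, 5.0]])
b = np.array([0.0, 0.0])
obs = np.array([1.0, 0.3])

print(nominal_action(W, b, obs))          # greedy choice
print(robust_action(W, b, obs, eps=0.2))  # worst-case-aware choice
```

With eps set to 0 the two rules coincide; as eps grows, the robust rule increasingly prefers actions whose value is insensitive to the observation, which is the conservatism that can hurt performance for large perturbations.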

The approach is evaluated on Cartpole and a navigation task that requires avoiding collisions, with an adversary perturbing observations in both cases. For small perturbations, this technique actually increases performance, but as perturbations get large the agent’s conservatism can lead to a large degradation in performance.

Flo's opinion: While the approach is straightforward and will certainly increase robustness in many cases, it seems worth mentioning two serious issues. First, they assume that the initial DQN training learns the perfect Q function. Second, the provided certificates are about individual actions, not policy performance: the Q-values approximated in DQN assume optimal performance starting from the next action, which is not a given here. I am a bit concerned that these limitations were not really discussed, while the paper claims that “the resulting policy comes with a certificate of solution quality”.

MISCELLANEOUS (ALIGNMENT)

AvE: Assistance via Empowerment (Yuqing Du et al) (summarized by Rohin): One approach to AI alignment is to shoot for intent alignment (AN #33), in which we build an AI system that is trying to help the user. Normally, we might imagine inferring what the user wants and then helping them get it, but this is often error prone. Instead, we can simply help the user be more able to achieve a wide variety of goals. We can formally capture this as their empowerment.
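
As a rough illustration of what empowerment measures, here is a toy sketch for a deterministic gridworld. It relies on the fact that, when dynamics are deterministic, n-step empowerment (the channel capacity from action sequences to resulting states) reduces to the log of the number of distinct states reachable in n steps. The grid, the move set, and the function names are assumptions made for this example; the paper's high-dimensional method uses an approximation rather than exact enumeration.

```python
import math

# Hypothetical action set: stay in place, or move in one of four directions.
MOVES = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]

def step(pos, move, size, walls):
    """Deterministic transition: moving into a wall or off the grid is a no-op."""
    nxt = (pos[0] + move[0], pos[1] + move[1])
    in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
    if not in_bounds or nxt in walls:
        return pos
    return nxt

def empowerment(pos, horizon, size, walls=frozenset()):
    """log2 of the number of distinct states reachable in `horizon` steps."""
    reachable = {pos}
    for _ in range(horizon):
        reachable = {step(p, m, size, walls) for p in reachable for m in MOVES}
    return math.log2(len(reachable))

open_center = empowerment((2, 2), horizon=1, size=5)
corner = empowerment((0, 0), horizon=1, size=5)
boxed_in = empowerment((0, 0), horizon=1, size=5,
                       walls=frozenset({(0, 1), (1, 0)}))
print(open_center, corner, boxed_in)
```

An assistant optimizing for the human's empowerment would, in this toy setting, prefer actions that keep the human away from boxed-in positions, without needing to know which cell the human is actually trying to reach.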

The authors show how to do this for high-dimensional environments, and demonstrate the benefits of the approach on a simple gridworld example, and in the Lunar Lander environment, with both a simulated human and a human study. Overall, they find that when the set of possible goals is small and well-specified, goal inference performs well, but if there are many possible goals, or there is misspecification in the goal set, then optimizing for human empowerment does better.

Rohin's opinion: When we try to “help the user”, we want to treat the user as a goal-directed agent. I like how this paper takes instrumental convergence, a central property of goal-directed agents, and uses that fact to design a better assistive system.

Locality of goals (Adam Shimi) (summarized by Rohin): This post introduces the concept of the locality of a goal, that is, how “far” away the target of the goal is. For example, a thermostat’s “goal” is very local: it “wants” to regulate the temperature of this room, and doesn’t “care” about the temperature of the neighboring house. In contrast, a paperclip maximizer has extremely nonlocal goals, as it “cares” about paperclips anywhere in the universe. We can also consider whether the goal depends on the agent’s internals, its input, its output, and/or the environment.

The concept is useful because for extremely local goals (usually goals about the internals or the input) we would expect wireheading or tampering, whereas for extremely nonlocal goals, we would instead expect convergent instrumental subgoals like resource acquisition.

Goals and short descriptions (Michele Campolo) (summarized by Rohin): This post argues that a distinguishing factor of goal-directed policies is that they have low Kolmogorov complexity, relative to e.g. a lookup table that assigns a randomly selected action to each observation. It then relates this to quantilizers (AN #48) and mesa optimization (AN #58).

Rohin's opinion: This seems reasonable to me as an aspect of goal-directedness. Note that it is not a sufficient condition. For example, the policy that always chooses action A has extremely low complexity, but I would not call it goal-directed.
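
To make the complexity comparison concrete, here is a small sketch. Kolmogorov complexity is uncomputable, so the code uses zlib-compressed size as a crude upper-bound proxy, which is an assumption of this example rather than anything proposed in the post. The gridworld, the action encoding, and the three policies are hypothetical.

```python
import random
import zlib

N = 64  # hypothetical N x N gridworld; one observation per cell

def compressed_size(policy_table):
    """Bytes needed by zlib to store the policy's lookup table."""
    return len(zlib.compress(bytes(policy_table), 9))

def toward_goal(x, y, goal=(N - 1, N - 1)):
    """A simple goal-directed rule. Actions: 0 = right, 1 = up, 2 = stay."""
    if x < goal[0]:
        return 0
    if y < goal[1]:
        return 1
    return 2

# Three policies over the same observations, each written out as a lookup table.
goal_policy = [toward_goal(x, y) for y in range(N) for x in range(N)]
rng = random.Random(0)
random_policy = [rng.randrange(3) for _ in range(N * N)]
constant_policy = [0] * (N * N)

for name, table in [("goal-directed", goal_policy),
                    ("random table", random_policy),
                    ("constant", constant_policy)]:
    print(name, compressed_size(table))
```

The goal-directed table compresses far better than the random one, matching the post's claim. The constant policy also compresses very well, which is the point of the opinion above: a short description is not sufficient for goal-directedness.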

OTHER PROGRESS IN AI

HIERARCHICAL RL

Learning Reward Machines for Partially Observable Reinforcement Learning (Rodrigo Toro Icarte et al) (summarized by Rohin) (H/T Daniel Dewey): Typically in reinforcement learning, the agent only gets access to a reward signal: it sees a single number saying how well it has done. The problem might be simpler to solve if the agent could get a more holistic view of the problem through a structured representation of the reward. This could allow it to infer things like “if I went left, I would get 5 reward, but if I went right, I would get 10 reward”. Under the current RL paradigm, it has to try both actions in separate episodes to learn this.

Model-based RL tries to recover some of this structured representation: it learns a model of the world and the reward function, such that you can ask queries of the form “if I took this sequence of actions, what reward would I get?” The hope is that the learned models will generalize to new sequences that we haven’t previously seen, allowing the agent to learn from fewer environment interactions (i.e. higher sample efficiency).

This work does something similar using reward machines. The key idea is to represent both the reward and some aspects of the dynamics using a finite state machine, which can then be reasoned about without collecting more experience. In particular, given a POMDP, they propose learning a set of states U such that when combining the observation o with the state u, we have an MDP instead of a POMDP. This is called a perfect reward machine. To make this feasible, they assume the existence of a labeling function L that, given a transition <o, a, o’>, extracts all of the relevant state information. (Since POMDPs can be reduced to belief-space MDPs, it is always possible to extract a perfect reward machine by having U be the set of possible beliefs and L be the identity function, but the hope is that U and L can be much simpler in most cases.)
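
A minimal sketch of the data structure may help. It assumes a hypothetical "pick up the key, then open the door" task, with a hand-written labeling function and a hand-written machine (the paper learns the machine from experience). The class name, state names, and labels are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class RewardMachine:
    """Finite state machine over labels: (u, label) -> (next u, reward)."""
    initial: str
    transitions: dict = field(default_factory=dict)

    def step(self, u, label):
        # Labels with no listed transition leave u unchanged and give no reward.
        return self.transitions.get((u, label), (u, 0.0))

def labeling_function(obs, action, next_obs):
    """Hypothetical L: extract the high-level event from a transition."""
    return next_obs.get("event", "none")

rm = RewardMachine(
    initial="no_key",
    transitions={
        ("no_key", "key"): ("has_key", 0.0),
        ("has_key", "door"): ("done", 1.0),
    },
)

def run(machine, trajectory):
    """Track the machine state u and total reward along a trajectory."""
    u, total = machine.initial, 0.0
    for obs, action, next_obs in trajectory:
        u, reward = machine.step(u, labeling_function(obs, action, next_obs))
        total += reward
    return u, total

key_then_door = [({}, "move", {"event": "key"}), ({}, "move", {"event": "door"})]
door_then_key = [({}, "move", {"event": "door"}), ({}, "move", {"event": "key"})]
print(run(rm, key_then_door))
print(run(rm, door_then_key))
```

The same "door" observation yields different rewards depending on u, so the observation alone is not Markovian, while the pair (o, u) carries the memory needed to predict reward.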

They provide a formulation of an optimization problem over finite state machines such that a perfect reward machine would be an optimal solution to that problem (though I believe other imperfect reward machines could also be optimal). Since they are searching over a discrete space, they need to use a discrete optimization algorithm, and end up using Tabu search.

Once they have learned a reward machine from experience and a labeling function L, how can they use it to improve policy learning? They propose a very simple idea: when we get experience <o, a, o’>, treat it as a separate experience for every possible u, which effectively multiplies the size of the dataset. They can then learn optimal policies that are conditioned on the state u (which can be inferred at test time using the learned state machine). Experiments show that this works in some simple gridworlds.
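
The relabeling step can be sketched as follows. This is an illustration under assumptions, not the authors' code: the reward machine is represented as a plain dictionary mapping (u, label) to (next u, reward), and the states, labels, and label_fn are hypothetical.

```python
def augment(experience, states, delta, label_fn):
    """Turn one environment transition into one training tuple per machine state u.

    Each output tuple is ((o, u), a, reward, (o', u')), where u' and the
    reward come from the reward machine rather than from the environment.
    """
    obs, action, next_obs = experience
    label = label_fn(obs, action, next_obs)
    augmented = []
    for u in states:
        # Unlisted (u, label) pairs leave u unchanged and give zero reward.
        u_next, reward = delta.get((u, label), (u, 0.0))
        augmented.append(((obs, u), action, reward, (next_obs, u_next)))
    return augmented

# Hypothetical key-and-door machine and labeling function.
states = ["no_key", "has_key", "done"]
delta = {
    ("no_key", "key"): ("has_key", 0.0),
    ("has_key", "door"): ("done", 1.0),
}

def label_fn(obs, action, next_obs):
    return "door" if next_obs == "door_open" else "none"

batch = augment(("at_door", "open", "door_open"), states, delta, label_fn)
for item in batch:
    print(item)
```

Note that the sketch reuses the observed o’ for every u, so it silently assumes the environment would have produced the same next observation regardless of the hidden state.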

Rohin's opinion: To summarize my summary, this paper assumes we have a POMDP with a labeling function L that extracts important state information from transitions. Given this, they learn a (hopefully perfect) reward machine from experience, and then use the reward machine to learn a policy more efficiently.

I see two main limitations to this method. First, they require a good labeling function L, which doesn’t seem easy to specify (at least if you want a high-level labeling function that only extracts the relevant information). Second, I think their heuristic of using every transition as a separate experience for every possible u would not usually work -- even if you learn a perfect reward machine (such that the combination of o and u together forms a “state” in an MDP), it’s not necessarily true that for every possible state in which you get observation o, when taking action a, you get observation o’. The authors acknowledge this limitation with an example of a gridworld with a button that changes how transitions work. But it seems to me that in practice, the underlying state in a POMDP will often affect the next observation you get. For example, in Minecraft, maybe you get some experience where you chop down a tree, in which your next observation involves you having wood. If you generalize it to all possible states with identical initial observations, you’d also generalize it to the case where there is an enemy behind you who is about to attack. Then, your policy would learn to chop down trees, even when it knows that there is an enemy behind it.

It seems pretty important in RL to figure out how to infer underlying states when working in a POMDP, as it seems like a useful inductive bias for our agents to assume that there is a (Markovian) world “out there”, and I’m excited that people are thinking about this. Due to the two limitations above, I don’t expect that reward machines are the way to go (at least as developed so far), but it’s exciting to see new ideas in this area. (I’m currently most excited about learning a latent state space model, as done in e.g. Dreamer (AN #83).)

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.
