One broad argument for AI risk is the Misspecified Goal argument:
The Misspecified Goal Argument for AI Risk: Very intelligent AI systems will be able to make long-term plans in order to achieve their goals, and if their goals are even slightly misspecified then the AI system will become adversarial and work against us.
My main goal in this post is to make conceptual clarifications and suggest how they affect the Misspecified Goal argument, without making any recommendations about what we should actually do. Future posts will argue more directly for a particular position. As a result, I will not be considering other arguments for focusing on AI risk even though I find some of them more compelling.
I think of this as a concern about long-term goal-directed behavior. Unfortunately, it’s not clear how to categorize behavior as goal-directed vs. not. Intuitively, any agent that searches over actions and chooses the one that best achieves some measure of “goodness” is goal-directed (though there are exceptions, such as the agent that selects actions that begin with the letter “A”). However, this is not a necessary condition: many humans are goal-directed, but there is no goal baked into the brain that they are using to choose actions.
This is related to the concept of optimization, though with intuitions around optimization we typically assume that we know the agent’s preference ordering, which I don’t want to assume here. (In fact, I don’t want to assume that the agent even has a preference ordering.)
One potential formalization is to say that goal-directed behavior is any behavior that can be modelled as maximizing expected utility for some utility function; in the next post I will argue that this does not properly capture the behaviors we are worried about. In this post I’ll give some intuitions about what “goal-directed behavior” means, and how these intuitions relate to the Misspecified Goal argument.
Generalization to novel circumstances
Consider two possible agents for playing some game, let’s say TicTacToe. The first agent looks at the state and the rules of the game, and uses the minimax algorithm to find the optimal move to play. The second agent has a giant lookup table that tells it what move to play given any state. Intuitively, the first one is more “agentic” or “goal-driven”, while the second one is not. But both of these agents play the game in exactly the same way!
The difference is in how the two agents generalize to new situations. Let’s suppose that we suddenly change the rules of TicTacToe -- perhaps now the win condition is reversed, so that anyone who gets three in a row loses. The minimax agent is still going to be optimal at this game, whereas the lookup-table agent will lose against any opponent with half a brain. The minimax agent looks like it is “trying to win”, while the lookup-table agent does not. (You could say that the lookup-table agent is “trying to take actions according to <policy>”, but this is a weird complicated goal so maybe it doesn’t count.)
In general, when we say that an agent is pursuing some goal, this is meant to allow us to predict how the agent will generalize to some novel circumstance. This sort of reasoning is critical for the Goal-Directed argument for AI risk. For example, we worry that an AI agent will prevent us from turning it off, because that would prevent it from achieving its goal: “You can't fetch the coffee if you're dead.” This is a prediction about what an AI agent would do in the novel circumstance where a human is trying to turn the agent off.
This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal. There's a lot of complexity in the space of goals we consider: something like "human well-being" should count, but "the particular policy <x>" and “pick actions that start with the letter A” should not. When I use the word goal I mean to include only the first kind, even though I currently don’t know theoretically how to distinguish between the various cases.
Note that this is in stark contrast to existing AI systems, which are particularly bad at generalizing to new situations.
Honestly, I’m surprised it’s only 90%. 
We could also look at whether or not the agent acquires more power and resources. It seems likely that an agent that is optimizing for some goal over the long term would want more power and resources in order to more easily achieve that goal. In addition, the agent would probably try to improve its own algorithms in order to become more intelligent.
This feels like a consequence of goal-directed behavior, and not its defining characteristic, because it is about being able to achieve a wide variety of goals, instead of a particular one. Nonetheless, it seems crucial to the broad argument for AI risk presented above, since an AI system will probably need to first accumulate power, resources, intelligence, etc. in order to cause catastrophic outcomes.
I find this concept most useful when thinking about the problem of inner optimizers, where in the course of optimization through a rich space you stumble across a member of the space that is itself doing optimization, but for a related but still misspecified metric. Since the inner optimizer is being “controlled” by the outer optimization process, it is probably not going to cause major harm unless it is able to “take over” the outer optimization process, which sounds a lot like accumulating power. (This discussion is extremely imprecise and vague; see the upcoming MIRI paper on “The Inner Alignment Problem” for a more thorough discussion.)
Our understanding of the behavior
There is a general pattern in which as soon as we understand something, it becomes something lesser. As soon as we understand rainbows, they are relegated to the “dull catalogue of common things”. This suggests a somewhat cynical explanation of our concept of “intelligence”: an agent is considered intelligent if we do not know how to achieve the outcomes it does using the resources that it has (in which case our best model for that agent may be that it is pursuing some goal, reflecting our tendency to anthropomorphize). That is, our evaluation about intelligence is a statement about our epistemic state. Some examples that follow this pattern are:
- As soon as we understand how some AI technique solves a challenging problem, it is no longer considered AI. Before we’ve solved the problem, we imagine that we need some sort of “intelligence” that is pointed towards the goal and solves it: the only method we have of predicting what this AI system will do is to think about what a system that tries to achieve the goal would do. Once we understand how the AI technique works, we have more insight into what it is doing and can make more detailed predictions about where it will work well, where it tends to make mistakes, etc. and so it no longer seems like “intelligence”. Once you know that OpenAI Five is trained by self-play, you can predict that they haven’t seen certain behaviors like standing still to turn invisible, and probably won’t work well there.
- Before we understood the idea of natural selection and evolution, we would look at the complexity of nature and ascribe it to intelligent design; once we had the mathematics (and even just the qualitative insight), we could make much more detailed predictions, and nature no longer seemed like it required intelligence. For example, we can predict the timescales on which we can expect evolutionary changes, which we couldn’t do if we just modeled evolution as optimizing reproductive fitness.
- Many phenomena (eg. rain, wind) that we now have scientific explanations for were previously explained to be the result of some anthropomorphic deity.
- When someone performs a feat of mental math, or can tell you instantly what day of the week a random date falls on, you might be impressed and think them very intelligent. But if they explain to you how they did it, you may find it much less impressive. (Though of course these feats are selected to seem more impressive than they are.)
Note that an alternative hypothesis is that humans equate intelligence with mystery; as we learn more and remove mystery around eg. evolution, we automatically think of it as less intelligent.
To the extent that the Misspecified Goal argument relies on this intuition, the argument feels a lot weaker to me. If the Misspecified Goal argument rested entirely upon this intuition, then it would be asserting that because we are ignorant about what an intelligent agent would do, we should assume that it is optimizing a goal, which means that it is going to accumulate power and resources and lead to catastrophe. In other words, it is arguing that assuming that an agent is intelligent definitionally means that it will accumulate power and resources. This seems clearly wrong; it is possible in principle to have an intelligent agent that nonetheless does not accumulate power and resources.
Also, the argument is not saying that in practice most intelligent agents accumulate power and resources. It says that we have no better model to go off of other than “goal-directed”, and then pushes this model to extreme scenarios where we should have a lot more uncertainty.
To be clear, I do not think that anyone would endorse the argument as stated. I am suggesting as a possibility that the Misspecified Goal argument relies on us incorrectly equating superintelligence with “pursuing a goal” because we use “pursuing a goal” as a default model for anything that can do interesting things, even if that is not the best model to be using.
Intuitively, goal-directed behavior can lead to catastrophic outcomes with a sufficiently intelligent agent, because the optimal behavior for even a slightly misspecified goal can be very bad according to the true goal. However, it’s not clear exactly what we mean by goal-directed behavior. A sufficient condition is that the algorithm searches over possible actions and chooses the one with the highest “goodness”, but this is not a necessary condition.
“From the outside”, it seems like a goal-directed agent is characterized by the fact that we can predict the agent’s behavior in new situations by assuming that it is pursuing some goal, and as a result it is acquires power and resources. This can be interpreted either as a statement about our epistemic state (we know so little about the agent that our best model is that it pursues a goal, even though this model is not very accurate or precise) or as a statement about the agent (predicting the behavior of the agent in new situations based on pursuit of a goal actually has very high precision and accuracy). These two views have very different implications on the validity of the Misspecified Goal argument for AI risk.
 This is an entirely made-up number.
Tomorrow's AI Alignment Forum sequences post will be 'Benign model-free RL' by Paul Christiano in the sequence on iterated amplification.
The next post in this sequence will be 'Coherence arguments do not imply goal-directed behavior' by Rohin Shah, on Sunday 2nd December.