Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic Status: quick write-up, in reaction to a serendipitous encounter with an idea. I see the main value of this post as decently presenting a potentially interesting take on a concept in AI safety to the community.

While skimming my copy of Reinforcement Learning: An Introduction for the part on AlphaGo Zero, I found a section called Habitual and Goal-directed Behavior. That caught my attention, because one idea I keep going back to is goal-directed behavior from the Value Learning Sequence, which intrigued me when I studied the sequence. But the lack of formalization left me uncertain about my position on its argument that not all useful (and possibly superintelligent) agents have to be goal-directed.

Going back to my serendipitous discovery, the section is part of the Psychology chapter: it compares model-based and model-free RL to goal-directed and habitual behavior in psychology. To clarify the RL terms used here:

  • Model-based RL is the version of RL where the agent learns both from direct experience with the environment and from simulated experience with a model, which entails that it builds and updates a model of the environment.
  • Model-free RL is the version of RL where the agent learns only from direct experience with the environment. Usually, the only thing available to the agent is its value function or its policy; there is no model of the environment.
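To make the distinction concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the book: the 5-state corridor, the names `step`, `q_update` and `train`, and the hyperparameters. With `planning_sweeps=0` the agent is plain model-free Q-learning. With `planning_sweeps > 0` it also replays a learned model after every real step, in the spirit of Dyna-style methods, except that it sweeps the whole model deterministically instead of sampling from it.

```python
import random

# Hypothetical 5-state corridor (illustrative assumption): states 0..4,
# actions 0 (left) and 1 (right); reaching state 4 pays 1 and ends the episode.
N_STATES, ACTIONS, GOAL = 5, (0, 1), 4

def step(state, action):
    """Deterministic corridor dynamics: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def q_update(q, s, a, r, s2, done, alpha=0.5, gamma=0.9):
    """One-step Q-learning backup, used for real and simulated experience alike."""
    target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])

def train(episodes, planning_sweeps, seed=0):
    """planning_sweeps=0: model-free. planning_sweeps>0: also learn from a model."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned model: (state, action) -> (reward, next_state, done)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action choice with random tie-breaking.
            if rng.random() < 0.1:
                a = rng.choice(ACTIONS)
            else:
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s2, r, done = step(s, a)
            q_update(q, s, a, r, s2, done)   # direct experience
            model[(s, a)] = (r, s2, done)    # update the model
            for _ in range(planning_sweeps): # simulated experience
                for (ps, pa), (pr, ps2, pdone) in model.items():
                    q_update(q, ps, pa, pr, ps2, pdone)
            s = s2
    return q

q_free = train(episodes=1, planning_sweeps=0)
q_model = train(episodes=1, planning_sweeps=5)
```

After a single episode, the model-free agent has credited only the final move into the goal, while the agent that replays its model has already propagated value back to the start state.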

As for the different behaviors, let's quote the book itself:

Goal-directed behavior, according to how psychologists use the phrase, is purposeful in the sense that it is controlled by knowledge of the value of goals and the relationship between actions and their consequences.


Habits are behavior patterns triggered by appropriate stimuli and then performed more-or-less automatically.

There is also a summary:

Habits are sometimes said to be controlled by antecedent stimuli, whereas goal-directed behavior is said to be controlled by its consequences

Now, is this version of goal-directed behavior linked to the version that Rohin Shah wrote about? I think so. They are probably not the same, but examining the similarities and differences might clarify the part relevant to AI safety.

Comparison between the two versions of goal-directed behavior


In Intuitions about goal-directed behavior, the first example contrasting goal-directed and non-goal-directed behavior concerns policies for agents playing TicTacToe:

Consider two possible agents for playing some game, let’s say TicTacToe. The first agent looks at the state and the rules of the game, and uses the minimax algorithm to find the optimal move to play. The second agent has a giant lookup table that tells it what move to play given any state. Intuitively, the first one is more “agentic” or “goal-driven”, while the second one is not. But both of these agents play the game in exactly the same way!

This feels very model-based RL vs model-free RL to me! It's almost the same example as in Figure 14.9 from the section: a rat tries to navigate a maze towards different rewards, and can either learn pure action-values from experience (habitual behavior/model-free RL) or learn a model of the maze and fill it with action-values (goal-directed behavior/model-based RL).

(Notice that here, I assume that the lookup table contains actions learned from the environment, or at least adapted to the environment. I consider the case where the lookup table is just hard-coded random values in the next subsection.)
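The two TicTacToe agents from the quote can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the board encoding (a 9-character string), all function names, and the choice to fill the lookup table from minimax rather than from learning are mine, not Rohin's or the book's. The point it shows is the one in the quote: the two agents behave identically, even though only one of them searches over consequences when it has to move.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def to_move(board):
    return "X" if board.count("X") == board.count("O") else "O"

def moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]

def play(board, i):
    return board[:i] + to_move(board) + board[i + 1:]

@lru_cache(maxsize=None)
def value(board):
    """Minimax value from X's point of view: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    vals = [value(play(board, i)) for i in moves(board)]
    return max(vals) if to_move(board) == "X" else min(vals)

def minimax_agent(board):
    """First agent: looks at the state and the rules, then searches."""
    pick = max if to_move(board) == "X" else min
    return pick(moves(board), key=lambda i: value(play(board, i)))

def build_table(board=" " * 9, table=None):
    """Enumerate every reachable non-terminal state and store one move for it."""
    table = {} if table is None else table
    if board in table or winner(board) or " " not in board:
        return table
    table[board] = minimax_agent(board)
    for i in moves(board):
        build_table(play(board, i), table)
    return table

TABLE = build_table()

def lookup_agent(board):
    """Second agent: no search at decision time, only a table lookup."""
    return TABLE[board]

def play_game(agent_x, agent_o):
    board = " " * 9
    while winner(board) is None and " " in board:
        agent = agent_x if to_move(board) == "X" else agent_o
        board = play(board, agent(board))
    return winner(board)
```

Calling `play_game(lookup_agent, minimax_agent)` pits the two against each other; since both play optimally, the game ends in a draw and the function returns `None`.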

There is also a parallel between the advantages of goal-directed behavior given by Rohin:

This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal.

and the intuition behind goal-directed behavior from the section:

Goal-directed control has the advantage that it can rapidly change an animal’s behavior when the environment changes its way of reacting to the animal’s actions.


Going back to the TicTacToe example, we can interpret the lookup-table version as being hard-coded. If that's the case, then it is not really analogous to model-free RL.

In the same vein, Rohin gives more examples of behavior he considers to not be goal-directed in another post:

  • A robot that constantly twitches
  • The agent that always chooses the action that starts with the letter “A”
  • The agent that follows the policy <policy> where for every history the corresponding action in <policy> is generated randomly.

I can think of ways to explain these as habitual behavior, but it feels a bit forced to me. As I understand it, habitual behavior is still adaptive, just on a potentially longer timescale and through different mechanisms. On the other hand, the examples above are about "habits" that are not even suited to the original environment.


These two versions of goal-directed behavior seem linked to me. Whether they are actually the same, and whether the connection will prove useful for safety research, is still unclear.



First off, I'm super happy that people are thinking about goal-directed behavior :D

I think model-based RL is typically goal-directed, in that it typically performs a search using a world model for a trajectory that achieves high reward (the goal). However, powerful model-free RL usually is also goal-directed -- consider AlphaZero (without the MCTS), OpenAI Five, AlphaStar, etc: these are all model-free, but still seem fairly goal-directed. More generally, model-free and model-based RL algorithms usually get similar performance on environments (often model-free is less sample efficient but has a higher final performance, though this isn't always the case).

Also more broadly, I think there's a smooth spectrum between "habitual cognition" and "goal-directed cognition", such that you can't cleanly carve up the space into a binary "goal-directed" or not.

Thanks for the feedback!

I am indeed thinking about your intuitions for goal-directed behavior, because they seem quite important. I currently lack a clear idea (as formal as possible) of what you mean, and thus I have trouble weighing your arguments that it is not necessary, or that it causes most problems in safety. And since these arguments would have significant implications, I want to have as informed an opinion as possible on them.

Since you say that goal-directed behavior is not about having a model or not, is it about the form of the model? Or about the use of the model? Would a model-based agent that did not adapt its model when the environment changed be considered as not goal-directed (like the lookup-table agent in your example)?

Since you say that goal-directed behavior is not about having a model or not, is it about the form of the model? Or about the use of the model?

I'm thinking that there may not be any model. Consider for example an agent that solves (simply connected) mazes by implementing the right hand rule: such an agent seems at least somewhat goal-directed, but it's hard for me to see a model anywhere in this agent.
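For concreteness, a hard-coded right-hand-rule agent might look like the following minimal Python sketch. The maze layout, the `S`/`E` markers, the starting heading, and all names are assumptions made up for illustration. The agent keeps only its position and heading; it never builds a map or a model of the maze.

```python
# Headings in clockwise order: north, east, south, west (row, column deltas).
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

# Hypothetical simply connected maze, fully enclosed by walls ('#').
MAZE = [
    "#######",
    "#S#   #",
    "# # # #",
    "#   #E#",
    "#######",
]

def find(maze, ch):
    """Return the (row, column) of the first cell containing ch."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not found in maze")

def right_hand_rule(maze, heading=2, max_steps=1000):
    """Follow the right-hand wall from 'S' until 'E'; heading=2 means facing south."""
    pos, goal = find(maze, "S"), find(maze, "E")
    path = [pos]
    for _ in range(max_steps):
        if pos == goal:
            return path
        # Prefer turning right, then going straight, then left, then back.
        for turn in (1, 0, 3, 2):
            h = (heading + turn) % 4
            r, c = pos[0] + HEADINGS[h][0], pos[1] + HEADINGS[h][1]
            if maze[r][c] != "#":
                pos, heading = (r, c), h
                path.append(pos)
                break
    return path if pos == goal else None

path = right_hand_rule(MAZE)
```

On this maze the agent reaches the exit in 10 moves, so `path` holds 11 positions including the start.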

Would a model-based agent that did not adapt its model when the environment changed be considered as not goal-directed (like the lookup-table agent in your example)?

Yeah, I think that does make it less goal-directed.

About the "right hand rule" agent, I feel it depends on whether it is a hard-coded agent or a learning agent. If it is hard-coded, then clearly it doesn't require a model. But if it learns such a rule, I would assume it was inferred from a learned model of what mazes are.

For the non-adaptive agent, you say it is less goal-directed; do you see goal-directedness as a continuous spectrum, as a set of zones on this spectrum, or as a binary threshold on this spectrum?

About the "right hand rule" agent, I feel it depends on whether it is a hard-coded agent or a learning agent.

Yes, I meant the hard-coded one. It still seems somewhat goal-directed to me.

do you see goal-directedness as a continuous spectrum, as a set of zones on this spectrum, or as a binary threshold on this spectrum?

Oh, definitely a continuous spectrum. (Though I think several people disagree with me on this, and see it more like a binary-ish threshold. Such people often say things like "intelligence and generalization require some sort of search-like cognition". I don't understand their views very well.)

Do you have references to posts by the people who think goal-directedness is binary-ish? That would be very useful, thanks. :)

Uh, not really. The mesa optimizers sequence sort of comes from this viewpoint, as does this question, but I haven't really seen any posts arguing for this position.

Yup, I'm pretty sure people are aware of this :) See also the model of an agent as something with preferences, beliefs, available actions, and a search+decision algorithm that makes it take actions it believes will help its preferences.

But future AI research will require some serious generalizations that are left un-generalized in current methods. A simple gridworld problem might treat the entire grid as a known POMDP and do search over possible series of actions. Obviously the real world isn't a known POMDP, so suppose that we just call it an unknown POMDP and try to learn it through observation - now all of a sudden, you can't hand-specify a cost function in terms of the world model anymore, so that needs to be re-evaluated as well.

I'm curious about what you think people are aware of: that the idea of goal-directedness from the value learning sequence is captured by model-based RL, or that any sufficiently powerful agent (implicitly goal-directed) needs to be model-based instead of model-free?

If it's the former, I'm really interested in links to posts and comments pointing that out, as I don't know of any. And if it's the latter, then it seems to go back to asking whether powerful agents must be goal-directed.

The former (that is, model-based RL -> agent). The latter (smart agent -> model-based RL), I think, would be founded on a bit of a level error. At bottom, there are only atoms and the void. Whether something is "really" an agent is a question of how well we can describe this collection of atoms in terms of an agent-shaped model. This is different from the question of what abstractions humans used in the process of programming the AI; like Rohin says, parts of the agent might be thought of as implicit in the programming, rather than explicit.

Sorry, I don't know if I can direct you to any explicit sources. If you check out papers like Concrete Problems in AI Safety or others in that genre, though, you'll see model-based RL used as a simplifying set of assumptions that imply agency.