
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

When asked about what it means for a system to be goal-directed, one common answer draws on some version of Dennett’s intentional stance: a goal-directed system is a system such that modeling it as having a goal provides accurate and efficient predictions about its behavior. I agree up to that point. But then, some people follow up by saying that the prediction is that the system will accomplish its goal. For example, it makes sense to model AlphaGo as goal-directed towards winning at Go, because it will eventually win. And taking the intentional stance allows me to predict that.

But what if I make AlphaGo play against AlphaZero, which is strictly better at Go? Then AlphaGo will consistently lose. Does it mean that it’s no longer goal-directed towards winning?

What feels wrong to me is the implicit link drawn between goal-directedness and competence. A bad Go player will usually lose, but it doesn’t seem any less goal-directed to me than a stronger one that consistently wins.

Competence is thus not the whole story. It might be useful to compute goal-directedness; reaching some lower-bound of competency might even be a necessary condition for goal-directedness (play badly enough and it becomes debatable whether you're even trying to win). But when forcing together the two, I feel like something important is lost.

To solve this problem, I propose a new metric of goal-directedness, focus: how much the system is trying to accomplish a certain goal. Focus is not the whole story about being goal-directed, but I think computing the focus of a system for some goal (details in the next paragraph) gives useful information about its goal-directedness.

Given a system S (as a function from states or histories to actions) and a goal G (as a set of states), here are the steps to compute the focus of S towards G.

• I define a reward function R over states, valued 1 at states in G and 0 at all other states.
• Then I define Pol to be the set of all policies that can be generated by Reinforcement Learning (RL) on R. I’ll go into details about Pol below, but the most important part here is that it isn't limited to optimal policies; I also consider policies of RL with “few resources”. Basically all policies at intermediary steps of the RL training are in Pol.
• Lastly, I pick a distance between policies. If the two policies are deterministic, a Hamming distance will do. If they are stochastic, maybe some vector distance based on the Kullback-Leibler divergence.
• Then, the focus of S towards G is inversely proportional to the distance between S and Pol.
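Putting the steps together, here is a minimal sketch in code. The stand-in `train` procedure, the restriction to deterministic policies with a Hamming distance, and the 1/(1+d) normalization for "inversely proportional" are all placeholder assumptions, not part of the proposal itself:

```python
def hamming_distance(pi1, pi2, states):
    """Fraction of states on which two deterministic policies disagree."""
    return sum(pi1[s] != pi2[s] for s in states) / len(states)

def focused_policies(reward, train, budgets, seeds):
    """A stand-in for Pol: policies produced by one fixed RL algorithm
    (`train` is a placeholder) over several resource budgets and seeds."""
    return [train(reward, b, seed) for b in budgets for seed in seeds]

def focus(policy, goal, states, train, budgets, seeds):
    """Focus of a system towards goal G, as 1 / (1 + min distance to Pol)."""
    reward = {s: (1.0 if s in goal else 0.0) for s in states}
    pol = focused_policies(reward, train, budgets, seeds)
    return 1.0 / (1.0 + min(hamming_distance(policy, pi, states)
                            for pi in pol))
```

A system whose policy coincides with some policy in Pol gets the maximal focus of 1; focus decreases as the fraction of disagreeing actions grows.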

The intuition here is that any policy that results from training on this reward function is aiming maximally towards the goal, by definition. And by taking an appropriate distance, we can measure how far our system is from such a fully focused policy. The distance captures the proportion of actions taken by the policy that fit with aiming towards the specific goal.

Of course, there are many points that need further thought:

• What does “all policies given by using RL” mean in this case? The easy answer is all policies resulting from taking any RL method and any initial conditions, and training for any amount of resources on the reward function of the goal. But not only is this really, really uncomputable, I’m not sure it’s even well-defined (what are “all methods of RL”?). Ideally, I would want to limit the study to one specific RL algorithm (SARSA for example), and then the set of generated policies would be well-defined. But I’m not sure whether I lose any policies by doing so.
• Even when fixing some RL algorithm, it is completely infeasible to consider all initial conditions and amounts of resources. Yet this is the obvious way to compute the set of maximally-focused policies. Here I hope for either a dense subset (or a good approximation) of this set of policies, or even an analytical characterization if one exists.
• The ghost of competence strikes back here, because I cannot really consider any amount of resources; if I did, then every policy would be maximally-focused for the goal, as it would be generated by taking the policy as an initial condition and using no resources at all. My intuition for dealing with this is that there should be a meaningful lower bound on the amount of resources the RL algorithm has to use before the resulting policy is indeed maximally-focused. Maybe enough resources for all state values or state-action values to have been updated at least once?
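To illustrate the candidate lower bound, here is a sketch that runs tabular SARSA until every state-action value has been updated at least once. The tiny environment interface and the stopping rule are my assumptions, only meant to make the condition concrete:

```python
import random
from collections import defaultdict
from itertools import product

def coverage_reached(counts, states, actions):
    """The candidate lower bound from the post: every state-action value
    has been updated at least once."""
    return all(counts.get((s, a), 0) > 0 for s, a in product(states, actions))

def sarsa_until_covered(step, states, actions, start, alpha=0.5, eps=0.3,
                        gamma=0.9, max_updates=50_000, seed=0):
    """Minimal tabular SARSA run until the coverage condition holds over the
    non-terminal `states`. `step(s, a) -> (next_state, reward, done)` stands
    in for whatever environment the reward function lives in."""
    rng = random.Random(seed)
    q = defaultdict(float)          # Q-values, default 0
    counts = defaultdict(int)       # update counts per (state, action)

    def pick(s):                    # epsilon-greedy action choice
        if rng.random() < eps:
            return rng.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    n = 0
    while n < max_updates and not coverage_reached(counts, states, actions):
        s, a, done = start, pick(start), False
        while not done and n < max_updates:
            s2, r, done = step(s, a)
            a2 = pick(s2)
            target = r if done else r + gamma * q[(s2, a2)]
            q[(s, a)] += alpha * (target - q[(s, a)])
            counts[(s, a)] += 1
            n += 1
            s, a = s2, a2
    return dict(q), counts
```

Note that a policy admitted into Pol under this rule may stop training very early, so it can still be far from optimal, which is exactly the point of not tying focus to competence.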

Finally, assuming we are able to compute the focus of any system for any goal, how should we interpret the results we get? Focus is not divided between goals the way probability is: for example, the full goal consisting of all possible states always has maximal focus, as all policies are optimal for the corresponding reward; but other goals might have the same focus too. This entails that finding the most representative goal is not only about focus, but also about the triviality of the goal.

My far less clean intuition here is that the “triviality” of the goal should weight its focus. That is, the goal consisting of all possible states is trivial, whereas the one consisting of exactly one state is not trivial at all. Thus even if the former has stronger focus than the latter, it has to be really, really stronger to compensate for its triviality. Or said another way, a non-trivial goal with a small but not negligible focus exhibits more goal-directedness than a trivial goal with enormous focus.
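A naive way to implement this weighting, purely for illustration (both the linear discount and measuring triviality as the goal's share of the state space are my assumptions):

```python
def triviality(goal, states):
    """Fraction of the state space covered by the goal: 1.0 for the full
    (trivial) goal, close to 0 for a singleton goal."""
    return len(goal) / len(states)

def weighted_focus(focus_value, goal, states):
    """Focus discounted by triviality, so that a non-trivial goal with
    modest focus can beat a trivial goal with enormous focus."""
    return focus_value * (1.0 - triviality(goal, states))
```

Under this particular choice the all-states goal is discounted entirely, which is probably too harsh; a softer weighting would keep the same qualitative behavior.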

Even with all those uncertainties, I still believe focus is a step in the right direction. It trims down competence to the part that seems the most relevant to goal-directedness. That being said, I am very interested in any weakness of the idea, or any competing intuition.

Thanks to Jérémy Perret for feedback on the writing, and to Joe Collman, Michele Campolo and Sabrina Tang for feedback on the idea.

xuan

Thanks for writing up this post! It's really similar in spirit to some research I've been working on with others, which you can find on the ArXiv here: https://arxiv.org/abs/2006.07532 We also model bounded goal-directed agents by assuming that the agent is running some algorithm given bounded compute, but our approach differs in the following ways:

• We don't attempt to compute full policies over the state space, since this is generally intractable, and also cognitively implausible, at least for agents like ourselves. Instead, we compute (partial) plans from initial states to goal states.
• Rather than using RL algorithms like value iteration or SARSA, we assume that agents deploy some form of heuristic-guided model-based search, e.g. A*, MCTS, with a bounded computational budget. If search terminates before the goal is reached, then agents pursue a partial plan towards a promising intermediate state found during search.
• "Promisingness" is dependent on the search heuristic used -- a poor search heuristic will lead to highly non-optimal partial plans, whereas a good search heuristic will lead to partial plans that make significant progress to the goal, even if the goal itself isn't reached.
• Separating out the search heuristic from the search budget gives us at least two different notions of agent-boundedness, roughly corresponding to competence vs. effort. An agent may be really good at search, but may not spend a large computational budget on it, or they may be bad at search, but spend a lot of time searching, and still get the right answer.
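To make the second bullet concrete, here is a rough sketch of budget-bounded, heuristic-guided search that returns a partial plan when the budget runs out (my own toy version of the idea, not the algorithm from the paper):

```python
import heapq

def bounded_search(start, goal, neighbors, heuristic, budget):
    """Greedy best-first search with a node-expansion budget. If the budget
    runs out before the goal is reached, return the partial plan to the most
    promising node found so far (lowest heuristic value)."""
    frontier = [(heuristic(start), start, [start])]
    best = (heuristic(start), [start])
    seen = {start}
    expanded = 0
    while frontier and expanded < budget:
        h, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, True           # full plan found
        if h < best[0]:
            best = (h, path)            # most promising node so far
        expanded += 1
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt), nxt, path + [nxt]))
    return best[1], False               # partial plan toward promising state
```

With a small budget the agent commits to a prefix of the plan; with a poor heuristic the "promising" intermediate state can be far from the goal, which separates effort from competence.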

The abstract for the paper is below -- hope it's useful to read, and I'd be curious to hear your thoughts:

Online Bayesian Goal Inference for Boundedly-Rational Planning Agents
People routinely infer the goals of others by observing their actions over time. Remarkably, we can do so even when those actions lead to failure, enabling us to assist others when we detect that they might not achieve their goals. How might we endow machines with similar capabilities? Here we present an architecture capable of inferring an agent's goals online from both optimal and non-optimal sequences of actions. Our architecture models agents as boundedly-rational planners that interleave search with execution by replanning, thereby accounting for sub-optimal behavior. These models are specified as probabilistic programs, allowing us to represent and perform efficient Bayesian inference over an agent's goals and internal planning processes. To perform such inference, we develop Sequential Inverse Plan Search (SIPS), a sequential Monte Carlo algorithm that exploits the online replanning assumption of these models, limiting computation by incrementally extending inferred plans as new actions are observed. We present experiments showing that this modeling and inference architecture outperforms Bayesian inverse reinforcement learning baselines, accurately inferring goals from both optimal and non-optimal trajectories involving failure and back-tracking, while generalizing across domains with compositional structure and sparse rewards.
https://arxiv.org/abs/2006.07532

Sorry for the delay in answering.

Your paper looks great! It seems to tackle in a clean and formal way what I was vaguely pointing at. We're currently reading a lot of papers and blog posts to prepare for an in-depth literature review about goal-directedness, and I added your paper to the list. I'll try to come back here and comment after I read it.

It seems like there's an implicit model here which is potentially a lot more interesting than the definition of focus itself. Conceptually, the idea is that we define directedness-toward-goal-X by looking at the set of attractors of RL run with X as a goal. Setting aside the whole question of a metric on the space of policies, what can we say about the set of attractors of RL algorithms?

For instance, things like dutch book theorems seem like they should apply to the attractors of some (but not all) RL algorithms and goals. What class of algorithms/goals do they apply to? When they do apply, and can we say anything about what world-models and utility functions the attractors display? What exact conditions on the RL algorithm/goal make them not apply?

I'd imagine that there's other general properties of the attractors of broad classes of RL algorithms/goals as well.

Yes, that's definitely a question I asked myself. All the discussions about minimal amount of resources and choice of RL policies boil down to defining the attractors such that they're neither trivial nor too restrictive. I'd be very interested by any work in this direction.

I want to show an example that seems interesting for evaluating, and potentially tweaking/improving, the current informal definition.

Consider an MDP with N+1 states s_0, ..., s_N and initial state s_0; from each s_i, one action allows to go back to s_0, and another action goes to s_{i+1} (what happens in s_N is not really important for the following). Consider two reward functions that are both null everywhere, except for one state that has reward 1: s_k in the first function, s_{k+1} in the second function, for some k < N.

It's interesting (problematic?) that two agents π1 and π2, trained respectively on the first and on the second reward function, have similar policies but different goals (defined as sets of states). Specifically, I expect that for k ≪ N, the distance between π1 and the set of focused policies for the second goal goes to 0 as N grows (for various possible choices of the training process and different ways of defining the distance d). In words: with respect to the environment size, the first agent is extremely close to being maximally focused on the second goal, and vice versa, but the two agents have different goals.
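The two optimal policies can be written out explicitly (a toy encoding of the chain above, with `forward` and `back` as the two actions, and with ties in the optimal policy resolved by going back after the rewarded state); they disagree only at s_k, so their Hamming distance is 1/(N+1):

```python
def optimal_policy(rewarded_state, n):
    """Chain policy: move forward until the rewarded state, then loop back
    to s_0 to re-enter it."""
    return {i: ('back' if i >= rewarded_state else 'forward')
            for i in range(n + 1)}

def hamming(p1, p2):
    """Fraction of states on which two deterministic policies disagree."""
    return sum(p1[s] != p2[s] for s in p1) / len(p1)
```

So as N grows, the two agents become arbitrarily close in behavior while their goals stay distinct.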

Maybe this is not a problem at all: it could simply indicate that there exists a way of saying that the two considered goals are similar.

I think one very important thing you are pointing out is that I did not mention the impact of the environment. Because to train using RL, there must be some underlying environment, even if only as a sampling model. This opens up a lot of questions:

• What happens if the actual environment is known by the RL process and the system whose focus we are computing?
• What happens when there is uncertainty over the environment?
• Given an environment, which goals have entangled focus (your example, basically: high focus on one implies high focus on the other)?

As for your specific example, I assume that the distance converges to 0 because intuitively the only difference lies in the action at state s_k (go back to 0 for the first reward and increment for the second), and this state is seen in less and less proportion as N goes to infinity.

This seems like a perfect example of two distinct goals with almost maximal focus, and similar triviality. As mentioned in the post, I don't have a clear cut intuition on what to do here. I would say that we cannot distinguish between the two goals in terms of behavior, maybe.

Planned summary for the Alignment Newsletter:

<@Goal-directedness@>(@Intuitions about goal-directed behavior@) is one of the key drivers of AI risk: it's the underlying factor that leads to convergent instrumental subgoals. However, it has eluded a good definition so far: we cannot simply say that it is the optimal policy for some simple reward function, as that would imply AlphaGo is not goal-directed (since it was beaten by AlphaZero), which seems wrong. Basically, we should not require _competence_ in order to call a system goal-directed, and so instead of only considering optimal policies, we can consider any policy that could have been output by an RL algorithm, perhaps with limited resources. Formally, we can construct a set of policies for G that can result from running e.g. SARSA with varying amounts of resources with G as the reward, and define the focus of a system towards G to be the distance of the system’s policy to the constructed set of policies.

Planned opinion:

I certainly agree that we should not require full competence in order to call a system goal-directed. I am less convinced of the particular construction here: current RL policies are typically terrible at generalization, and tabular SARSA explicitly doesn’t even _try_ to generalize, whereas I see generalization as a key feature of goal-directedness.

You could imagine the RL policies get more resources and so are able to understand the whole environment without generalization, e.g. if they get to update on every state at least once. However, in this case realistic goal-directed policies would be penalized for “not knowing what they should have known”. For example, suppose I want to eat sweet things, and I come across a new fruit I’ve never seen before. So I try the fruit, and it turns out it is very bitter. This would count as “not being goal-directed”, since the RL policies for “eat sweet things” would already know that the fruit is bitter and so wouldn’t eat it.

Thanks for the summary and opinion!

On the summary, I would say that the following sentence

Basically, we should not require _competence_ in order to call a system goal-directed, and so instead of only considering optimal policies, we can consider any policy that could have been output by an RL algorithm, perhaps with limited resources.

is written as if goal-directedness is a binary condition, just a more lenient one. I think your next sentence clarifies this a bit, but it might be worth it to mention at this point that the notion of goal-directedness considered here is more like a spectrum/scale.

For your opinion, I agree that this specific formalization is not meant to capture all of goal-directedness, just one aspect that I find important. (It's also part of an answer to your question on the difference between "being good" and "trying hard".)

That being said, one point I disagree with in your opinion is about generalization. I'm relatively sure that focus doesn't capture all of the generalization part of goal-directedness; but if we include model-based RL in the process, we might have some generalization. The trick is that the set of policies considered is the one generated by all possible RL methods on the reward. This is arguably very vague and hard to construct, which is why I mentioned the hope of reducing it to the policies generated by one or a few RL methods.

In light of this, I interpret your comment as pointing out that limiting ourselves to SARSA makes us lose a lot, and thus is not a good idea. By the way, do you have a reference on that? That would be very useful, thanks.

Lastly, I find your example about my condition on the resources spot on. Even as I wrote it, I didn't notice that requiring every state to be updated means that the policy "has seen it all" in some sense. This indeed limits the use of focus. That being said, your "eat sweet things" behavior might still have very good focus towards this goal, if your "wrong" exploratory behavior happens rarely enough.

By the way, do you have a reference on that?

When you encounter a particular state, you only update the Q-value of that state in the table, and don't do anything to the Q-values of any other state. Therefore, seeing one state will make no difference to your policy on any other state, i.e. no generalization.

You need to use function approximators of some sort to see generalization to new states. (This doesn't have to be a neural net -- you could approximate the Q-function as a linear function over handcoded features, and this would also give you some generalization to new states.)
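As a toy illustration of the difference (the feature choice, learning rates, and numbers here are made up for the example):

```python
import numpy as np

# Tabular update: only the visited state's entry changes.
q_table = np.zeros(5)
q_table[2] += 0.5 * (1.0 - q_table[2])    # observe reward 1.0 in state 2
# States never visited keep their initial value: no generalization.

# Linear function approximation over handcoded features: the same single
# observation moves the estimate for every state sharing those features.
features = np.array([[float(s), 1.0] for s in range(5)])  # feature(s) = (s, 1)
w = np.zeros(2)
td_error = 1.0 - features[2] @ w          # same observation, in state 2
w += 0.1 * td_error * features[2]
# The estimate for the never-visited state 3 has now moved too.
```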

The trick is that the set of policies considered is the one generated by all possible RL methods on the reward. This is arguably very vague and hard to construct, which is why I mentioned the hope of reducing it to the policies generated by one or a few RL methods.

Yeah, I was ignoring the "all possible RL methods" because it was vague (and I also expect it not to work for any specific formalization, e.g. you'd have to rule out RL methods that say "if the goal is G, then output <specific policy>, otherwise do regular RL", which seems non-trivial). If you use only a few RL methods, then I think I would stick with my claim about generalization:

current RL policies are typically terrible at generalization

I could add the sentence "Alternatively, if you try to use "all possible" RL algorithms, I expect that there will be many pathological RL algorithms that effectively make any policy goal-directed", if you wanted me to, but I think the version with a small set of RL algorithms seems better to me and I'd rather keep the focus on that.

I updated the summary to:

<@Goal-directedness@>(@Intuitions about goal-directed behavior@) is one of the key drivers of AI risk: it's the underlying factor that leads to convergent instrumental subgoals. However, it has eluded a good definition so far: we cannot simply say that it is the optimal policy for some simple reward function, as that would imply AlphaGo is not goal-directed (since it was beaten by AlphaZero), which seems wrong. Basically, goal-directedness should not be tied directly to _competence_. So, instead of only considering optimal policies, we can consider any policy that could have been output by an RL algorithm, perhaps with limited resources. Formally, we can construct a set of policies for G that can result from running e.g. SARSA with varying amounts of resources with G as the reward, and define the focus of a system towards G to be the distance of the system’s policy to the constructed set of policies.
When you encounter a particular state, you only update the Q-value of that state in the table, and don't do anything to the Q-values of any other state. Therefore, seeing one state will make no difference to your policy on any other state, i.e. no generalization.
You need to use function approximators of some sort to see generalization to new states. (This doesn't have to be a neural net -- you could approximate the Q-function as a linear function over handcoded features, and this would also give you some generalization to new states.)

Okay, thanks for the explanation!

Yeah, I was ignoring the "all possible RL methods" because it was vague (and I also expect it not to work for any specific formalization, e.g. you'd have to rule out RL methods that say "if the goal is G, then output <specific policy>, otherwise do regular RL", which seems non-trivial). If you use only a few RL methods, then I think I would stick with my claim about generalization:

Yes, trying to ensure that not all policies are generated is indeed the main issue here. It also underlies the resource condition. This makes me think that maybe using RL is not the appropriate way. That being said, I still think an approach exists for computing focus instead of competence. I just don't know it yet.

I could add the sentence "Alternatively, if you try to use "all possible" RL algorithms, I expect that there will be many pathological RL algorithms that effectively make any policy goal-directed", if you wanted me to, but I think the version with a small set of RL algorithms seems better to me and I'd rather keep the focus on that.

I agree that keeping the focus (!) on the more realistic case makes more sense here.

I agree with your intuition that an agent should be allowed to be bad at accomplishing its purpose.

To me the issue is that you're leaving out self-awareness of the goal. That is, to me what makes an agent fully agentic is that it not only is trying to do something but it knows it is trying to do something. This creates a feedback loop within itself that helps keep it on target.

Many agentic-ish systems like RL systems sort of look like this, but the feedback loop that keeps them on target exists outside themselves and thus the agent is actually the RL system plus the human researchers running it. Or you have undirected systems like evolution that look sort of agentic but then don't because you can "trick" them into doing things that are "not their purpose" because they don't really have one, they just execute with no sense of purpose, even if their pattern of behavior is well predicted by modeling them as if they had goals.

Notice that I didn't use the term agent, because I personally believe that goal-directedness and agency are distinct (though probably linked). So I agree with your intuition that an agent should probably know its goal, but I disagree with the proposal that it must do so to be goal-directed towards that goal.

That being said, I do agree that there is a difference in kind between the result of RL and an optimizer when following a goal. One way we (Joe Collman, Michele Campolo, Sabrina Tang and I) think about it is through the "source of directedness": self-directed (like an optimizer), hardcoded in a direction (like an optimized system), or even self-directed with some constraint/initial direction (mesa-optimizer, that is an optimized optimizer).

they just execute with no sense of purpose, even if their pattern of behavior is well predicted by modeling them as if they had goals.

On that part I agree with Dennett that the definition of having a goal is for the pattern of behavior to be well predicted by modeling the system as having goals (taking the intentional stance). It seems you disagree, but maybe this boils down to the separation between goal-directedness and agency I pointed above.

A few thoughts:

I think rather than saying "The focus of S towards G is F", I'd want to say something like "S is consistent with a focus F towards G". In particular, any S is currently going to count as maximally focused towards many goals. Saying it's maximally focusing on each of them feels strange. Saying its actions are consistent with maximal focus on any one of them feels more reasonable.

Maybe enough resources for all state values or state-action values to have been updated at least once?

This seems either too strict (if we're directly updating state values), or not strict enough (if we're indirectly updating).

E.g. if we have to visit all states in Go, that's too strict: not because it's intractable, but because once you've visited all those states you'll be extremely capable. If we're finding a sequence v(i) of value function approximations for Go, then it's not strict enough. E.g. requiring only that for each state S we can find N such that there are some v(i)(S) != v(j)(S) with i, j < N.

I don't yet see a good general condition.

Another issue I have is with goals of the form "Do A or B", and systems that are actually focused on A. I'm not keen on saying they're maximally focused on "A or B". E.g. I don't want to say that a system that's focused on fetching me bananas is maximally focused on the goal "Fetch me bananas or beat me at chess".

Perhaps it'd be better to define G not as a set of states in one fixed environment, but as a function from environments to sets of states? (was this your meaning anyway? IIRC this is close to one of Michele's setups)

This way you can say that my policy is focused if for any given environment, it's close to the outcome of non-trivial RL training within that environment. (probably you'd define a system's focus as 1/(max distance from Pol over all environments))

So in my example that would include environments with no bananas, and a mate-in-one position on the chess board.

This might avoid some of the issues with trivially maximally focused policies: they'd be maximally focused over RL training in some environments (e.g. those where goal states weren't ever reached), but not over all. So by defining G over a suitably large class of environments, and taking a minimum over per-environment focus values, you might get a reasonable result.

Typo: "valued 1 at states in and 0..." should be "valued 1 at states in G and 0..."

I think rather than saying "The focus of S towards G is F", I'd want to say something like "S is consistent with a focus F towards G". In particular, any S is currently going to count as maximally focused towards many goals. Saying it's maximally focusing on each of them feels strange. Saying its actions are consistent with maximal focus on any one of them feels more reasonable.

Honestly I don't really care about the words used, more the formalism behind it. I personally don't have any problem with saying that the system is maximally focused on multiple goals -- I see focus as measuring "what proportion of my actions are coherent with trying to accomplish the goal". But if many people find this weird, I'm okay with changing the phrasing.

E.g. if we have to visit all states in Go, that's too strict: not because it's intractable, but because once you've visited all those states you'll be extremely capable. If we're finding a sequence v(i) of value function approximations for Go, then it's not strict enough. E.g. requiring only that for each state S we can find N such that there are some v(i)(S) != v(j)(S) with i, j < N.
I don't yet see a good general condition.

Yes, as I mentioned in another comment, I'm not convinced anymore by this condition. And I don't have a decent alternative yet.

Perhaps it'd be better to define G not as a set of states in one fixed environment, but as a function from environments to sets of states? (was this your meaning anyway? IIRC this is close to one of Michele's setups)
This way you can say that my policy is focused if for any given environment, it's close to the outcome of non-trivial RL training within that environment. (probably you'd define a system's focus as 1/(max distance from Pol over all environments))

I like this idea, although I fail to see how it "solves" your problem with "A or B". I think I get the intuition: in some environments, it will be easier to reach B than A. And if your system aims towards A instead of "A or B", this might make it less focused towards "A or B" in these environments. But even then, the fact remains that, since any policy focused on A is also focused on "A or B", the focus towards "A or B" is always greater than or equal to the focus towards A. This is why I stand by my measure of triviality, or more intuitively a weight inversely proportional to the size of the goal.

Lastly, I pick a distance between policies. If the two policies are deterministic, a Hamming distance will do. If they are stochastic, maybe some vector distance based on the Kullback-Leibler divergence.

I think it might actually be very difficult to come up with a distance metric between policies that corresponds even reasonably well to behavioral similarity. I imagine that flipping the sign on a single crucial parameter in a neural net could completely change its behavior, or at least break it sufficiently that it goes from highly goal oriented behavior to random/chaotic behavior.

By analogy, imagine trying to come up with a distance metric between python source files in a way that captures behavioral similarity. Very subtle changes to source code can completely alter behavior, while drastic refactorings can leave behavior unchanged.

Ultimately we'd like to be able to handle cases where we're using network architectures that permit arbitrary Turing machines to emerge as policies, in which case determining behavioral similarity by comparing source code is equivalent to the halting problem.

Sorry for the delay in answering.

In this post, I assume that a policy is a description of its behavior (like a function from state to action or distribution over action), and thus the distances mentioned indeed capture behavioral similarity. That being said, you're right that a similar concept of distance between the internal structure of the policies would prove difficult, eventually butting against uncomputability.

What feels wrong to me is the implicit link drawn between goal-directedness and competence. A bad Go player will usually lose, but it doesn’t seem any less goal-directed to me than a stronger one that consistently wins.

Competence is thus not the whole story. It might be useful to compute goal-directedness; reaching some lower-bound of competency might even be a necessary condition for goal-directedness (play badly enough and it becomes debatable whether you're even trying to win). But when forcing together the two, I feel like something important is lost.

Competence, or knowledge? If something is unable to learn* the rules of the game, then even if it has winning as a goal, that knowledge doesn't help make useful predictions (beyond 'keeps playing').

*If it hasn't learned as well - when watching a beginner (AlphaZero just starting out) play chess you might say 'what are you trying to do?' (confusion) or even 'no, you can't do that' (breaking the rules).

What does “all policies given by using RL” mean in this case? The easy answer is all policies resulting from taking any RL method and any initial conditions, and training for any amount of resources on the reward function of the goal. But not only is this really, really uncomputable, I’m not sure it’s well defined enough [(]what are “all methods of RL”?).

A good question. Unusually, it is an open parenthesis that is missing, not a closing one.

The ghost of competence strikes back here, because I cannot really consider any amount of resources; if I did, then every policy would be maximally-focused for the goal, as it would be generated by taking the policy as an initial condition and using no resources at all.

Yes and no. A random initial policy could in theory be such a thing - though probabilistically it wouldn't be (for any non-trivial task, absent hardcoding*).

*Hardcoding isn't always optimal either (relative to the goal) - it's feasible for solved games with small solutions though, like tic-tac-toe. Which is arguably still RL, just not on a computer.

lower bound on the amount of resources the RL algorithm has to use before the resulting policy is indeed maximally-focused.

Resources may be required for a policy to be verified to be maximally-focused. I'm not sure if things get to 'maximally' in practice - though superhuman performance in chess certainly seems to qualify as 'goal-directed' within that domain*, making the goal a useful predictor.

*What other goals are there in that domain though?

Or said another way, a non-trivial goal with a small but not negligible focus exhibits [1] goal-directedness [2] than a trivial goal with enormous focus.

[#] more

Even with all those uncertainties, I still believe focus is a step in the right direction. It trims down competence to the part that seems the most relevant to goal-directedness. That being said, I am very interested in any weakness of the idea, or any competing intuition.

From the view of knowledge, this may be easy to demonstrate as follows: show a better way, and the agent will use it. (For simple tasks/improvements.) But it's easy for people to do this (with people); for programs, not necessarily.