
Goal-directedness is the term the AI Safety community uses to point to a specific property: following a goal. The term comes from Rohin Shah's sequence of posts, but the intuition pervades many safety issues and current AI approaches. Yet it lacks a formal definition, or even a decomposition into more or less formal subcomponents.

The sort of definition we're looking for depends on which questions we want to answer about goal-directed systems. There are two main questions that Rohin asks in his posts:

  • Are non-goal-directed systems or less goal-directed ones inherently safer than fully goal-directed ones?
  • Can non-goal-directed systems or less goal-directed ones be competitive with fully goal-directed ones?

Answering these will also answer the really important meta-question: should we put resources into non-goal-directed approaches to AGI?

Notice that both questions above are about predicting properties of the system based on its goal-directedness. These properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system. For if it didn't, then two systems with the same properties (safety, competitiveness) would have different goal-directedness, breaking the pattern of prediction.

Actually, this assumes that our predictor is injective: it sends different "levels" of goal-directedness to different values of the properties. I agree with this intuition, given how much performance and safety issues seem to vary according to goal-directedness. But I wanted to make it explicit.

Reiterating the point of the post: goal-directedness is a property of behavior, not internal structure. By this I mean that given the complete behavior of a system over all environments, goal-directedness is independent of what's inside the system. Or equivalently, if two systems always behave in the same way, their goal-directedness is the same, regardless of whether one contains a big lookup table and the other a homunculus.

This is not particularly original: Dennett's intentional stance pretty much says the same thing (The Intentional Stance, p. 15):

Then I will argue that any object -- or as I shall say, any system -- whose behavior is well predicted by this strategy [considering it as moving towards a goal] is in the fullest sense of the word a believer. What it is to be a true believer is to be an intentional system, a system whose behavior is reliably and voluminously predictable via the intentional strategy.

Why write a post about it, then? I'm basically saying that our definition should depend only on observable behavior, which is pretty obvious, isn't it?

Well, "goal" is a very loaded term. It is part of the set of mental states that we attribute to human beings and other agents but are reluctant to ascribe to anything else. See how I never used the word "agent" before in this post, preferring "system" instead? That was me trying to limit this instinctive thinking about what's inside. And here is the reason why I think this post is not completely useless: when looking for a definition of goal-directedness, the first intuition is to look at the internal structure. It seems obvious that goals should be somewhere "inside" the system, and thus that what really matters is the internal structure.

But as we saw above, goal-directedness should probably depend only on the complete behavior of the system. That is not to say that the internal structure is not important or useful here. On the contrary, this structure, in the form of source code for example, is usually the only thing we have at our disposal. It serves to compute goal-directedness, instead of defining it.

We thus have this split:

  • Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
  • Computing goal-directedness: depends on the internal structure, and more specifically what information about the complete behavior can be extracted from this structure.

What I see as a mistake here, a mistake I personally made, is to look for the definition in the internal structure: to look at some neural net, or some C program, and try to find where the goals are and what makes the program follow them. Instead, I think we should define and formalize goal-directedness in the ideal context of knowing the full behavior of the system, and then use interpretability and formal methods to extract what's relevant to this definition from the internal structure.
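
To make this concrete, here is a minimal toy sketch, purely illustrative and not a proposed formalization: a system is represented by nothing but its complete behavior (an action for every possible state), and goal-directedness is scored by how well that behavior matches optimal pursuit of some candidate goal, in intentional-stance style. The names, the representation of a goal by its optimal policy, and the agreement measure are all assumptions made up for the example; the only point is that the definition never looks at anything except behavior.

```python
# Toy sketch (illustration only): goal-directedness as a function of complete
# behavior. A "system" here is just a lookup from every state to an action, so
# two systems with identical behavior get identical scores by construction,
# whatever their internals. Candidate goals are represented (simplistically)
# by the optimal policy for each goal, defined on the same states.

from typing import Dict, Hashable, Iterable

State = Hashable
Action = Hashable
Behavior = Dict[State, Action]  # complete behavior: one action per state


def fit_to_goal(behavior: Behavior, optimal_for_goal: Behavior) -> float:
    """Fraction of states where the system acts as an optimal pursuer of the
    candidate goal would act (an intentional-stance style fit measure)."""
    matches = sum(behavior[s] == optimal_for_goal[s] for s in behavior)
    return matches / len(behavior)


def goal_directedness(behavior: Behavior,
                      candidate_goals: Iterable[Behavior]) -> float:
    """Best fit over the candidate goals, computed from behavior alone.
    Obtaining this in practice from source code, without enumerating every
    state, is the separate 'computing goal-directedness' problem."""
    return max(fit_to_goal(behavior, g) for g in candidate_goals)
```

Of course, all the interesting work lies in choosing candidate goals and a fit measure that deserve the name; the sketch only shows where behavior enters the definition and where internal structure does not.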

Thanks to Jérémy Perret for feedback on the writing, and to Joe Collman, Michele Campolo and Sabrina Tang for feedback on the idea.

12 comments
For if it didn't, then two systems with the same properties (safety, competitiveness) would have different goal-directedness, breaking the pattern of prediction.

This seems like a bad argument to me, because goal-directedness is not meant to be a complete determinant of safety and competitiveness; other things matter too. As an analogy, one property of my internal cognition is that sometimes I am angry. We like to know whether people are angry because (amongst other things) it helps us predict whether they are safe to be around - but there's nothing inconsistent about two people with the same level of anger being differently safe (e.g. because one of them is also tired and decides to go sleep instead of starting a fight).

If we tried to *define* anger in terms of behaviour, then I predict we'd have a very difficult time, and end up not being able to properly capture a bunch of important aspects of it (like: being angry often makes you fantasise about punching people; or: you can pretend to be angry without actually being angry), because it's a concept that's most naturally formulated in terms of internal state and cognition. The same is true for goal-directedness - in fact you agree that the main way we get evidence about goal-directedness in practice is by looking at, and making inferences about, internal cognition. If we think of a concept in cognitive terms, and learn about it in cognitive terms, then I suspect that trying to define it in behavioural terms will only lead to more confusion, and similar mistakes to those that the behaviourists made.

On the more general question of how tractable and necessary a formalism is - leaving aside AI, I'd be curious if you're optimistic about the prospect of formalising goal-directedness in humans. I think it's pretty hopeless, and don't see much reason that this would be significantly easier for neural networks. Fortunately, though, humans already have very sophisticated cognitive machinery for reasoning in non-mathematical ways about other agents.

Thanks for your comment!

Let's say that we want to predict whether or not you will punch me. We know about this famed internal state called anger, which makes it more probable that you will punch me. As you point out, it is not a one-to-one correspondence: you might be angry at me and not punch me (as most people do), or not be angry and punch me (because we're training in the boxing gym). Still, it gives me some information about whether and when you might punch me.

I'm saying that if I want to find out whether you're angry specifically for predicting whether you'll punch me, then I should define this anger using only your behavior. Because if you act exactly the same whether you're "really" angry at me or simply pretending to be angry, this is not a distinction that matters for the prediction I'm trying to make.

Notice that when I'm saying you're acting exactly the same, I assume I have full knowledge about your behavior in all situations at all times. This is completely impossible, and requires a ridiculous (possibly infinite) amount of computation, storage and other resources. It's in this context, and only in this context, that I argue that your behavior tells me all that I care about in order to predict whether you will punch me or not.

Now, I completely agree with you that using internal cognition and internal structure is the only practical way to extract information about goal-directedness and anger. What I argue for is that, insofar as the predictions you want to make are about behaviors, what the internal cognition and structure give you is a way to extrapolate the behavior, perhaps more efficiently. Not a useful definition of goal-directedness.

There are probably things you care about that are not completely about behavior: someone's well-being, for example. But I don't think goal-directedness is one of those. For me, its value lies in how it influences safety and competitiveness, which are properties of observable behavior.

This seems like a bad argument to me, because goal-directedness is not meant to be a complete determinant of safety and competitiveness; other things matter too.

After reading it again, and thinking about your anger example, I agree that this argument probably fails. What still stands for me is that given two systems with the same behavior, I want to give them the same goal-directedness. Whether or not I can compute exactly that they have the same goal-directedness is another question.

Lastly, I probably agree with you and Rohin that a complete formalization of goal-directedness (let alone in human beings) seems impossible. That being said, I do think that there might be some meaningful decomposition into subcomponents (as proposed by Rohin), and that some if not most of these components will yield to formalization. The best I'm hoping for is probably a partial order, where some subcomponents have incomparable values, but others are either binary or on a scale (like focus).

"What I argue for is that, insofar as the predictions you want to make are about behaviors, what the internal cognition and structure give you is a way to extrapolate the behavior, perhaps more efficiently. Not a useful definition of goal-directedness."

Perhaps it'd be useful to taboo the word "definition" here. We have this phenomenon, goal-directedness. Partly we think about it in cognitive terms; partly we think about it in behavioural terms. It sounds like you're arguing that the former is less legitimate. But clearly we're still going to think about it in both ways - they're just different levels of abstraction for some pattern in the world. Or maybe you're saying that it'll be easier to decompose it when we think about it on a behavioural level? But the opposite seems true to me - we're much better at reasoning about intentional systems than we are at abstractly categorising behaviour.

"What still stands for me is that given two systems with the same behavior, I want to give them the same goal-directedness."

I don't see how you can actually construct two generally intelligent systems which have this property, without them doing basically the same cognition. In theory, of course, you could do so using an infinite lookup table. But I claim that thinking about finite systems based on arguments about the infinite limit is often very misleading, for reasons I outline in this post. Here's a (somewhat strained) analogy: suppose that I'm trying to build a rocket, and I have this concept "length", which I'm using to make sure that the components are the right length. Now you approach me, and say "You're assuming that this rocket engine is longer than this door handle. But if they're both going at the speed of light, then they both become the same length! So in order to build a rocket, you need a concept of length which is robust to measuring things at light speed."

To be more precise, my argument is: knowing that two AGIs have exactly the same behaviour but cognition which we evaluate as differently goal-directed is an epistemic situation that is so far removed from what we might ever experience that it shouldn't inform our everyday concepts.

Perhaps it'd be useful to taboo the word "definition" here. We have this phenomenon, goal-directedness. Partly we think about it in cognitive terms; partly we think about it in behavioural terms. It sounds like you're arguing that the former is less legitimate. But clearly we're still going to think about it in both ways - they're just different levels of abstraction for some pattern in the world. Or maybe you're saying that it'll be easier to decompose it when we think about it on a behavioural level? But the opposite seems true to me - we're much better at reasoning about intentional systems than we are at abstractly categorising behaviour.

Rereading this comment and the ones before, I think we mean different things by "internal structure" or "cognitive terms". What I mean is what's inside the system (source code, physical brain states). What I think you mean is ascribing internal cognitive states to the system (in classic intentional stance fashion). Do you agree, or am I misunderstanding again?

So I agree completely that we will need to ascribe intentional beliefs to the system. What I was pointing at is that searching for a definition (sorry, used the taboo word) of goal-directedness in terms of the internal structure (that is, the source code for example) is misguided.

By "internal structure" or "cognitive terms" I also mean what's inside the system, but usually at a higher level of abstraction than physical implementation. For instance, we can describe AlphaGo's cognition as follows: it searches through a range of possible games, and selects moves that do well in a lot of those games. If we just take the value network by itself (which is still very good at Go) without MCTS, then it's inaccurate to describe that network as searching over many possible games; it's playing Go well using only a subset of the type of cognition the full system does.

This differs from the intentional stance by paying more attention to what's going on inside the system, as opposed to just making inferences from behaviour. It'd be difficult to tell that the full AlphaGo system and the value network alone are doing different types of cognition, just from observing their behaviour - yet knowing that they do different types of cognition is very useful for making predictions about their behaviour on unobserved board positions.

What I was pointing at is that searching for a definition (sorry, used the taboo word) of goal-directedness in terms of the internal structure (that is, the source code for example) is misguided.

You can probably guess what I'm going to say here: I still don't know what you mean by "definition", or why we want to search for it.

After talking with Evan, I think I understand your point better. What I didn't understand was that you seemed to argue that there was something other than the behavior that mattered for goal-directedness. But as I understand it now, what you're saying is that, yes, the behavior is what matters, but extracting the relevant information from the behavior is really hard. And thus you believe that computing goal-directedness in any meaningful way will require normative assumptions about the cognition of the system, at an abstract level.

If that's right, then I would still disagree with you, but I think the case for my position is far less settled than I assumed. I believe there are lots of interesting parts of goal-directedness that can be extracted from the behavior only, while acknowledging that, historically, most complex properties of a system have been hard to compute from behavior alone.

If that's not right, then I propose that we schedule a call sometime, to clarify the disagreement with more bandwidth. Actually, even if it's right, I can call to update you on the research.

Notice that both questions above are about predicting properties of the system based on its goal-directedness. These properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system. For if it didn't, then two systems with the same properties (safety, competitiveness) would have different goal-directedness, breaking the pattern of prediction.

I have two large objections to this.

First, the two questions considered are both questions about goal-directed AI. As I see it, the most important reason to think about goal-directedness is not that AI might be goal-directed, but that humans might be goal-directed. The whole point of alignment is to build AI which does what humans want; the entire concept of "what humans want" has goal-directedness built into it. We need a model in which it makes sense for humans to want things, in order to even formulate the question "will this AI do what humans want?". That's why goal-directedness matters.

If we think about goal-directedness in terms of figuring out what humans want, then it's much less clear that it should be behaviorally defined.

Second, think about the implied logic in these two sentences: 

These properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system.

Here's an analogous argument, to make the problem more obvious: I want to predict whether a system is foo based on whether it is bar. Foo-ness depends only on how big the system is, not on how red it is. Thus it makes sense to consider that bar-ness should also only depend on how big the system is, not on how red it is.

If I were to sketch out a causal graph for the implied model behind this argument, it would have an arrow/path big-ness -> foo-ness, with no other inputs to foo-ness. The claim "therefore bar-ness should also depend only on how big the system is" effectively assumes that bar-ness is on the path between big-ness and foo-ness. Assuming bar-ness is on that path, it shouldn't have a side input from red-ness, because then red-ness would be upstream of foo-ness. But that's not the only possibility; in the goal-directedness case it makes more sense for bar-ness to be upstream of big-ness - i.e. goal-directedness determines behavior, not the other way around.

Anyway, moving on...

Actually, this assumes that our predictor is injective: it sends different "levels" of goal-directedness to different values of the properties. I agree with this intuition, given how much performance and safety issues seem to vary according to goal-directedness.

I disagree with this. See Alignment as Translation: goal-directedness is a sufficient condition for a misaligned AI to be dangerous, not a necessary condition. AI can be dangerous in exactly the same way as nukes: it can make big irreversible changes too quickly to stop. This relates to the previous objection as well: it's the behavior that makes AI dangerous, and goal-directedness is one possible cause of dangerous behavior, not the only possible cause. Goal-directedness causes behavior, not vice-versa.

Overall, I'm quite open to the notion that goal-directedness must be defined behaviorally, but the arguments in this post do not lend any significant support to that notion.

Thanks for the comment, and sorry for taking so long to answer; I had my hands full with the application for the LTFF.

Except for your first one (I go into that below), I agree with all your criticisms of my argument. I also realized that the position opposite mine was not that we care about something other than the behavior, but that specifying what matters in the behavior might require thinking about the insides. I still disagree, but I don't think I have conclusive arguments for that debate. The best I can do is try to do it and see if I fail.

About your first point:

First, the two questions considered are both questions about goal-directed AI. As I see it, the most important reason to think about goal-directedness is not that AI might be goal-directed, but that humans might be goal-directed. The whole point of alignment is to build AI which does what humans want; the entire concept of "what humans want" has goal-directedness built into it. We need a model in which it makes sense for humans to want things, in order to even formulate the question "will this AI do what humans want?". That's why goal-directedness matters.

Well, the questions I care about (and the ones Rohin asked) are actually about goal-directed AI. They're about whether it must be goal-directed, and whether making it less goal-directed, or not goal-directed at all, improves its safety. So I'm clearly not considering "what humans want" first, even if it would be a nice consequence.

Well, the questions I care about (and the ones Rohin asked) are actually about goal-directed AI. They're about whether it must be goal-directed, and whether making it less goal-directed, or not goal-directed at all, improves its safety. So I'm clearly not considering "what humans want" first, even if it would be a nice consequence.

Yeah, I definitely see that you're trying to do a useful thing here, and the fact that you're not doing some other useful thing doesn't make the current efforts any less useful.

That said, I would suggest that, if you're thinking about a notion of "goal-directedness" which isn't even intended to capture many of the things people often call "goal-directedness", then maybe finding a better name for the thing you want to formalize would be a useful step. It feels like the thing you're trying to formalize is not actually goal-directedness per se, and figuring out what it is would likely be a big step forward in terms of figuring out the best ways to formalize it and what properties it's likely to have.

(Alternatively, if you really do want a general theory of goal-directedness, then I strongly recommend brainstorming many use-cases/examples and figuring out what unifies all of them.)

Drawing an analogy to my current work: if I want to formulate a general notion of abstraction, then that project is about making it work on as many abstraction-use-cases as possible. On the other hand, if I just want a formulation of abstraction to solve one or two particular problems, then a solution to that might not need to be a general formulation of abstraction - and figuring out what it does need to be would probably help me avoid the hard work of building a fully general theory.

You make a good point. Actually, I think I answered a bit too fast, maybe because I was on the defensive (given the content of your comment). We probably are actually trying to capture the intuitive notion of goal-directedness, in the sense that many of our examples, use-cases, intuitions and counter-examples draw on humans.

What I reacted against is a focus solely on humans. I do think that goal-directedness should capture/explain humans, but I also believe that studying simpler settings/systems will provide many insights that would be lost in the complexity of humans. It's in that sense that I think the bulk of the formalization/abstraction work should focus less on humans than you implied.

There is also the fact that we want to answer some of the questions raised by goal-directedness for AI safety. And thus even if the complete picture is lacking, having a theory capturing this aspect would already be significant progress.

Attempting to approach goal-directedness behaviorally is, I expect, going to run into the same problems as trying to infer policy from behaviors only: you can't do it unless you make some normative assumption. This is exactly analogous to Armstrong's No Free Lunch Theorem for value learning and, to turn it around the other way, we can similarly assign any goal whatsoever to a system based solely on its behavior unless we make some sufficiently strong normative assumption about it.

That's a very good point. I actually think we can avoid this problem, due to a couple of things:

  • As I mentioned in another comment, what I mean by "behaviorally" is not simply looking at the behavior; it also includes taking the intentional stance towards the system, and therefore making rather strong normative assumptions about it.
  • If we use focus, then not all systems are maximally focused towards all goals. Where the problem creeps back in is that many goals (like the one containing all states, which intuitively means that the goal is to reach any state) will get maximal focus from many if not all systems. My attempt at an answer is to use the triviality measure of the goal as a counterweight (see the sketch after this list). But it's still possible in theory to have two goals of equivalent triviality and equivalent focus; in that case I don't really know yet how to "choose".
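
Here is a minimal sketch of that trade-off, with stand-in formulas; "focus" and "triviality" below are placeholders made up for the example, not my actual definitions.

```python
# Toy sketch of focus vs. triviality (stand-in formulas, illustration only).
# A goal is a set of states; the goal containing every state ("reach any
# state") gets maximal focus for any system, so a triviality measure is used
# as a counterweight.

from typing import FrozenSet, Hashable, Sequence

State = Hashable
Goal = FrozenSet[State]


def focus(trajectories: Sequence[Sequence[State]], goal: Goal) -> float:
    """Stand-in for focus: fraction of observed trajectories that end inside
    the goal set."""
    return sum(traj[-1] in goal for traj in trajectories) / len(trajectories)


def triviality(goal: Goal, all_states: FrozenSet[State]) -> float:
    """Stand-in for triviality: how unconstraining the goal is (1.0 means
    every state counts as success)."""
    return len(goal) / len(all_states)


def adjusted_focus(trajectories: Sequence[Sequence[State]],
                   goal: Goal,
                   all_states: FrozenSet[State]) -> float:
    """One possible counterweight: discount focus by triviality. Two goals
    with equal focus and equal triviality still tie, which is exactly the
    remaining problem of how to 'choose'."""
    return focus(trajectories, goal) * (1.0 - triviality(goal, all_states))
```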