Goal-directedness is the term used by the AI Safety community to point to a specific property: following a goal. It comes from Rohin Shah's post in his sequence, but the intuition pervades many safety issues and current AI approaches. Yet it lacks a formal definition, or even a decomposition into more or less formal subcomponents.
Which questions we want to answer about goal-directed systems underlies the sort of definition we're looking for. There are two main questions that Rohin asks in his posts:
- Are non goal-directed systems or less goal-directed ones inherently safer than fully goal-directed ones?
- Can non-goal-directed systems or less goal-directed ones be competitive with fully goal-directed ones?
Answering these will also answer the really important meta-question: should we put resources into non-goal-directed approaches to AGI?
Notice that both questions above are about predicting properties of the system based on its goal-directedness. These properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system. For if it didn't, then two systems with the same properties (safety, competitiveness) would have different goal-directedness, breaking the pattern of prediction.
Actually, this assumes that our predictor is injective: it sends different "levels" of goal-directedness to different values of the properties. I agree with this intuition, given how much performance and safety issues seem to vary according to goal-directedness. But I wanted to make it explicit.
Reiterating the point of the post: goal-directedness is a property of behavior, not internal structure. By this I mean that given the complete behavior of a system over all environment, goal-directedness is independent of what's inside the system. Or equivalently, if two systems always behave in the same way, their goal-directedness is the same, regardless of if one contains a big lookup table and the other an homonculus.
This is not particularly original: Dennett's intentional stance pretty much says the same thing. (The Intentional Stance, p 15)
Then I will argue that any object -- or as I shall say, any system -- whose behavior is well predicted by this strategy [considering it as moving towards a goal] is in the fullest sense of the word a believer. What it is to be a true believer is to be an intentional system, a system whose behavior is reliably and voluminously predictable via the intentional strategy.
Why write a post about it, then? I'm basically saying that our definition should depend only on observable behavior, which is pretty obvious, isn't it?
Well, goal is a very loaded term. It is a part of the set of mental states we attribute to human beings, and other agents, but that we are reluctant to give to anything else. See how I never used the word "agent" before in this post, preferring "system" instead? That was me trying to limit this instinctive thinking about what's inside. And here is the reason why I think this post is not completely useless: when looking for a definition of goal-directedness, the first intuition is to look for the internal structure. It seems obvious that goals should be somewhere "inside" the system, and thus that what really matters is the internal structure.
But as we saw above, goal-directedness should probably depend only on the complete behavior of the system. That is not to say that the internal structure is not important or useful here. On the contrary, this structure, in the form of source code for example, is usually the only thing we have at our disposal. It serves to compute goal-directedness, instead of defining it.
We thus have this split:
- Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
- Computing goal-directedness: depends on the internal structure, and more specifically what information about the complete behavior can be extracted from this structure.
What I see as a mistake here, a mistake I personally made, is to look for the definition in the internal structure. To look at some neural net, or some C program, and try to find where the goals are and what makes the program follow them. Instead, I think we should define and formalize goal-directedness from the ideal context of knowing the full behavior of the system, and then use interpretability and formal methods to extract what's relevant to this definition from the internal structure.
Thanks to Jérémy Perret for feedback on the writing, and to Joe Collman, Michele Campolo and Sabrina Tang for feedback on the idea.