Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This sequence already contains a couple of blog posts exploring different aspects of goal-directedness. But one question has never been fully addressed: what constraints should a good formalization of goal-directedness satisfy? An answer is useful both for people like me who study this topic and for people trying to assess the value of this research. The following is my personal view, as always informed by discussions with Michele Campolo, Joe Collman and Sabrina Tang.

So what makes a good formalization of goal-directedness? The first part comes from what it means to formalize a set of philosophical intuitions. On that front, I agree with Vanessa in her research agenda:

Although I do not claim a fully general solution to metaphilosophy, I think that, pragmatically, a quasiscientific approach is possible. In science, we prefer theories that are (i) simple (Occam's razor) and (ii) fit the empirical data. We also test theories by gathering further empirical data. In philosophy, we can likewise prefer theories that are (i) simple and (ii) fit intuition in situations where intuition feels reliable (i.e. situations that are simple, familiar or received considerable analysis and reflection). We can also test theories by applying them to new situations and trying to see whether the answer becomes intuitive after sufficient reflection.

So the first step is to fit the main intuitions about goal-directedness. This is the point of ideas like focus, short descriptions, and locality. The core intuitions should probably emerge from a community discussion, such that a consensus is reached. Then, fitting the intuitions looks like an optimization problem: for each formalization, there is a "distance" to the intuitions. The point is to minimize this distance, or at least come close to the minimum.

Yet this is not a purely philosophical endeavor. I study goal-directedness to understand if Rohin's position in this sequence of posts is right. I want to know if we as a community should invest time and effort into less goal-directed approaches. This depends on two propositions by Rohin: that a less goal-directed system is not necessarily trivial or uncompetitive; and that being less goal-directed removes some safety issues, like convergent instrumental subgoals and wireheading.

Thus our initial optimization problem is actually a constrained optimization problem: minimizing the distance to the intuitions, while ensuring that less goal-directed systems are not necessarily trivial and that they don't suffer from the aforementioned safety issues.
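One hypothetical way to write this framing down (all symbols here are my own labels, not anything proposed in the post): let $F$ range over candidate formalizations, let $d(F, I)$ be the "distance" between $F$ and the set of core intuitions $I$, and let the two constraints be predicates on $F$. The problem then reads:

$$
\begin{aligned}
\min_{F}\quad & d(F, I) \\
\text{s.t.}\quad & \text{NonTrivial}(F) && \text{(less goal-directed systems can still be competitive)} \\
& \text{Safe}(F) && \text{(less goal-directedness removes the cited safety issues)}
\end{aligned}
$$

This is only a schematic sketch: neither the distance $d$ nor the constraint predicates are formally defined anywhere yet, and making them precise is itself part of the research problem.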

Now we have a clean description of the problem. And this entails two success modes in the limit: either finding a good enough solution to the optimization problem (positive answer), or showing that no feasible solution is good enough to capture the intuitions (negative answer). The first case would vindicate Rohin and justify a research investment into less goal-directed approaches. The second case would tell us that any concept satisfying the constraints is not really linked to goal-directedness. Obviously, we might fail to reach either limit. But providing good evidence for one or the other would already be a big step.

Viewing the study of goal-directedness in these terms also informs how I go about it: by focusing on the optimization, while regularly checking the constraints. Without an attempt at formalization, I don't know how to study the non-triviality and safety questions. I thus try to address as many intuitions as possible, and then see if the resulting theory survives contact with the constraints. Then I adapt the theory in response.

Rinse and repeat.