Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Introduction: Forward and Backward Approaches

I first started thinking about deconfusing goal-directedness after reading Rohin's series of four posts on the subject. My goal was to make sense of his arguments related to goal-directedness, and to understand whether alternatives were possible and/or viable. I thus thought of this research as quite naturally following two complementary approaches:

  • A forward approach, starting from the intuitions about goal-directedness and trying to find a satisfactory formalization from a philosophical standpoint.
  • A backward approach, starting from the arguments on AI risk using goal-directedness (like Rohin's), and trying to find what about goal-directedness made these arguments work.

In the end, both approaches would meet in the middle and inform each other, hopefully settling whether the cluster of concepts around goal-directedness was actually relevant for the arguments using the latter.

The thing is, I became less and less excited about the backward approach over time, to the point that I don't work on it anymore. I sincerely feel like most of the value will come from nailing the forward approach (with additional constraints mentioned below).

Yet I never wrote anything about my reasons for this shift, if only because I never made this approach explicit, except with my collaborators Michele Campolo and Joe Collman. Since Daniel Kokotajlo pushed for what is essentially the backward approach in a comment on our Literature Review on Goal-Directedness, I believe this is the perfect time to do so.

Trying the Backward Approach

How do we start to investigate goal-directedness through the backward approach? Through the arguments about AI risks relying on goal-directedness. Let's look at the arguments for convergent instrumental subgoals from Omohundro’s The Basic AI Drives, which require goal-directedness, as mentioned in Rohin's Coherence arguments do not imply goal-directed behavior. This becomes clear through the definition of AI used by Omohundro:

To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world.

The first convergent instrumental subgoal is self-improvement:

One kind of action a system can take is to alter either its own software or its own physical structure. Some of these changes would be very damaging to the system and cause it to no longer meet its goals. But some changes would enable it to reach its goals more effectively over its entire future. Because they last forever, these kinds of self-changes can provide huge benefits to a system. Systems will therefore be highly motivated to discover them and to make them happen. If they do not have good models of themselves, they will be strongly motivated to create them though learning and study. Thus almost all AIs will have drives towards both greater self-knowledge and self-improvement.

What feature of goal-directedness is used in this argument? That a goal-directed system will do things that clearly improve its abilities to reach its goal. Well... disentangling that from the intuitions around goal-directedness might prove difficult.

Let's look at the second convergent instrumental subgoal, rationality:

So we’ll assume that these systems will try to self-improve. What kinds of changes will they make to themselves? Because they are goal directed, they will try to change themselves to better meet their goals in the future. But some of their future actions are likely to be further attempts at self-improvement. One important way for a system to better meet its goals is to ensure that future self-improvements will actually be in the service of its present goals. From its current perspective, it would be a disaster if a future version of itself made self-modifications that worked against its current goals. So how can it ensure that future self-modifications will accomplish its current objectives? For one thing, it has to make those objectives clear to itself. If its objectives are only implicit in the structure of a complex circuit or program, then future modifications are unlikely to preserve them. Systems will therefore be motivated to reflect on their goals and to make them explicit.

In an ideal world, a system might be able to directly encode a goal like “play excellent chess” and then take actions to achieve it. But real world actions usually involve tradeoffs between conflicting goals. For example, we might also want a chess playing robot to play checkers. It must then decide how much time to devote to studying checkers versus studying chess. One way of choosing between conflicting goals is to assign them real-valued weights. Economists call these kinds of real-valued weightings “utility functions”. Utility measures what is important to the system. Actions which lead to a higher utility are preferred over those that lead to a lower utility.

Once again, the property on which these arguments rely is that goal-directed systems try to be better at accomplishing their goals.

The same can be said for all convergent instrumental subgoals in the paper, and as far as I know, every argument about AI risks using goal-directedness. In essence, the backward approach finds out that what is used in the argument is the fundamental property that the forward approach is trying to formalize. This means in turn that we should work on the deconfusion of goal-directedness from the philosophical intuitions instead of trying to focus on the arguments for AI risks, because these arguments depend completely on the intuitions themselves.

Extended Forward Approach

Of course, arguments on AI risks have a role to play. What we want, in the end, is to find out whether they hold or not. So the properties of goal-directedness should help clarify these arguments, or at least relate to them.

The model I'm working with is thus (where the flow of arrows captures the successive steps): Definition of goal-directedness -> Test against philosophical intuitions -> Test against AI risk arguments. I don't think it's especially new either; my impression is that Rohin has a similar model, although he might put more importance on the last step than I do at this point in the research.

I am also not pretending that the only way to make the AI risk arguments mentioned above work is through formalization of the cluster of intuitions around goal-directedness. There might be another really important concept that ties these arguments together, without any link to goal-directedness. It's just that at the moment, there seems to be a confluence of arguments around this concept and the intuitions linked with it. I'm taking the bet that following this lead is the best way we have right now to poke at these arguments and make them cleaner or break them.


Hmmm, it doesn't seem like these two approaches are actually that distinct. Consider: in the forward approach, which intuitions about goal-directedness are you using? If you're only using intuitions about human goal-directedness, then you'll probably miss out on a bunch of important ideas. Whereas if you're using intuitions about extreme cases, like superintelligences, then this is not so different to the backwards approach.

Meanwhile, I agree that the backward approach will fail if we try to find "the fundamental property that the forward approach is trying to formalise". But this seems like bad philosophy. We shouldn't expect there to be a formal or fundamental definition of agency, just like there's no formal definition of tables or democracy (or knowledge, or morality, or any of the other complex concepts philosophers have spent centuries trying to formalise). Instead, the best way to understand complex concepts is often to treat them as a nebulous cluster of traits, analyse which traits it's most useful to include and how they interact, and then do the same for each of the component traits. On this approach, identifying convergent instrumental goals is one valuable step in fleshing out agency; and another valuable step is saying "what cognition leads to the pursuit of convergent instrumental goals"; and another valuable step is saying "what ways of building minds lead to that cognition"; and once we understand all this stuff in detail, then we will have a very thorough understanding of agency. Note that even academic philosophy is steering towards this approach, under the heading of "conceptual engineering".

So I count my approach as a backwards one, consisting of the following steps:

  1. It's possible to build AGIs which are dangerous in a way that intuitively involves something like "agency".
  2. Broadly speaking, the class of dangerous agentic AGIs have certain cognition in common, such as making long-term plans, and pursuing convergent instrumental goals (many of which will also be shared by dangerous agentic humans).
  3. By thinking about the cognition that agentic AGIs would need to carry out to be dangerous, we can identify some of the traits which contribute a lot to danger, but contribute little to capabilities.
  4. We can then try to design training processes which prevent some of those traits from arising.

(Another way of putting this: the backwards approach works when you use it to analyse concepts as being like network 1, not network 2.)

If you're still keen to find a "fundamental property", then it feels like you'll need to address a bunch of issues in embedded agency.

Thanks for both your careful response and the pointer to Conceptual Engineering!

I believe I am usually thinking in terms of defining properties for their use, but it's important to keep that in mind. The post on Conceptual Engineering led me to this follow-up interview, which contains a great formulation of my position:

Livengood: Yes. The best example I can give is work by Joseph Halpern, a computer scientist at Cornell. He's got a couple really interesting books, one on knowledge one on causation, and big parts of what he's doing are informed by the long history of conceptual analysis. He'll go through the puzzles, show a formalization, but then does a further thing, which philosophers need to take very seriously and should do more often. He says, look, I have this core idea, but to deploy it I need to know the problem domain. The shape of the problem domain may put additional constraints on the mathematical, precise version of the concept. I might need to tweak the core idea in a way that makes it look unusual, relative to ordinary language, so that it can excel in the problem domain. And you can see how he's making use of this long history of case-based, conceptual analysis-friendly approach, and also the pragmatist twist: that you need to be thinking relative to a problem, you need to have a constraint which you can optimize for, and this tells you what it means to have a right or wrong answer to a question. It's not so much free-form fitting of intuitions, built from ordinary language, but the solving of a specific problem.

So my take is that there is probably a core/basic concept of goal-directedness, which can be altered and fitted to different uses. What we actually want here is the version fitted to AI Alignment. So we could focus on that specific version from the beginning; yet I believe that looking for the core/basic version and then fitting it to the problem is more efficient. That might be a source of our disagreement.

(By the way, Joe Halpern is indeed awesome. I studied a lot of his work related to distributed systems, and it's always the perfect intersection of a philosophical concept and problem with a computer science treatment and analysis.)

Hmmm, it doesn't seem like these two approaches are actually that distinct. Consider: in the forward approach, which intuitions about goal-directedness are you using? If you're only using intuitions about human goal-directedness, then you'll probably miss out on a bunch of important ideas. Whereas if you're using intuitions about extreme cases, like superintelligences, then this is not so different to the backwards approach.

I resolve the apparent paradox that you raise by saying that the intuitions are about the core/basic idea which is close to human goal-directedness; but that it should then be fitted and adapted to our specific application of AI Alignment.

Meanwhile, I agree that the backward approach will fail if we try to find "the fundamental property that the forward approach is trying to formalise". But this seems like bad philosophy. We shouldn't expect there to be a formal or fundamental definition of agency, just like there's no formal definition of tables or democracy (or knowledge, or morality, or any of the other complex concepts philosophers have spent centuries trying to formalise). Instead, the best way to understand complex concepts is often to treat them as a nebulous cluster of traits, analyse which traits it's most useful to include and how they interact, and then do the same for each of the component traits. On this approach, identifying convergent instrumental goals is one valuable step in fleshing out agency; and another valuable step is saying "what cognition leads to the pursuit of convergent instrumental goals"; and another valuable step is saying "what ways of building minds lead to that cognition"; and once we understand all this stuff in detail, then we will have a very thorough understanding of agency. Note that even academic philosophy is steering towards this approach, under the heading of "conceptual engineering".

Agreed. My distinction of forward and backward felt shakier by the day, and your point finally puts it out of its misery.

So I count my approach as a backwards one, consisting of the following steps:

  1. It's possible to build AGIs which are dangerous in a way that intuitively involves something like "agency".
  2. Broadly speaking, the class of dangerous agentic AGIs have certain cognition in common, such as making long-term plans, and pursuing convergent instrumental goals (many of which will also be shared by dangerous agentic humans).
  3. By thinking about the cognition that agentic AGIs would need to carry out to be dangerous, we can identify some of the traits which contribute a lot to danger, but contribute little to capabilities.
  4. We can then try to design training processes which prevent some of those traits from arising.

My take on your approach is that we're still at 3, and we don't yet have a good enough understanding of those traits/properties to manage 4. As for how to solve 3, I reiterate that finding a core/basic version of goal-directedness and adapting it to the use case seems the way to go for me.

Thanks for this! My hot take: I'm worried about this methodology because it seems that the "test against philosophical intuitions" step might rule out (or steer you away from) definitions of goal-directedness that work fine for the AI risk arguments. I think I agree that philosophical intuitions have a role to play, but it's more like a prior-constructing or search-guiding role than a test that needs to be passed or a constraint that needs to be met. And perhaps you agree with this also, in which case maybe we don't disagree at all.

Yep, we seem to agree.

It might not be clear from the lit review, but I personally don't agree with all the intuitions, or not completely. And I definitely believe that a definition that throws out some part of the intuitions but applies to the AI risk arguments is totally fine. It's more that I believe the gist of these intuitions is pointing in the right direction, and so I want to keep them in mind.

my impression is that Rohin has a similar model, although he might put more importance on the last step than I do at this point in the research.

I agree with this summary.

I suspect Daniel Kokotajlo is in a similar position as me; my impression was that he was asking that the output be that-which-makes-AI-risk-arguments-work, and wasn't making any claims about how the research should be organized.

Good to know that my internal model of you is correct at least on this point.

As for Daniel, given his comment on this post, I think we actually agree, but he puts more explicit emphasis on that-which-makes-AI-risk-arguments-work, as you wrote.