Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

tl;dr: Most good plans involve taking steps to make better plans. Making better plans is the convergent instrumental goal, of which all familiar convergent instrumental goals are an instance. This is key to understanding what agency is and why it is powerful.

Planning means using a world model to predict the consequences of various courses of action one could take, and taking actions that have good predicted consequences. (We think of this with the handle “doing things for reasons,” though we acknowledge this may be an idiosyncratic use of “reasons.”)

We take “planning” to include things that are relevantly similar to this procedure, such as following a bag of heuristics that approximates it. We’re also including actually following the plans, in what might more clunkily be called “planning-acting.”

Planning, in this broad sense, seems essential to the kind of goal-directed, consequential, agent-like intelligence that we expect to be highly impactful. This sequence explains why.

One Convergent Instrumental Goal to Rule them All

Consider the maxim

“make there be more and/or better planning towards your goal.”

This section argues that all the classic convergent instrumental goals are special cases of this maxim.

To flesh this out a little, here are some categories of ways to follow the maxim. Remember that a planner is typically close (in terms of what it can affect through action) to at least one planner, namely itself, so these directions can typically be applied first of all to the planner itself.

  • Make the planners with your goal better at planning. For example, get them new relevant data to work with¹, get them to run faster or more effective algorithms, build protections against value drift, etc.
  • Make the planners with your goal have better options. For example, move them to better locations, get them more resources, get them more power or a greater number of options to select from, have them take steps in an object-level plan towards the goal.
  • Make there be more planners with your goal. For example, keep yourself running and aligned with your goal, acquire delegates and subordinates, convince followers and converts, build successors.

 

Reviewing Omohundro's “The Basic AI Drives” and Bostrom’s “The Superintelligent Will,” we extract a list of convergent instrumental goals, and find that they are all instances of the maxim “make there be more/better planners for the current goal:”

  • Self-preservation / self-protection:
    • Make there be more planners that have your goals, focusing on reusing the existing planner, that is, preventing its destruction.
  • Self-improvement:
    • Make there be better planners with your goals, focusing on making the existing one better.
  • Resource Acquisition:
    • Make there be better planners with your goals, focusing on making the existing one able to take more effective actions.
  • Goal-content Integrity:
    • Make there be more planners that have your goals, focusing on ensuring the existing planners that have your goals keep those goals and avoid them being changed.
  • Resource-use Efficiency:
    • Same as self-improvement
  • Cognitive Enhancement:
    • Same as self-improvement
  • Creativity:
    • Same as self-improvement
  • Technological Perfection:
    • Same as resource acquisition and/or self-improvement
  • Rationality:
    • Omohundro takes this to be something like “make the utility function explicit” along with “maximize expected utility.”
    • Thus it is similar to goal-content integrity and self-improvement.
  • Utility-function preservation:
    • Similar to goal-content integrity.
  • Prevent counterfeit utility:
    • Essentially this is avoiding wireheading. Omohundro: “An important class of vulnerabilities arises when the subsystems for measuring utility become corrupted.”
    • Thus it is similar to goal-content integrity.

 

Seeking a concise, memorable-yet-accurate name for this maxim, this convergent-instrumental-goal-to-rule-them-all, we settled on P2B:

P2B ≔ Plan to P2B Better

This name emphasizes the recursive, feedback-loopy aspect of the phenomenon, which is only implicit in the idea of “better plans.”

Why P2B Works

There are several ways for planning to be ineffective, such as an inaccurate or unwieldy world model, a limited selection of actions to choose from, or inefficient use of time or other resources in predicting or assessing consequences. But often planners can and will address these issues: planning is self-correcting, thanks to P2B. A planner that didn’t recognize the importance of P2B, or was unable to do it for some reason, would not be self-correcting.

Instrumental goals are about passing the buck: if you are a planner, and you can’t achieve your final goal with a single obvious action (or sequence of actions), you can instead pass the buck to something else, typically your future self.² There will often be obvious available actions that put the receiver of the buck “closer” to achieving the final goal than you.³

P2B is what it means to generically pass the buck, closing some distance along the way. The “better” in P2B means being closer to the goal and/or generally able to close distance faster. Convergent instrumental goals are ways to close distance that work for almost any final goal, hence they are instances of P2B.

Objections & Nuances

Isn’t this trivial?

One might complain that we’ve defined P2B broadly enough that our claim about it being the convergent instrumental goal is trivial: true by definition, since we defined “planner” and “better” so broadly. Fair enough; we are doing this because we think it's a useful framing/foundation for answering questions about agency, not because we think it is important or interesting on its own. We agree that the more detailed taxonomies of instrumentally convergent goals are useful. We just think it is also useful to have this unified frame. We intend to write subsequent posts making use of this frame.

Most agents aren’t planners though?

Yes they are — remember, we said above that we are defining “planner” broadly to include relevantly similar algorithms/procedures. Let’s flesh that idea out a bit more…

We think it’s OK to talk loosely about families of algorithms. When we say “planning,” for example, we are gesturing at a vague cluster of algorithms that has “for each action, imagine the expected consequences of that action, then evaluate how good those consequences are, then pick the action that had the best expected consequences” as a central example. 
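The central example can be written down in a few lines. This is only a minimal sketch of that one paradigmatic procedure, not a definition of the whole family: the names `plan`, `toy_world_model`, and `toy_utility`, and the probabilities in the toy problem, are all hypothetical and chosen for illustration.

```python
from typing import Callable, Dict, Iterable


def plan(
    actions: Iterable[str],
    world_model: Callable[[str], Dict[str, float]],
    utility: Callable[[str], float],
) -> str:
    """Return the action with the highest expected utility."""

    def expected_utility(action: str) -> float:
        # Imagine the consequences of the action (outcome -> probability),
        # then evaluate how good those consequences are on average.
        return sum(p * utility(outcome) for outcome, p in world_model(action).items())

    # Pick the action that had the best expected consequences.
    return max(actions, key=expected_utility)


# Hypothetical toy problem: walking toward a door while holding a phone.
def toy_world_model(action: str) -> Dict[str, float]:
    if action == "look_up":
        return {"reach_door": 0.9, "bump_wall": 0.1}
    return {"reach_door": 0.3, "bump_wall": 0.7}


def toy_utility(outcome: str) -> float:
    return 1.0 if outcome == "reach_door" else 0.0


best = plan(["keep_scrolling", "look_up"], toy_world_model, toy_utility)
```

Here `best` comes out as `"look_up"`, since its expected utility (0.9) exceeds that of `"keep_scrolling"` (0.3) under the assumed numbers.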

We are not saying that every planning algorithm must be exactly of that form. Examining exactly where the boundaries of these concepts lie is an interesting and potentially valuable rabbit hole that we don’t feel the need to go down yet.

One thing we do wish to say is that we intend to include algorithms which behave similarly to the paradigmatic planning algorithm mentioned above. One easy way to generate algorithms like this is by automating bits of the process with heuristics. For example, maybe instead of calculating the expected consequences of every action all the time, the algorithm has a bag of heuristics that tell it when to calculate and when not to bother (and what to do instead), and the bag of heuristics tends to yield similar results for less computational expense, at least in some relevant class of environments.⁴
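One way to picture this is a planner that consults its heuristics first and only falls back to explicit calculation when none applies. The sketch below assumes a deliberately tiny environment (a number line with a target position); `GOAL`, `ACTIONS`, `HEURISTICS`, and the function names are hypothetical and stand in for whatever a real system would use.

```python
from typing import Callable, List, Tuple

GOAL = 10          # hypothetical target position on a number line
ACTIONS = (-1, +1)  # step left or step right


def predict(state: int, action: int) -> int:
    """World model: where we end up after taking the action."""
    return state + action


def utility(state: int) -> float:
    """Closer to the goal is better."""
    return -abs(GOAL - state)


def full_plan(state: int) -> int:
    """Explicitly evaluate the predicted consequence of every action."""
    return max(ACTIONS, key=lambda a: utility(predict(state, a)))


# Each heuristic is a pair: (does it apply to this state?, what to do instead).
Heuristic = Tuple[Callable[[int], bool], Callable[[int], int]]
HEURISTICS: List[Heuristic] = [
    (lambda s: s < GOAL, lambda s: +1),  # "if left of the goal, step right"
]


def heuristic_plan(state: int) -> int:
    """Use a cheap heuristic when one applies; otherwise calculate."""
    for applies, shortcut in HEURISTICS:
        if applies(state):
            return shortcut(state)
    return full_plan(state)


# In this environment the two procedures choose the same actions.
agree = all(heuristic_plan(s) == full_plan(s) for s in range(-5, 25))
```

In this toy setting the heuristic version matches the explicit planner on every tested state while skipping the calculation whenever the shortcut applies. That agreement is a property of this particular environment and is not guaranteed elsewhere.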

(We haven’t defined “agents” yet, but you can probably guess from what we’ve said that our definition is going to resemble Dennett’s Intentional Stance.)

What about the procrastination paradox?

A planner that “P2Bs forever”, without ever taking “object-level” actions in plans that aren’t about making better future plans/planners, won’t be very effective at achieving its goal. But P2B is not the only strategy a planner should pursue — we have only said that P2B is the convergent instrumental goal. Whenever there are obvious actions that directly lead towards the goal, a planner should take them instead.
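As a toy illustration of this trade-off (not a model proposed in this post), suppose, purely hypothetically, that each step spent on P2B doubles the planner's speed, and that only object-level steps make progress toward the goal. The function name `goal_progress` and the doubling rule are assumptions made for the sketch.

```python
def goal_progress(horizon: int, invest_steps: int, base_speed: float = 1.0) -> float:
    """Progress made when the first `invest_steps` steps go to P2B.

    Assumption: every P2B step doubles speed, and the remaining
    (horizon - invest_steps) steps are object-level steps at that speed.
    """
    if not 0 <= invest_steps <= horizon:
        raise ValueError("invest_steps must be between 0 and horizon")
    speed = base_speed * 2 ** invest_steps
    return speed * (horizon - invest_steps)


HORIZON = 10
never_invest = goal_progress(HORIZON, 0)          # only object-level actions
invest_mostly = goal_progress(HORIZON, 8)         # mostly P2B, then act
invest_forever = goal_progress(HORIZON, HORIZON)  # P2B until time runs out
```

Under these assumptions, never investing yields 10 units of progress, investing for 8 of the 10 steps yields 512, and investing for all 10 yields 0. The last case is the failure mode described above: improving forever without ever acting.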

The danger of taking instrumental actions forever can show up in some toy decision problems. However, in realistic cases and for realistic planners, this is not so much of an issue — one can pursue convergent instrumental goals without ceasing to keep an eye out for opportunities to achieve terminal goals. Nevertheless, due to the automation-of-bits-of-the-core-algorithm phenomenon described above, it’s not uncommon for agents to end up pursuing P2B as a terminal goal, or even pursuing sub-sub-subgoals of P2B such as “acquire money” as final goals. As Richard Ngo pointed out, we should expect mesa-optimizers to develop terminal goals for power, survival, learning, etc. because such things are useful in a wide range of environments and therefore probably useful in the particular environment they are being trained in.


Footnotes

1. This was an “aha” moment for me: Even such everyday actions as “briefly glance up from your phone so you can see where you are going when walking through a building” are instances of following this maxim! You are looking up from your phone so that you can acquire more relevant data (the location of the door, the location of the door handle, etc.) for your immediate-future-self to make use of. Your immediate-future-self will have a slightly better world-model as a result, and thus be better than you at making plans. In particular, your immediate future self will be able, e.g., to choose the correct moment & location to grab the door handle, by contrast with your present self who is looking at Twitter and does not know where to grab.

2. This phrasing makes it sound like your goal is binary and permanent, either achieved or not. For more typical goals, which look more like utility functions, we think the same point would apply but would be more unwieldy to state.

3. To make this analogy to covering distance from a target more precise, consider something like the edit distance between the world as it is and any variation satisfying one’s final goal. The world is always changing, but the fraction of changes that are reducing this edit distance increases when there are more capable and effective planners with that goal.

4. Or perhaps it’s even heuristics all the way down, but it’s a sophisticated bag of heuristics that behaves as if it were following the calculate-expected-consequences-then-pick-the-best procedure, at least in some relevant class of environments. Note that we have the intuition that, generally speaking, substituting heuristics for bits of the core algorithm risks increasing “brittleness”/“narrowness,” i.e., problematically reducing the range of environments in which the system behaves like a planner.
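The distance analogy in footnote 3 can be made concrete in a very small setting. The sketch below assumes the world is a short tuple of binary features and uses Hamming distance (substitutions only) as a simplified stand-in for edit distance; the names `hamming`, `distance_to_goal`, and `at_least_two_set` are hypothetical.

```python
from itertools import product
from typing import Callable, Optional, Tuple

World = Tuple[int, ...]


def hamming(a: World, b: World) -> int:
    """Number of features on which two worlds differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_to_goal(world: World, satisfies_goal: Callable[[World], bool]) -> Optional[int]:
    """Smallest number of feature changes needed to reach any goal-satisfying world.

    Enumerates every binary world of the same size, so it is only
    feasible for tiny examples. Returns None if no world satisfies the goal.
    """
    candidates = (w for w in product((0, 1), repeat=len(world)) if satisfies_goal(w))
    return min((hamming(world, w) for w in candidates), default=None)


# Hypothetical goal: at least two of the three features are switched on.
def at_least_two_set(world: World) -> bool:
    return sum(world) >= 2


far = distance_to_goal((0, 0, 0), at_least_two_set)
near = distance_to_goal((1, 0, 0), at_least_two_set)
done = distance_to_goal((1, 1, 0), at_least_two_set)
```

Here `far`, `near`, and `done` are 2, 1, and 0 respectively: each change that switches a needed feature on reduces the distance by one.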

17 comments

How about "if I contain two subagents with different goals, they should execute Pareto-improving trades with each other"? This is an aspect of "becoming more rational", but it's not very well described by your maxim, because the maxim includes "your goal" as if that's well defined, right?

Unrelated topic: Maybe I didn't read carefully enough, but intuitively I treat "making a plan" and "executing a plan" as different, and I normally treat the word "planning" as referring just to the former, not the latter. Is that what you mean? Because executing a plan is obviously necessary too ....

Shooting from the hip: The maxim does include "your goal" as if that's well-defined, yeah. But this is fair, because this is a convergent instrumental goal; a system which doesn't have goals at all doesn't have convergent instrumental goals either. To put it another way: It's built into the definition of "planner" that there is a goal, a goal-like thing, something playing the role of goal, etc.

Anyhow, so I would venture to say that insofar as "my subagents should execute Pareto-improving trades" does not in fact further my goal, then it's not convergently instrumental, and if it does further my goal, then it's a special case of self-improvement or rationality or some other shard of P2B.

Re point 2:

We take “planning” to include things that are relevantly similar to this procedure, such as following a bag of heuristics that approximates it. We’re also including actually following the plans, in what might more clunkily be called “planning-acting.”

We take “planning” to include things that are relevantly similar to this procedure, such as following a bag of heuristics that approximates it.

In theory, optimal policies could be tabularly implemented. In this case, it is impossible for them to further improve their "planning." Yet optimal policies tend to seek power and pursue convergent instrumental subgoals, such as staying alive. 

So I'm not (yet) convinced that this frame is useful reductionism for better understanding subgoals. It feels somewhat unnatural to me, although I am also getting a tad more S1-excited about the frame as I type this comment. 

In particular, I think this is a great point:

Instrumental goals are about passing the buck: if you are a planner, and you can’t achieve your final goal with a single obvious action (or sequence of actions), you can instead pass the buck to something else, typically your future self. There will often be obvious available actions that put the receiver of the buck “closer” to achieving the final goal than you.

This at least rings true in my experience—when I don't know what to do for my research, I'll idle by "powering up" and reading more textbooks, delegating to my future self (and also allowing time for subconscious brainstorming).

I realized we forgot to put in the footnotes! There was one footnote which was pretty important, I'll put it here because it's related to what you said. It was a footnote after the "make the planners with your goal better at planning" sub-maxim.

This was an “aha” moment for me: Even such everyday actions as “briefly glance up from your phone so you can see where you are going when walking through a building” are instances of following this maxim! You are looking up from your phone so that you can acquire more relevant data (the location of the door, the location of the door handle, etc.) for your immediate-future-self to make use of. Your immediate-future-self will have a slightly better world-model as a result, and thus be better than you at making plans. In particular, your immediate future self will be able, e.g., to choose the correct moment & location to grab the door handle, by contrast with your present self who is looking at Twitter and does not know where to grab.

Footnotes are now added in (thanks Ramana!)

In theory, optimal policies could be tabularly implemented. In this case, it is impossible for them to further improve their "planning."

That sounds wrong. Planning as defined in this post is sufficiently broad that acting like a planner makes you a planner. So if you unwrap a structural planner into a tabular policy, the latter would improve its planning (for example by taking actions that instrumentally help it accomplish the goal we can best ascribe it using the intentional stance).

Another way of framing the point IMO is that the OPs define planning in terms of computation instead of algorithm, and so planning better means facilitating or making the following part of the computation more efficient.

Your proposed reformulation of convergent subgoals sounds interesting, but I see a big flaw in your post: you don't even state the applications you're doing the deconfusion for. And in my book, the applications are THE way of judging whether deconfusion is creating valuable knowledge. So I don't know yet if your framing will help with the sort of problems related to agency and goal-directedness that I think matter.

Reserving judgment until the follow up posts then.

Fair enough; apologies. We are building to an answer to the question "What is agency and why is it powerful/competitive/incentivised/selected-for." We have a lot more to say on the subject but we decided to break it into pieces; this post is the first piece.

Exciting! Waiting for the next posts even more then.

Don't get your expectations too high, haha. We haven't written the other parts yet, maybe they won't turn out to be that good.

we have only said that P2B is the convergent instrumental goal. Whenever there are obvious actions that directly lead towards the goal, a planner should take them instead.

Hmm, given your general definition of planning, shouldn't it include realizations (and their corresponding guided actions) of the form "further thinking about this plan is worse than already acquiring some value now", so that P2B itself already includes acquiring the terminal goal (and optimizing solely for P2B is thus optimal)?

I guess your idea is "plan to P2B better" means "plan with the sole goal of improving P2B", so that it's a "non-value-laden" instrumental goal.

I guess your idea is "plan to P2B better" means "plan with the sole goal of improving P2B", so that it's a "non-value-laden" instrumental goal.

Yeah. Here's another way of putting it: The best way to achieve goal X, for almost all goals X, is to mostly focus on achieving the goal of P2B, and just devote a tiny amount of cognitive effort every once in a while to think about how to achieve X.

I enjoyed this post. Were you inspired by HCH at all? Both occupy the same mental space for me.

Thanks! I don't think so, at least not directly. I can't speak for Ramana but for me my biggest influence was Gwern's "Why Tool AIs Want To Be Agent AIs."

The idea of using a recursive acronym was definitely inspired by HCH though.

Saying that resource acquisition is in the service of improved planning (because it makes future plans better) seems like a bit of a stretch - you could just as easily say that improved planning is in the service of resource acquisition (because it lets you use resources you couldn't before). "But executing plans is how you get the goal!" you might say, and "But using your resources is how you get to the goal!" is the reply.

Maybe this is nitpicking, because I agree with you that there is some central thing going on here that is the same whatever you choose to call "more fundamental." Some essence of getting to the goal, even though the world is bigger than me. So I'm looking forward to where this is headed.

more planners

This seems tenuous compared to "more planning substrate". Redundancy and effectiveness specifically through setting up a greater number of individual planners, even if coordinated, is likely an inferior plan. There are probably better uses of hardware that don't have this particular shape.