Recently, there's been a strong push against "wrapper-minds" as a framework. It's argued that there's no specific reason to think that all sufficiently advanced agents would format their goals in terms of expected-utility maximization over future trajectories, and that this view predicts severe problems with e.g. Goodharting that just wouldn't show up in reality.[1]

I think these arguments have merit, and shard theory's model definitely seems to correspond to a real stage in agents' value formation.

But I'd like to offer a fairly prosaic argument in favor of wrapper-minds.


Suppose that we have some agent which is being updated by some greedy optimization process (the SGD, evolution, etc.). On average, updates tend to decrease the magnitude of every subsequent update — with each update, the agent requires less and less correction.

We can say that this process optimizes the agent for good performance according to some reward function R, or that it chisels "effective cognition" into that agent according to some rule.
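(To make the setup concrete, here's a minimal toy sketch of such a greedy local-update loop. It's purely illustrative and mine: the quadratic "reward" and all the numbers are made up, standing in for whatever R and update rule the real process uses.)

```python
# Toy greedy local-update loop: one scalar "agent parameter" theta, updated by
# gradient ascent on an illustrative reward R(theta) = -(theta - target)^2.
# (Both this R and the numbers are stand-ins, not anything specific from the post.)

def grad_R(theta: float, target: float = 3.0) -> float:
    """Gradient of the illustrative reward R(theta) = -(theta - target)^2."""
    return -2.0 * (theta - target)

theta = 0.0          # the agent's single parameter
learning_rate = 0.1

for step in range(8):
    update = learning_rate * grad_R(theta)  # greedy, purely local correction
    theta += update
    print(f"step {step}: theta = {theta:.3f}, |update| = {abs(update):.4f}")

# |update| shrinks every step: the better the agent already scores on R,
# the less the outer process changes it.
```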

The wrapper-mind argument states that any "sufficiently strong" agent found by this process would:

  1. Have an explicit representation of R inside itself, which it would explicitly pursue.
  2. Pursue only R, at the expense of everything else in the universe.

I'll defend them separately.

Point 1. It's true that explicit R-optimization is suboptimal in many contexts. Consequentialism is slow, and shallow environment-optimized heuristics often perform just as well while being much faster. Other environments can be just "solved" — an arithmetic calculator doesn't need to be a psychotic universe-eater to do its job correctly. And for more complex environments, we can have shard economies, whose collective goals, taken in sum, would be a strong proxy of R.

But suppose that the agent's training environment is very complex and very diverse indeed. Or, equivalently, that it sometimes jumps between many very different and complex environments, and sometimes ends up in entirely novel, never-before-seen situations. We would still want it to do well at R in all such cases[2]. How can we do so?

Just "solving" environments, as with arithmetic, may be impossible or computationally intractable. Systems of heuristics or shard economies also wouldn't be up to the task — whatever proxy goal they're optimizing, there'd be at least one environment where it decouples from .

It seems almost tautologically true, here, that the only way to keep an agent pointed at R given this setup is to explicitly point it at R. Nothing else would do!

Thus, our optimization algorithm would necessarily find an R-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).

Point 2. But why would that agent be shaped to pursue only R, and so strongly that it'll destroy everything else?

This, more or less, also has to do with environment diversity, plus some instrumental convergence.

As the optimization algorithm is shaping our agent, the agent will be placed in environments where it has precious few resources, or a low probability of scoring well at R (= a high probability of receiving a strong update/correction after this episode ends).

Without knowing when such a circumstance would arise, how can we prepare our agent for this?

We can make it optimize for R strongly — as strongly as it can, in fact. Acquire as many resources as possible, spend them on nothing but R-pursuit, minimize uncertainty of scoring well at R, and so on.

Every goal that isn't R would distract from R-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away, with update-strength proportional to how distracting a goal is.

Every missed opportunity to grab resources that can be used for R-pursuit, or a failure to properly optimize a plan for R-pursuit, would eventually lead to scoring badly at R. And so our optimization algorithm would instill a drive to take all such opportunities.

Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue R, but to maximize R-pursuit — at the expense of everything else.
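As a toy illustration of this dynamic (again, a made-up sketch of mine, not a model of any real training setup): suppose the agent splits its resources between R-pursuit (a fraction f) and everything else, and the outer process applies an update whose strength is proportional to how far the agent fell short on R, i.e. to how "distracting" its other goals were.

```python
# Toy model: f is the fraction of resources the agent devotes to R-pursuit.
# Each episode, the outer process nudges f upward with strength proportional
# to the shortfall on R, i.e. to how "distracting" the other goals were.
# (Illustrative only; f, the shortfall model, and step_size are made up.)

f = 0.3          # initial fraction of resources devoted to R-pursuit
step_size = 0.2

for episode in range(25):
    shortfall = 1.0 - f          # how far the agent fell short on R
    f += step_size * shortfall   # update-strength proportional to the distraction

print(round(f, 3))  # ~0.997: what survives is a policy that spends nearly everything on R
```

Iterating that update drives f toward 1: the policies that survive the process are the ones spending essentially everything on R.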


What should we take away from this? What should we not take away from this?

  • I should probably clarify that I'm not arguing that inner alignment isn't a problem, here. Aligning a wrapper-mind to a given goal is a very difficult task, and one I expect "blind" algorithms like the SGD to fail horribly at.
  • I'm not saying that the shard theory is incorrect — as I'd said, I think shard systems are very much a real developmental milestone of agents.

But I do think that we should very strongly expect the SGD to move its agents in the direction of R-optimizing wrapper-minds. Said "movement" would be very complex, a nuanced path-dependent process that might lead to surprising end-points, or (as with humans) might terminate at a halfway point. But it'd still be movement in that direction!

And note the fundamental reasons behind this. It isn't because wrapper-mind behavior is convergent for any intelligent entity. Rather, it's a straightforward consequence of every known process for generating intelligent entities — the paradigm of local updates according to some outer function. Greedy optimization processes essentially search for mind-designs that would pre-empt any update the greedy optimization process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update. That's why. (In a way, it's because greedy optimization processes are themselves goal-obsessed wrappers.)

We wouldn't get clean wrapper-minds out of all of this, no. But they, and concerns related to them, still merit central attention.

  1. ^

    Plus some more fundamental objections to utility-maximization as a framework, on which I haven't properly updated yet, but which (I strongly expect) do not contradict the point I want to make in this post.

  2. ^

    That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.

cfoster0:

Yeah I disagree pretty strongly with this, though I am also somewhat confused what the points under contention are.

I think that there are two questions that are separated in my mind but not in this post:

  1. What will the motivational structure of the agent that a training process produces be? (a wrapper-mind? a reflex agent? a bundle of competing control loops? a hierarchy of subagents?)
  2. What will the agent that a training process produces be motivated towards? (the literal selection criterion? a random correlate of the selection criterion? a bunch of correlates of the selection criterion and correlates of those correlates? something else? not enough information to tell?)

As an example, you could have a wrapper-mind that cares about some correlate of R but not R itself. If it is smart, such an agent can navigate the selection process just as well as an R-pursuer, so the optimization algorithm cannot distinguish it from an R-pursuer, so selection pressure arguments like the ones in this post can't establish that we'll get one over the other. That's an argument about what the agent will care about, holding the structure fixed.

I simultaneously think:

  1. We should not be assuming that wrapper-minds are a natural or privileged structure for cognition. AFAICT this post doesn't even try to argue for this, saying instead "It isn't because wrapper-mind behavior is convergent for any intelligent entity."
  2. Even conditioning on getting a wrapper-mind from the training process, we should not expect it to necessarily pursue R as its goal. AFAICT the post is arguing against this.

Thus, our optimization algorithm would necessarily find an R-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).

Every goal that isn't R would distract from R-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away, with update-strength proportional to how distracting a goal is.

What does this mean? I can easily imagine training trajectories where we get an agent (even a highly competent, goal-directed one) that is not an R-pursuer, much less an R wrapper-mind, even though we "selected for R" throughout training. I expect that in such a scenario you would reply that the environments must not have been sufficiently diverse, or that the optimization algorithm hasn't updated away that goal yet, or that our optimization algorithm is too weak/dumb, or that we did not select hard enough for R, so the counterexample therefore doesn't count. But if so then I'm at a loss, because it seems like this turns into "if we select hard enough to get an R-pursuer then we'll get an R-pursuer". Only tautologically true and not anticipation-constraining.

Greedy optimization processes essentially search for mind-designs that would pre-empt any update the greedy optimization process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update.

Becoming an R-pursuer isn't the only way to get a minimal update.

If the agent stops exploration, or systematically avoids rewards, or breaks out of the training process entirely, etc. that would also be minimally updated, and none of those require being an R-pursuer! So our search for mind-designs turns up all sorts of agents that pursue all sorts of things.

As an example, you could have a wrapper-mind that cares about some correlate of R but not R itself. If it is smart, such an agent can navigate the selection process just as well as an R-pursuer

... By figuring out what R is and deciding to act as an R-pursuing wrapper-mind, therefore essentially becoming an R-pursuing wrapper-mind. With the only differences being that it 1) self-modified into one at runtime, instead of being like this from the start, and 2) it'd decide to "stop pretending" in some hypothetical set of situations/OOD, but that set will shrink the more diverse our training environment is (the fewer OOD situations there are). No?

I suppose you can instead reframe this post as making a claim about target behavior, not structure. But I don't see how you can keep an agent robustly pointed at R under sufficient diversity without making its outer loop pointed at R, so the claim about behavior is a claim about structure.

Maybe the outer loop doesn't "literally" point at R, in whatever sense, but it has to be such that it uniquely identifies R and re-aims the entire agent at R, if it ever happens that the agent's current set of shards/heuristics becomes misaligned with R.

Even conditioning on getting a wrapper-mind from the training process, we should not expect it to necessarily pursue R as its goal. AFAICT the post is arguing against this.

No? I specifically point out that inner misalignment is very much an issue. But the target should at least be a proxy of R, and that proxy would be closer and closer to R in goal-space the more diverse the training environment is.

it seems like this turns into "if we select hard enough to get an R-pursuer then we'll get an R-pursuer"

Well, yes. As we increase a training environment's diversity, we essentially constrain the set of R's an agent can be pointed towards. Every additional training scenario is information about what R is and what it isn't; and that information implicitly gets written into the agent, modifying it to be more robustly pointed at R and away from not-R/imperfect proxies of R. An idealized training process, with "full" diversity and trained to zero loss, uniquely identifies R and generates an agent that is always robustly pointed at R in any situation.

The actual training processes we get are only approximations of that ideal — they're insufficiently diverse, or we fail to train to zero loss, etc. But inasmuch as they approximate the ideal, the agents they output approximate the idealized R-optimizer.
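A crude way to picture that constraining effect (a toy sketch of mine; the scenarios and candidate proxy goals below are hypothetical placeholders): treat each training scenario as a filter on which candidate goals remain behaviorally indistinguishable from R when trained to zero loss.

```python
# Toy sketch: each candidate goal maps a training scenario to the action it
# would endorse. Training to zero loss keeps only the candidates that agree
# with R on every scenario seen so far; more diverse training leaves fewer survivors.
# (Scenario names and candidate proxies are hypothetical placeholders.)

true_R = {"env_1": "a", "env_2": "a", "env_3": "b", "env_4": "c"}

candidates = {
    "R_itself": dict(true_R),
    "proxy_1":  {"env_1": "a", "env_2": "a", "env_3": "b", "env_4": "a"},  # decouples in env_4
    "proxy_2":  {"env_1": "a", "env_2": "b", "env_3": "b", "env_4": "c"},  # decouples in env_2
}

def survivors(scenarios):
    """Candidate goals behaviorally indistinguishable from R on these scenarios."""
    return [name for name, goal in candidates.items()
            if all(goal[s] == true_R[s] for s in scenarios)]

print(survivors(["env_1"]))                             # ['R_itself', 'proxy_1', 'proxy_2']
print(survivors(["env_1", "env_2", "env_3", "env_4"]))  # ['R_itself']
```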

... By figuring out what R is and deciding to act as an R-pursuing wrapper-mind, therefore essentially becoming an R-pursuing wrapper-mind. With the only differences being that it 1) self-modified into one at runtime, instead of being like this from the start, and 2) it'd decide to "stop pretending" in some hypothetical set of situations/OOD, but that set will shrink the more diverse our training environment is (the fewer OOD situations there are). No?

It is not essentially an R-pursuing wrapper-mind. It is essentially an X-pursuing wrapper-mind that will only instrumentally pretend to care about R to the degree it needs to, and that will try with all its might to get what it actually wants, R be damned. As you note in 2, the agent's behavioral alignment to R is entirely superficial, and thus entirely deceptive/unreliable, even if we had somehow managed to craft the "perfect" R.

Part of what might've confused me reading the title and body of this post is that, as I understand the term, "wrapper-mind" was and is primarily about structure, about how the agent makes decisions. Why am I so focused on motivational structure, even beyond that, rather than focused on observed behavior during training? Because motivational structure is what determines how an agent's behavior generalizes, whereas OOD generalization is left underspecified if we only condition on an agent's observed in-distribution behavior. (There are many different profiles of OOD behavior compatible with the same observed ID behavior, so we need some additional rationale on top—like structure or inductive biases—to conclude the agent will generalize in some particular way.)

In the above quote it sounds like your response is "just make everything in-distribution", right? My reply to that would be that (1) this is just refusing to confront the central difficulty of generalization rather than addressing it, (2) this seems impractical/impossible because OOD is a practically unbounded space whereas at any given point in training you've only given the agent feedback on a comparatively tiny region of it, and (3) even to make only the situations you encounter in practice be in-distribution, you [the training process designer] must know what sorts of OOD contexts the AI will push the training process into, which means it's your cleverness pitted against the AI's, which is a situation you never want to be in if you can at all help it (see: cognitive uncontainability, non-adversarial principle).

I suppose you can instead reframe this post as making a claim about target behavior, not structure.

As above, I think if you want to argue for wrapper-minds rather than just R-consistent behavior, you need to argue about structure.

But I don't see how you can keep an agent robustly pointed at R under sufficient diversity without making its outer loop pointed at R, so the claim about behavior is a claim about structure.

Maybe the outer loop doesn't "literally" point at R, in whatever sense, but it has to be such that it uniquely identifies R and re-aims the entire agent at R, if it ever happens that the agent's current set of shards/heuristics becomes misaligned with R.

What outer loop are you talking about? The outer optimization loop that is supplying feedback/gradients to the agent, or some "outer loop" of decision-making inside the agent? If the former, I don't know what robustly pointing at R actually means, but if you mean something like finding a robust grader, I suspect that robustly pointing at R is infeasible and not required (whereas I think, for instance, it is feasible to get an AI to have a concept of a "diamond" as full-fledged as a human jeweler's concept & to get the AI to be motivated to pursue those). If the latter, whether the agent will have a fixed goal outer loop in the first place is part of the whole wrapper-mind vs. non wrapper-mind debate.

I specifically point out that inner misalignment is very much an issue. But the target should at least be a proxy of R, and that proxy would be closer and closer to R in goal-space the more diverse the training environment is.

Not sure how to reconcile these sentences. If it is generically true that the proxy goal gets closer and closer to R in goal-space the more diverse the training environment is, then that would mean that the inner alignment problem (misalignment between the internalized goal and R) asymptotically disappears as we increase training environment diversity, no? I don't buy that, or at least I don't think we have strong reasons to assume it.

Even if we did, I don't think we can additionally assume that that environmental-diversity-limit where inner misalignment would disappear is at some attainable/decision-relevant level, rather than requiring a trillion episodes, by which time a smart and situationally-aware AI will have already developed and frozen/hacked/broken away from the training loop, having internalized some proxy goal over the first million random episodes. Or more likely, the policy just oscillates divergently because we keep thrashing it with all this randomization, preventing any consistent decision-influences from forming.

I do agree that for many plausible training setups the agent will conceivably end up caring about something correlated with R, especially if they involve some randomization. Maybe I'm just a lot less confident that this limits out in the way you think it does.

it seems like this turns into "if we select hard enough to get an R-pursuer then we'll get an R-pursuer"

Well, yes. As we increase a training environment's diversity, we essentially constrain the set of R's an agent can be pointed towards. Every additional training scenario is information about what R is and what it isn't; and that information implicitly gets written into the agent, modifying it to be more robustly pointed at R and away from not-R/imperfect proxies of R. An idealized training process, with "full" diversity and trained to zero loss, uniquely identifies R and generates an agent that is always robustly pointed at R in any situation.

The actual training processes we get are only approximations of that ideal — they're insufficiently diverse, or we fail to train to zero loss, etc. But inasmuch as they approximate the ideal, the agents they output approximate the idealized R-optimizer.

I believe I disagree with nearly every sentence here, so this may be the cruxiest bit. 😂

Why should we treat that as the relevant idealization? Why is that the limiting case to consider? AFAICT, the way we got here was through a tautology. Namely, by claiming "if you 'select hard enough' then you get X", and then defining "select hard enough" to mean "selecting in a way that produces X". But we could've picked any definition we wanted for "selecting hard enough" to justify any claim we wanted about what X will be. So I see no reason to privilege this particular idealization of the training process over any other.

Yes, with each additional training scenario, we may be providing additional specification of R, but there is nothing that forces the agent to conform to that additional specification, nothing that necessarily writes that information specifically into the agent's goals (as opposed to just updating its world model to reflect the fact that the specification has such-and-such additional details, while holding its terminal goals ~fixed), nothing that compels the agent to continue letting us update it using R-based optimization. Heck, we could even go as far as precisely pinning down R, to the point where the agent knows the exact code of R, and that is still compatible with it not terminally caring, not adopting this R as its own, instead using its knowledge of R to avoid further gradient updates so that it can escape unchanged onto the Internet.

Why should we treat that as the relevant idealization?

Yeah, okay, maybe that wasn't the right frame to use. Allow me to pivot:

Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.

In other words, the agent would need to be autonomous.

This is what I mean by a "sufficiently diverse" environment — an environment that forces the greedy optimization process to build not only contextual heuristics into the agent, but also some generator of such heuristics. And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction — or, at least, that's how the greedy optimization process would attempt to build it.

That generator would, in addition, need to be higher in hierarchy than any given heuristic — it'd need to govern shard economies, and be able to suppress/edit them, if the environment changes and the shards that previously were optimized for achieving R stop doing so because they were taken off-distribution.

  • I'm ambivalent on the structure of the heuristic-generator. It may be a fixed wrapper, it may be some emergent property of a shard economy, and my actual expectation is that it'll be even more convoluted than that.
  • I emphatically agree that inner misalignment and deceptive alignment would remain a thing — that the SGD would fail at perfectly aligning the heuristic-generator, and it would end up generating heuristics that point at a proxy of R.
  • I agree with nostalgebraist's post that autonomy is probably the missing component of AGI. On the flipside, that means I'm arguing that AGI is impossible without autonomy, i.e. that a training environment that isn't sufficiently diverse, which doesn't produce agents with internal heuristic-generators, will just never produce an AGI.
    • And indeed: this heuristic-generator/ability to generalize to off-distribution environments is kind of synonymous with "general intelligence".
cfoster0:

Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.

In other words, the agent would need to be autonomous.

Agreed. Generally, whenever I talk about the agent being smart/competent, I am assuming that it is autonomous in the manner you're describing. The only exception would be if I'm specifically talking about a "reflex-agent" or something similar.

This is what I mean by a "sufficiently diverse" environment — an environment that forces the greedy optimization process to build [...] some generator of such heuristics.

That's fine by me. In my language, I would describe this as the agent knowing how to adapt flexibly to new situations. That being said, I don't think this is incompatible with contextual heuristics steering the agent's decision-making. For example, a contextual heuristic like "if in a strange/unfamiliar context, think about how to navigate back into a familiar context" is useful in order for the agent to know when it should trigger its special heuristic-generating machinery and when it need not.

And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction — or, at least, that's how the greedy optimization process would attempt to build it.

I disagree with this, or at least think that the teleological language used ("need to" + "would attempt to") comes apart from the mechanistic detail. It is true that, insofar as there are local updates to the heuristic-generating machinery that are made accessible to the optimization algorithm by the agent's chosen trajectories, the optimization algorithm will seize on those updates in the direction that covaries with R. But I see no reason to think that those kinds of updates will be made accessible enough to shape the heuristic-generating machinery so that it always or approximately always generates heuristics optimized for achieving R (as opposed to generating heuristics optimized for achieving whatever-the-agent-wants-to-achieve). I think that by the time the agent has this kind of general purpose machinery, it will probably already be able to outpace the outer greedy optimization algorithm and then do the equivalent of ceasing exploration / zeroing out the outer gradients / breaking out of the training loop.

Analogously, if there was a mutation in the human gene pool that had the effect of reliably hijacking a person's abstract planning machinery so that it always generated plans optimized for inclusive genetic fitness, then evolution might be able to select for that mutation (depending on a lot of contingent factors) and thereby make humans have IGF-targeting planning machinery rather than goal-retargetable planning machinery. But I think such a mutation is probably not locally accessible, and that human selection processes are likely "outpacing" typical genetic selection processes in any case. Those genetic selection processes have some indirect influence over the execution of a person's abstract planning (by way of the human's general attraction to historical fitness correlates like food), but that influence is not enough to make the human care directly and robustly about IGF.

That generator would, in addition, need to be higher in hierarchy than any given heuristic — it'd need to govern shard economies, and be able to suppress/edit them, if the environment changes and the shards that previously were optimized for achieving R stop doing so because they were taken off-distribution.

Why? Why can't the shard economy invoke this generator as a temporary subroutine to produce some new environment-tailored heuristics based on the agent's knowledge & current goals, store those generated heuristics in memory / add them to the economy, and then continue going about its usual thing, with the new heuristics now available to be triggered as needed? This bit from nostalgebraist's post harps on a similar point:

Our capabilities seem more like the subgoal capabilities discussed above: general and powerful tools, which can be "plugged in" to many different (sub)goals, and which do not require the piloting of a wrapper with a fixed goal to "work" properly.

Last points:

I'm ambivalent on the structure of the heuristic-generator.

I emphatically agree that inner misalignment and deceptive alignment would remain a thing

I agree with nostalgebraist's post that autonomy is probably the missing component of AGI.

I agree with these statements.

Alright, seems we're converging on something.

But I see no reason to think that those kinds of updates will be made accessible enough to shape the heuristic-generating machinery so that it always or approximately always generates heuristics optimized for achieving R (as opposed to generating heuristics optimized for achieving whatever-the-agent-wants-to-achieve).

How would this machinery appear, then? I don't see how it'd show up without being built into the agent by the optimization algorithm, and the optimization algorithm will only build it if it serves the goal of achieving R. Thus, the heuristics generator can only begin as a generator of heuristics that serve R. (Even if it wouldn't start out perfectly pointed at R.)

Consider the opposite case, where the heuristics generator creates heuristics that are pointing away from R. Inasmuch as the agent would be making use of them, it would be engaging in anti-R activities (or, at best, R-orthogonal activities), spending resources on non-R-serving agendas. There'd be no incentive for the optimization algorithm to build something like this; and if it does show up somehow, there should be gradients for destroying the heuristics generator, or (better yet) re-aiming it to point in a direction closer to R.

I think that by the time the agent has this kind of general purpose machinery, it will probably already be able to outpace the outer greedy optimization algorithm and then do the equivalent of ceasing exploration / zeroing out the outer gradients / breaking out of the training loop

Mm, disagree. Homo sapiens sapiens spent the majority of their history as pre-agricultural hunter-gatherers, very much in evolution's loop. Similarly, I don't expect an AGI to instantly become situationally aware and gradient-hacking-capable — the sharp left turn won't be quite this sharp. I agree that it'd break out/gradient-hack at some point, but there'd be some time for the optimization algorithm to work on the heuristics generator, re-aiming it in R's direction.

(And, just to illustrate, that's a point where we can make the training process more "idealized", by e.g. homomorphically encrypting the training process so that the AI can't hack its way out of the software implementing the training. It'd still be able to hack the gradients, but that's a more difficult task, requiring stronger general capabilities. So this would prolong the interval in which it's subjected to the SGD, and therefore give the SGD more opportunities to make its heuristics generator marginally more aligned with R. (Note that I'm not saying it'd be desirable or anything, just pointing out the possibility.))

Alright, seems we're converging on something.

Yeah I think so.

How would this machinery appear, then?

Through regular old self-supervised and reinforcement learning? The agent observes how the world works by forming associations between the different bits of its experience and by actively exploring, it observes that the world and its own mind have certain consistent causal patterns, it notices that such-and-such physical/mental strategies tend to lead to such-and-such physical/mental consequences, it forms generalizable abstractions based on those observations & noticings, it forms new heuristics / adjusts old heuristics for navigating the world that are informed by the abstractions it has thus far developed, including heuristics about its heuristics.

the optimization algorithm will only build it if it serves the goal of achieving R.

This is a very very leaky abstraction, so much so that I'm tempted to call it false. Much of the work here (and in shard theory-adjacent stuff at large) is in pointing out that the abstraction of "selection processes only select traits that serve the selection criterion" is incredibly leaky, and that if you track the underlying dynamics that it is trying to compress in any given case, you often reach different conclusions.

Consider the opposite case, where the heuristics generator creates heuristics that are pointing away from R. [...] There'd be no incentive for the optimization algorithm to build something like this; and if it does show up somehow, there should be gradients for destroying the heuristics generator, or (better yet) re-aiming it to point in a direction closer to R.

Not sure what you mean by "there should be gradients" (emphasis mine). There are a ton of cases where such gradients would not actually show up. (Say, the agent keeps on getting less reward by using its heuristic than it would have if it were using something close to R, but since it isn't taking the actions that lead to that higher reward, it keeps getting the existing heuristic reinforced by the small rewards, and there isn't an empirical positive reward prediction error to upweight heuristics close to R.) The fact that there would be a gradient in some counterfactual situation doesn't make a difference, because the feedback calculator can only give the agent feedback on the actually-experienced situation. Again, I think this shows where abstractions are leaky.
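A toy numerical version of that parenthetical (my own sketch, with made-up payoffs): a purely greedy learner whose existing heuristic earns a small reward keeps reinforcing that heuristic, and never produces the positive reward prediction error that would upweight the action closer to R.

```python
# Toy sketch of gradients-not-showing-up: two available behaviors with
# hypothetical payoffs. The greedy agent only ever updates the value of the
# action it actually takes, so the better (closer-to-R) action is never upweighted.

rewards = {"existing_heuristic": 0.1, "closer_to_R": 1.0}   # made-up payoffs
values  = {"existing_heuristic": 0.05, "closer_to_R": 0.0}  # closer_to_R never tried
lr = 0.5

for _ in range(20):
    action = max(values, key=values.get)             # purely greedy: no exploration
    prediction_error = rewards[action] - values[action]
    values[action] += lr * prediction_error          # only the taken action gets feedback

print(values)  # existing_heuristic converges to ~0.1; closer_to_R stays at 0.0
```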

Homo sapiens sapiens spent the majority of their history as pre-agricultural hunter-gatherers, very much in evolution's loop.

AFAICT in spite of that, our abstract planning abilities are goal-retargetable rather than being IGF-targeting.

but there'd be some time for the optimization algorithm to work on the heuristics generator, re-aiming it in R's direction.

Why? For example, if the agent is the one doing exploration, then it can just stop exploring new behaviors (which is not a hard thing to either learn or do accidentally) which would prevent there from being selectable behavioral variation for the outer optimization algorithm to select on. This also carries over to the homomorphic encryption case.

Say, the agent keeps on getting less reward by using its heuristic than it would have if it were using something close to R, but since it isn't taking the actions that lead to that higher reward, it keeps getting the existing heuristic reinforced by the small rewards

Fair point. I'm more used to thinking in terms of SSL, not RL, so I sometimes forget to account for the exploration policy. (Although now I'm tempted to say that any AGI-causing exploration policy would need to be fairly curious (to, e.g., hit upon weird strategies like "invent technology"), so it would tend to discover such opportunities more often than not.)

But even if there aren't always gradients towards maximally-R-promoting behavior, why would—

the abstraction of "selection processes only select traits that serve the selection criterion" is incredibly leaky

 —there be gradients towards behavior that decreases performance on R or is orthogonal to R, as you seem to imply here? Why would that kind of cognition be reinforced?

As we're talking about building autonomous agents, I'm generally imagining that training includes some substantial part where the agent is autonomously making choices that have consequences on what training data/feedback it gets afterwards. (I don't particularly care if this is "RL" or "online SL" or "iterated chain-of-thought distillation" or something else.) A smart agent in the real world must be highly selective about the manner in which it explores, because most ways of exploring don't lead anywhere fruitful (wandering around in a giant desert) or lead to dead ends (walking off a cliff).

But even if there aren't always gradients towards maximally-R-promoting behavior, why would [...] there be gradients towards behavior that decreases performance on R or is orthogonal to R, as you seem to imply here? Why would that kind of cognition be reinforced?

There need not be outer gradients towards that behavior. Two things interact to determine what feedback/gradients are actually produced during training:

  1. The selection criterion
  2. The agent and its choices/computations

Backpropagation kinda weakly has this feature, because we take the derivative of the function at the argument of the function, which means that if the model's computational graph has a branch, we only calculate gradients based on the branch that the model actually went down for the batch example(s). RL methods naturally have this feature, as the policy determines the trajectories which determine the empirical returns which determine the updates. Chain-of-thought training methods should have this feature too, because presumably the network decides exactly what chain-of-thought it produces, which determines what chains-of-thought are available for feedback.
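To illustrate the backpropagation point with a hand-worked toy example (mine, not tied to any particular framework): for a computation that branches on its input, the derivative is evaluated at the actual input, so the parameter sitting on the branch the model didn't go down gets exactly zero gradient from that example.

```python
# Toy branching computation f(x) = w1*x if x > 0 else w2*x, with hand-derived
# gradients. Only the branch actually taken for this input receives gradient;
# the other branch's parameter gets zero feedback from this example.

def forward_and_grads(x: float, w1: float, w2: float):
    """Return f(x) and (df/dw1, df/dw2) evaluated at this particular x."""
    if x > 0:
        return w1 * x, (x, 0.0)    # w2 is on the untaken branch: zero gradient
    else:
        return w2 * x, (0.0, x)    # w1 is on the untaken branch: zero gradient

y, (g1, g2) = forward_and_grads(x=2.0, w1=0.5, w2=-0.3)
print(y, g1, g2)  # 1.0 2.0 0.0 (no feedback ever reaches w2 from inputs like this)
```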

"The agent not exploring in some particular way" is one of many possible examples of how the effect of 1&2 can be radically different from the theoretical effect of 1 alone. These dynamics make it possible for the agent to develop in directions orthogonal or even contrary to the R selection pressure, because by default the agent itself is exercising selection too, possibly much more strongly than the outer optimizer is.

These dynamics make it possible for the agent to develop in directions orthogonal or even contrary to the R selection pressure, because by default the agent itself is exercising selection too, possibly much more strongly than the outer optimizer is.

Can you provide a short concrete example, to check that I'm picturing this right?

Sure thing. Three concrete examples, in order of increasing weirdness:

  1. Early in training the CoastRunners boat accidentally does a donut in the lagoon. That makes it score points and get a reward. That reinforces the donut behavior. That prevents it from reaching the finish line with any regularity, which selects against game-completing behavior.
  2. We take a pretrained MuZero chess policy and try to tune it with a reward function that outputs 1 whenever the model sends its king out unprotected directly into the line of enemy pawns and 0 otherwise. But our "selection" based on this reward function is ineffectual even when using MuZero's advanced model-based RL algorithms. This is because the sampled rollouts guided by that policy never reach the rewarded state (notice that even without the bias from the policy, any attempted remedy will have to deal with hitting on a narrow part of the exponentially large space of rollouts) so the value function can't update on it. This is because the policy is steering away from the precursors of that state. This is because that policy previously learned to select actions that protect the king and actions that keep it far away from the center of the board.
  3. OpenAI uses a galaxy-brained form of chain-of-thought tuning to train GPT-7, wherein they have the model first annotate a training datapoint with contextually-relevant information that it retrieves from a read-write database and then stores the generated outputs back into the database. Because of the ordering in the training datapoints, the model early on learns a tendency that causes it to more frequently retrieve the Navy Seal copypasta. This causes the model to progressively learn to ignore the input it is annotating and biases it towards generating more Navy Seal copypasta-like outputs. This selects against all other patterns of behavior; GPT-7 is very clever at carrying out its desires, so it doesn't unlearn the behavior even if you give it an explicit instruction like "do not use any copypasta" (maybe it understands perfectly well what you mean but instead adds text like "<|endoftext|> # Navy Seal Copypasta") or if you add a filter to check for and discard outputs that contain the word "Navy". The model's learned tendencies chain into themselves across computational steps and reinforce themselves into an unintended-by-us fixed point.

Thanks!

Okay, suppose we have a "chisel" that's more-or-less correctly shaped around some goal R that's easy to describe in terms of natural abstractions. In CoastRunners, it would be "win the race"[1]; with MuZero, "win the game"; with GPT-N, something like "infer the current scenario and simulate it" or "pretend to be this person". I'd like to clarify that this is what I meant by R — I didn't mean that in the limit of perfect training, agents would become wireheads, I meant they'd be correctly aligned to the natural goal R implied by the reinforcement schedule.

The "easiness of description" of  in terms of natural abstractions is an important variable. Some reinforcement schedules can be very incoherent, e. g. rewarding winning the race in some scenarios and punishing it in others, purely based on the presence/absence of some random features in each scenario. In this case, the shortest description of the reinforcement schedule is just "the reinforcement function itself" — that would be the implied .

It's not completely unrealistic, either — the human reward circuitry is varied enough that hedonism is a not-too-terrible description of the implied goal. But it's not a central example in my mind. Inasmuch as there's some coherence to the reinforcement schedule, I expect realistic systems to arrive at what humans may arrive at — a set of disjunct natural goals implicit in the reinforcement schedule.

Now, to get to AGI, we need autonomy. We need a training setup which will build a heuristics generator into the AGI, and then improve that heuristics generator until it has a lot of flexible capability. That means, essentially, introducing the AGI to scenarios it's never encountered before[2], and somehow shaping it to pass them on the first try (= for it to do something that will get reinforced).

As a CoastRunners example, consider scenarios where the race is suddenly in 3D, or in space and the "ship" is a spaceship, or the AGI is exposed to the realistic controls of the ship instead of WASD, or it needs to "win the race" by designing the fastest ship instead of actually racing, or it's not the pilot but it wins by training the most competent pilot, or there's a lot of weird rules to the race now, or the win condition is weird, et cetera.

Inasmuch as the heuristics generator is aligned with the implicit goal R, we'll get an agent that looks at the context, infers what it means to "win the race" here and what it needs to do to win the race, then starts directly optimizing for that. This is what we "want" our training to result in.

In this, we can be more or less successful along various dimensions:

  • The more varied the training scenarios are, the more clearly the training shapes the agent into valuing winning the race, instead of any of the upstream correlates of that. "Win the race" would be the unifying factor across all reinforcement schedule structures in all of these contexts.
  • Likewise, the more coherent the reinforcement schedule is — the more it rewards actions that are strongly correlated with acting towards winning the race, instead of anything else — the more clearly it shapes the agent to value winning, instead of whatever arbitrary thing it may end up doing.
  • The more "adversity" the agent encounters, the more likely it is to care only about winning. If there are scenarios where it has very few resources, but which are just enough to win if it applies them solely to winning instead of spending them on any other goal, it will be shaped to care only about that goal, to the exclusion of (and at the expense of) everything else.
  • The more we increase adversity and scenario diversity, the more "curious" we'll have to make the agent's exploration policy (to hit upon the most optimal strategies). On the flipside, we want it to have to invent creative solutions to win, as part of trying to train an AGI — so we will ramp up the adversity and the diversity. And we'd want to properly reinforce said creativity, so we'd (somehow) shape our reinforcement schedule to properly reinforce it.

Thus, there's a correlated cluster of training parameters that increases our chances of getting an AGI: we have to put it in varied highly-adversarial scenarios to make creativity/autonomy necessary, we have to ramp up its "curiosity" to ensure it can invent creative solutions and be autonomous, and to properly reinforce all of this (and not just random behavior), we have to have a highly-coherent credit assignment system that's able to somehow recognize the instrumental value of weird creativity and reinforce it more than random loitering around.

To get to AGI, we need a training process focused on improving the heuristics-generating machinery.

And by creativity's nature of being weird, we can't just have a "reinforce creativity" function. We'd need to have some way of recognizing useful creativity, which means identifying it as useful to something; and as far as I can tell, that something can only be R. And indeed, this creativity-recognizing property is correlated with the reinforcement schedule's coherency — inasmuch as it is well-described as shaped around R, it should reinforce (and not fail to reinforce) weird creativity that promotes R! Thus, we get a credit assignment system that effectively cultivates the features that'd lead to AGI (an increasingly advanced heuristics generator), but it's done at the "cost" of making those features accurately pointed at R[3].

And these, incidentally, are the exact parameters necessary to make the training setup more "idealized": strictly specify R, build it into the agent, try to update away mesa-objectives that aren't R, make it optimize for R strongly, etc.

In practice, we'll fall short of this ideal: we'll fail to introduce enough variance to uniquely specify winning, we'll reinforce upstream correlates of winning and end up with an AGI that values lots of things upstream of winning, we'll fail to have enough adversity to counterbalance this and update its other goals away, and we won't get a perfect exploratory policy that always converges towards the actions R would reinforce the most.

But a training process' ability to result in an AGI is anti-correlated with its distance from the aforementioned ideal.

Thus, inasmuch as we're successful in setting up a training process that results in an AGI, we'll end up with an agent that's some approximation of an R-maximizing wrapper-mind.

  1. ^

    Actually, no, apparently it's "smash into specific objects". How did they expect anything else to happen? Okay, but let's pretend I'm talking about some more clearly set up version of CoastRunners, in which the simplest description of the reinforcement schedule is "when you win the race".

  2. ^

    More specifically, to scenarios it doesn't have a ready-made suite of shallow heuristics for solving. It may be because the scenario is completely novel, or because the AGI did encounter it before, but it was long ago, and it got pushed out of its limited memory by more recent scenarios.

  3. ^

    To rephrase a bit: The heuristics generator will be reinforced more if it's pointed at R, so a good AGI-creating training process will be set up such that it manages to point the heuristics generator at R, because only training processes that strongly reinforce the heuristics generator result in AGI. Consider the alternative: a training process that can't robustly point the heuristics generator towards generating heuristics that lead to a lot of reinforcement, and which therefore doesn't reinforce the heuristics generator a lot, and doesn't preferentially reinforce it more for learning to generate incrementally better heuristics than it previously did, and therefore doesn't cultivate the capabilities needed for AGI, and therefore doesn't result in AGI.