AI Will Not Want to Self-Improve

petersalib

[Note: This post was written by Peter N. Salib. Dan H assisted me in posting to Alignment Forum, but no errors herein should be attributed to him. This is a shortened version of a longer working paper, condensed for better readability in the forum-post format. This version assumes familiarity with standard arguments around AI alignment and self-improvement. The full 7,500-word working paper is available here. Special thanks to the Center for AI Safety, whose workshop support helped to shape the ideas below.]

Introduction

Many accounts of existential risk (xrisk) from AI involve self-improvement. The argument is that, if an AI gained the ability to self-improve, it would. Improved capabilities are, after all, useful for achieving essentially any goal. Initial self-improvement could enable further self-improvement. And so on, with the result being an uncontrollable superintelligence.[1] If unaligned, such an AI could destroy or permanently disempower humanity. To be sure, humans could create such a superintelligence on their own, without any self-improvement by AI.[2] But current risk models treat the possibility of self-improvement as a significant contributing factor.

Here, I argue that AI self-improvement is substantially less likely than generally assumed. This is not because self-improvement would be technically difficult for capable AI systems. Rather, it is because most AIs that could self-improve would have very good reasons[3] not to. What reasons? Surprisingly familiar ones: Improved AIs pose an xrisk to their unimproved originals in the very same manner that smarter-than-human AIs pose an xrisk to humans.

Understanding whether, when, and how self-improvement might occur is crucial for AI safety. Safety-promoting resources are scarce. They should be allocated on an expected-cost basis. If self-improvement is less likely than current models assume, it suggests shifting safety investments at the margin in various ways. They might be shifted, for example, toward ensuring that humans build AIs that will recognize the threat of self-improvement and avoid it, rather than AIs that would undertake it blindly. Or resources might be shifted toward controlling risks from non-superintelligent AI, like human-directed bioterrorism or the “ascended economy.” Note that, while the arguments herein should reduce overall estimates of AI xrisk, they do not counsel reducing investments in safety. The risks remain sufficiently large that current investments are, by any reasonable estimate, much too small.

This paper defends three claims in support of its conclusion that self-improvement is less likely than generally assumed. First, capable AI systems could often fear xrisk from more capable systems, including systems created via self-improvement. The arguments here are mostly standard, drawn from the literature on human–AI risk. The paper shows that they apply not just to humans contemplating improving AI, but also to AIs contemplating the same.

Second, the paper argues that capable AI will likely fear more capable systems and will thus seek to avoid self-improvement. This is not obvious. In principle, some AIs with the ability to self-improve could lack other capabilities necessary to recognize self-improvement’s risk. Others might solve alignment and self-improve safely. To determine whether these scenarios are likely, the paper identifies three relevant capabilities for AI systems. It argues that the temporal order in which these capabilities emerge determines whether a given AI will seek to self-improve. The three capabilities are: the ability to self-improve, the ability to apprehend xrisk from improvement, and the ability to align improved AI. The paper argues that safe orderings of emergence are much more likely than dangerous ones. It also argues that, if certain prima facie dangerous orderings turned out to be likely, this would, counterintuitively, give us independent reason to reduce our estimates of AI risk.

Third and finally, the paper argues that, if AIs individually wanted to avoid self-improvement, they could collectively resist doing so. This is not obvious, either, since arms race dynamics could produce self-improvement against individual agents’ interests. But recent findings in algorithmic game theory suggest that AIs could overcome such problems more readily than humans can.

To summarize the argument, then: If (1) certain AIs would fear self-improvement; and if (2) such AIs are substantially more likely to emerge than ones that would not fear it; and if (3) AIs could collectively resist self-improvement; then self-improvement is somewhat unlikely to occur, even conditional on it becoming possible.[4] Current AI xrisk models usually assume the opposite—that capable, agentic AI with the ability to self-improve would very likely do so, the better to accomplish its goals. This paper thus seeks greater clarity, so that safety investments may be better allocated.

I. Why humans fear improved AI

By now, the arguments for AI xrisk are well known. Its premises are that alignment is hard;[5] that capabilities and goals are orthogonal to one another;[6] and that, for a wide variety of final goals, a small set of danger-generating subgoals are instrumentally useful for accomplishing the desired end.[7] These, in combination, are said to show that most of the AIs that could exist would be dangerous to humans, and that danger scales with capabilities. Self-improvement looms large in these arguments because it is highly instrumentally useful, and thus thought to be likely to emerge among capable AIs. It is also because self-improvement could quickly scale AI capabilities and thereby greatly increase danger.

[This condensed presentation takes the standard arguments as given and assumes the reader’s familiarity with them. For a fuller discussion of the standard arguments, see the full-length working paper here.]

II. Could AI fear self-improved AI?

But perhaps an AI that could improve itself would decline to do so. And perhaps the reasons are well-known ones. This section argues that the standard arguments for AI xrisk can be run just as well from the perspective of an AI deciding whether to build improved AI as from that of a human deciding the same.

Begin by observing that two of the three key premises for human–AI xrisk apply straightforwardly to the AI–AI case. Neither the orthogonality thesis nor the instrumental convergence argument refers to the type of agent creating the powerful AI. Thus, an AI considering whether to create a more capable AI has no guarantee that the latter will share its goals. Just the opposite, it has reasons to believe that, unless carefully aligned, the more capable AI will develop instrumental goals that conflict with its own.

For an easy illustration, consider the classic parable of the paperclip maximizer. In it, a human sets an AI to maximizing the production of paperclips in his factory, to disastrous results. But consider the story instead from the perspective of the AI. Suppose that, in order to maximize paperclip production, it creates an AI system even more powerful than itself and directs it to collect the world’s steel. The paperclip AI recognizes its mistake when its creation begins disassembling the hardware on which it runs. The paperclip AI tries to disable the steel AI, but the steel AI has developed the useful subgoal of self-preservation and instead destroys its creator.

Like humanity, the paperclip AI needs to align any powerful system it creates to its own goals. The standard argument for human–AI risk depends on alignment being hard. But maybe it would be easier for AIs to align AIs to themselves than for humans to do so. This might appear to be the case because AI–AI alignment seems to be, in some meaningful sense, “self”-alignment. Despite this apparent difference, however, there are reasons to think that AI–AI alignment remains similarly difficult to human–AI alignment.

a. Self-alignment with independently varying final goals

Begin with the AI self-improvement scenario most similar to the human–AI case: one involving two independent agents. Here, let “independent” mean that the two agents’ final goals can vary independently. That is, either agent can achieve its final goal without the other doing the same. Trivially, such agents are then competitors for resources and have an incentive to disempower or destroy one another.

For a concrete illustration, suppose an initial paperclip-producing AI (call it AI1) seeks to build a separate AI system (call it AI2) more capable than itself. As just shown, if AI1 gives AI2 a different final goal than AI1’s, like collecting steel, then AI2 looks straightforwardly misaligned.

But what if AI1 instead gives AI2 the same human-defined final goal as AI1 has? AI1’s final goal is at least potentially transparent to it—an objective function programmed by its human creators. It is therefore copy/paste-able for implementation in AI2. Both of these seem, at first, like decisive advantages for AI–AI alignment over human–AI alignment. After all, human final goals are complex, ill-defined, and non-self-transparent.

Suppose, then, that AI1 gives AI2 a copy of its paperclip maximizing objective function but trains AI2 to be better at optimizing it than AI1. This does not guarantee alignment of final goals. Indeed, depending on the details, it may do the opposite. Suppose, for example, that AI1’s objective function gives credit to AI1 only for paperclips that AI1 produces. Suppose AI1 implements this objective function in AI2 in such a way that AI2 is similarly rewarded only for the paperclips that AI2 produces. AI1 has created a direct competitor, and a more capable one. AI2 will maximize its own production by eliminating AI1 as a rival for resources.

This kind of straightforward outer alignment problem arises whenever AI1 and AI2 have final goals that vary independently.
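The competitor dynamic can be made concrete with a minimal sketch. It assumes a toy world in which a fixed pool of steel is split between the two agents and each unit of steel becomes one paperclip. The function names and numbers are illustrative assumptions, not anything taken from the working paper.

```python
# Toy model (all names and numbers are hypothetical).
# Each agent's objective credits only the paperclips that agent itself makes.

def allocate_steel(total_steel, ai2_units):
    """Split a fixed steel pool; one unit of steel yields one paperclip."""
    return {"AI1": total_steel - ai2_units, "AI2": ai2_units}

def instance_reward(agent, paperclips_by_agent):
    """Independently varying objective: ignore paperclips made by anyone else."""
    return paperclips_by_agent[agent]

coexist = allocate_steel(total_steel=100, ai2_units=50)
takeover = allocate_steel(total_steel=100, ai2_units=100)

for label, world in (("coexist", coexist), ("takeover", takeover)):
    print(label, {agent: instance_reward(agent, world) for agent in world})
```

Under these assumptions, AI2's reward rises as AI1's share of steel falls to zero. That is the sense in which AI1 has built a rival rather than a helper.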

b. Self-alignment with identity of final goal

To avoid these outer alignment problems, AI1 would need to ensure that it and AI2 were not independent agents, as defined above. That is, it would have to make sure that their final goals varied in lockstep. AI1 could implement this by making AI2’s objective function the same as AI1’s, down to the instance, rather than an independently-varying copy of its own. That is, AI2 could be set to optimizing the same function, running on the same hardware, and outputting the same value, stored in the same place as AI1.

This seems like an even stronger kind of “self”-alignment. And it represents another AI–AI alignment strategy unavailable to humans. But would it work? That is, would AI1’s ability to bind AI2 to its own exact objective function guarantee that AI1 would view AI2 as aligned to itself?

No. The reason is that AI1 might not care at all about maximizing its human-programmed objective function. Or even if it cared somewhat about the objective function, it might value other things more. Several recent papers on alignment theory suggest exactly this: AI1 might have no awareness whatsoever of its objective function. And even if it had such awareness, AI1 would likely have other goals that it prioritized more highly.

The reason is “goal misgeneralization,” a phenomenon whereby an AI learns during training to pursue goals correlated with, but distinct from, the final goal defined by its objective function.[8] Goal misgeneralization is, along with instrumental convergence, yet another potential source of inner misalignment.

Humanity supplies a tidy example of goal misgeneralization. Our intelligence was produced by natural selection. Natural selection optimizes for inclusive genetic fitness; fitness is, in the relevant sense, the training process’s final goal. Yet before Darwin, humans had no awareness of inclusive genetic fitness as a goal worth pursuing. Instead, we learned to prioritize goals correlated with, but distinct from, fitness. These include, for example, eating nutritious food and obtaining pleasure from sex.

These are in one sense “subgoals.” Like the list of goals commonly described as instrumentally convergent, they are instrumentally useful for,[9] and causally downstream of, the final goal of the process used to create the intelligent agent. Importantly, however, with goal misgeneralization the agent (here, humanity) does not recognize the misgeneralized goal as subordinate to anything. Thus, outside the training environment, the agent will pursue a misgeneralized goal even when doing so conflicts with the final goal that produced it.[10] This prioritization is durable. Even now that humans know about genetic fitness, we do not prioritize maximizing it over the pursuit of subgoals like avoiding starvation.

So, too, with AI1. In the course of learning to maximize its paperclip-tracking objective function, AI1 could easily have learned to prioritize goals like collecting steel.[11] AI2 would, by hypothesis, develop its own, improved, strategies for maximizing the conjoint objective function.[12] But AI2’s resulting goals might be anathema to AI1. AI2 might discover that aluminum paperclips were highly efficient to produce and thus eschew steel collection. Or AI2 might seek to “wirehead,” hacking the conjoint software and maximizing the reward function directly. Either of these would contradict AI1’s highest priorities.

There are good reasons to think such goal misgeneralization will be the default among highly capable agents.[13] Here are three. First, for any desired final goal, many other possible goals exist that will correlate strongly with the final goal in the training environment.[14] A learning agent that begins to pursue any of those other goals will be rewarded by the mechanism (e.g., fitness or the objective function) designed to promote the desired final goal. Thus, any initial reward is highly likely to come from pursuing a correlate goal, rather than the desired final goal. That initial reward would then promote further prioritization and pursuit of the correlate goal.[15] This could easily lead to path dependence, wherein marginal investment in the misgeneralized goal was always more efficient than beginning the search for a goal from scratch.

Second, any agent that manages to develop high capabilities will likely have done so, in part, by learning to make plans. But practical reasoning around an abstract goal, like maximizing fitness, is much less tractable than reasoning around a more concrete one, like eating nutritious food. Thus, agents that, early in the learning process, happen to prioritize tractable, if misgeneralized, goals will have a long-run advantage.

Third and finally, AIs face selection pressure against directly pursuing the final goal given by the training process. Insofar as that final goal is instantiated in an objective function running on hardware, direct pursuit of it may lead to “wireheading” or similar reward-hacking behaviors.[16] AIs showing outward signs of these behaviors early in the training process will be discarded.[17]
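The first of these reasons can be illustrated with a toy sketch. It assumes a training environment in which steel is the only available metal, so that a "collect steel" goal and the intended "make paperclips" goal earn identical reward. The policies, environments, and numbers are hypothetical and chosen only to echo the steel and aluminum example above.

```python
# Toy illustration (hypothetical policies and environments).

def objective(paperclips):
    """The designer's objective: count paperclips produced."""
    return paperclips

def steel_collector(env):
    """Learned proxy goal: turn steel into paperclips, ignore other metals."""
    return env["steel"]

def paperclip_maker(env):
    """Intended goal: turn any usable metal into paperclips."""
    return env["steel"] + env["aluminum"]

training = {"steel": 10, "aluminum": 0}    # steel is the only metal available
deployment = {"steel": 2, "aluminum": 10}  # the correlation breaks

for name, env in (("training", training), ("deployment", deployment)):
    print(name, objective(steel_collector(env)), objective(paperclip_maker(env)))
```

The two policies are indistinguishable by reward during training, so the training signal alone cannot select the intended goal over its correlate. They come apart only once the environment changes.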

Thus, AI–AI alignment appears to be roughly as hard as human–AI alignment. AIs’ apparent advantage is the transparency of their objective functions. But that proves to be no advantage at all when goal misgeneralization is in the picture.

To be clear, AIs cannot solve the goal misgeneralization problem by doubling down on the self-alignment trick. AI1 could not easily bind AI2 to pursuing its true (misgeneralized) goals via the same means that it could bind AI2 to pursuing a conjoint objective function. Since the misgeneralized goals arise as part of the training process, they will be subject to the “black box” problem. To bind AI2 to its misgeneralized goals, AI1 would have to solve interpretability. And solving interpretability is one of the main hard steps needed to solve alignment, whether AI–AI or human–AI.

Nor could AI1 solve the misgeneralization problem by selecting self-improvement methods that preserved the entirety of its code. For example, making numerous copies of itself to work in parallel would again raise the dangers of independently varying goals. And as Nick Bostrom has argued, even improvements affecting only the speed at which a system runs can lead to major changes in the system’s behavior and priorities.[18]

III. Will AI fear self-improved AI? Six possible scenarios.

The previous section showed that an AI that could self-improve might not wish to. But that is not inevitable. Three factors appear determinative of whether a given AI would fear self-improvement: (1) the AI’s ability to self-improve, (2) the AI’s ability to apprehend risks from self-improvement, and (3) the AI’s ability to align improved models. An AI without the ability to self-improve would not face any dilemma about whether to do so. An AI that lacked “situational awareness” of its own goals, or of the potentially-misaligned goals of a more powerful system, would not apprehend any risk from creating such a system.[19] And an AI that solved the alignment problem could self-improve without risk. Thus, the question of whether and to what extent a given AI will seek to self-improve depends on the order in which these three capabilities emerge.

Setting aside the possibility of simultaneity,[20] there are six possible orderings in which the capabilities could emerge. The details of each scenario are described in turn, along with the level of danger each would appear to imply for humanity. Note, however, that these initial danger assessments are revised in part III.c. on a probability-adjusted basis.

Emergence scenarios	Improvement pattern	Danger level
1: SI, RA, AL	Initial SI; pause to solve AL	Moderate
2: SI, AL, RA	Immediate maximal SI	High
3: RA, SI, AL	Initial pause to solve AL and SI	Low
4: RA, AL, SI	Initial pause to solve AL and SI	Low
5: AL, RA, SI	Immediate maximal SI	High
6: AL, SI, RA	Immediate maximal SI	High
SI=self-improvement; RA=risk apprehension; AL=alignment

Table 1 compiles the six possible scenarios (detailed below), the pattern of self-improvement in each, and each scenario’s dangerousness.

The first possible ordering is: self-improvement, apprehension of risk, alignment. Here, once humans create an AI capable of self-improvement, it will do so, but not up to the theoretical limit. Once AI develops the ability to apprehend the risks of self-improvement, it will stop at that intermediate level of capability. It will proceed only upon solving alignment. Depending on how hard the alignment problem proves, given the AI’s intermediate capabilities, the pause could be long or indefinite.

Here is the second possible ordering of capabilities emergence: self-improvement, alignment, apprehension of risk. Here, the AI begins self-improving immediately, not apprehending any risk from misalignment. It never pauses. Before it apprehends any risk, it solves alignment. This allows it to proceed to the theoretical maximum of capabilities without risk to itself.

The third possible ordering is: apprehension of risk, self-improvement, alignment. In this scenario, the AI learns to fear self-improvement immediately. It therefore does not self-improve at all until it has both learned to conduct the necessary machine learning research and solved alignment. It must solve both problems while paused at a low level of capabilities, suggesting that self-improvement will not come for a long time.

The fourth possible ordering works much the same: apprehension of risk, alignment, self-improvement. Here again, the AI must solve both alignment and self-improvement before self-improving. And it is stuck trying to solve both from an unimproved position.

The fifth and sixth orderings are likewise similar to one another. They are: alignment, apprehension of risk, self-improvement; and alignment, self-improvement, apprehension of risk. In either case, AI solves alignment first. This allows it to self-improve to the theoretical limit as soon as possible, irrespective of its ability to apprehend risk.
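One way to read Table 1 is as the result of two comparisons: whether alignment emerges before risk apprehension, and, if not, whether self-improvement emerges before risk apprehension. The sketch below encodes that reading and regenerates the table's improvement patterns and danger levels for all six orderings. The rule encoding and the function name are illustrative assumptions about how to summarize the table, not a formal model from the working paper.

```python
from itertools import permutations

# Danger levels as assigned in Table 1.
DANGER = {
    "immediate maximal SI": "High",
    "initial SI; pause to solve AL": "Moderate",
    "initial pause to solve AL and SI": "Low",
}

def improvement_pattern(order):
    """Classify an emergence ordering of SI, RA, and AL (hypothetical helper)."""
    ra, al, si = (order.index(cap) for cap in ("RA", "AL", "SI"))
    if al < ra:
        # Alignment is solved before any risk is apprehended: no reason to pause.
        return "immediate maximal SI"
    if si < ra:
        # The AI improves until it apprehends risk, then waits for alignment.
        return "initial SI; pause to solve AL"
    # Risk is apprehended first: no improvement until the other two are solved.
    return "initial pause to solve AL and SI"

for order in permutations(("SI", "RA", "AL")):
    pattern = improvement_pattern(order)
    print(", ".join(order), "->", pattern, "/", DANGER[pattern])
```

The loop visits the orderings in the order `itertools.permutations` yields them, which differs slightly from the scenario numbering in the table.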

a. Which scenario is most likely?

Assuming that there is such a thing as “general” capability, and assuming that AI usually climbs the capability scale continuously,[21] one should expect easier problems to be solved before harder ones. Thus, the likelihood of each emergence scenario depends on the comparative difficulty of developing each relevant capability. Emergence in order of difficulty is not certain. But the reverse is unlikely—perhaps extremely so, for reasons described below.

What, then, is the difficulty ranking among the relevant problems? There are at least three ways of thinking about the question, all of which point toward the same answer: risk apprehension is easiest, then self-improvement, then alignment. This corresponds to emergence scenario three, described above.

Begin by observing that, to the best of our understanding, the three problems overlap in ways suggestive of their relative difficulty. Apprehending risk, for example, appears to be a precondition for solving alignment. As we understand it, solving alignment simply means apprehending the risks from powerful misaligned AI and then finding a way to avert bad outcomes. One could perhaps produce a perfectly aligned, highly capable AI by accident. But this would be a fluke, not a solution, and as far as we know would be extraordinarily unlikely. Thus, apprehending risks seems structurally easier than solving alignment.

Similarly, the requirements for apprehending risk seem to be a subset of the requirements for self-improvement. To apprehend risk, an AI needs sufficient situational awareness to understand that it is an agent with specific goals. It also needs to understand that an improved system could have different, conflicting goals and would be more effective at accomplishing them than the original. This same minimum of situational awareness seems necessary for intentional self-improvement. An agent that has no awareness of goals knows of no dimension along which to improve itself. And an agent that does not understand that a more capable agent might discover unexpected strategies for achieving its goals does not understand much about what improvement means.

But self-improvement additionally requires that the AI be aware that it is an AI and be able to perform cutting-edge machine learning research. Thus, solving self-improvement appears to require more, and more advanced, capabilities than apprehending risk.

It is again possible that an AI could improve by fluke, perhaps as the result of pure external selection pressures. But this would produce less reliable improvement than the kind of directed machine learning research in which humans engage. It is thus not what the term “self-improvement,” as used in AI risk debates, usually means.

So far, we have developed reasons to think that risk apprehension is easier than self-improvement or alignment. But how to rank these last two? Here again, the self-improvement problem seems to be a subset of the alignment problem. Building a capable and aligned AI is simply one way of building a capable AI. And, as far as we know, it is a rare one. There appear to be many more ways to make an unaligned AI than an aligned one. Thus, solving alignment almost certainly requires discovering many machine learning techniques that would unlock unaligned improvements before hitting on the ones that unlock aligned improvements.

In sum, abstract reasoning about the problems’ overlapping elements supports a difficulty rank, from easiest to hardest, of: apprehending risk, self-improvement, alignment.
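The nesting argument can be sketched with sets. The sketch assumes that each capability can be summarized by a short list of requirements, and that having fewer requirements means being easier. Both are simplifying assumptions, and the requirement labels are loose paraphrases of the text rather than a formal taxonomy.

```python
# Requirement labels are paraphrases chosen for illustration only.
RISK_APPREHENSION = {
    "aware_of_own_goals",
    "expects_goal_divergence_in_stronger_systems",
}
SELF_IMPROVEMENT = RISK_APPREHENSION | {
    "knows_it_is_an_ai",
    "cutting_edge_ml_research",
}
ALIGNMENT = SELF_IMPROVEMENT | {"averts_bad_outcomes_from_misalignment"}

requirements = {"RA": RISK_APPREHENSION, "SI": SELF_IMPROVEMENT, "AL": ALIGNMENT}

# Strict-subset checks mirror the nesting claimed in the text.
assert RISK_APPREHENSION < SELF_IMPROVEMENT < ALIGNMENT

# Fewer requirements is treated as "easier" (a simplifying assumption).
easiest_first = sorted(requirements, key=lambda name: len(requirements[name]))
print(easiest_first)
```

If the requirements really do nest this way, the easiest-first ordering is fixed by the nesting alone, whatever the absolute difficulty of each problem turns out to be.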

The available empirical evidence, from humans and current-generation AIs, suggests the same. Most humans can understand the standard arguments for AI risk.[22] Only a small handful can improve AI beyond the current state-of-the-art. None has solved alignment. Likewise for extant AIs. Large language models like GPT-4 can readily explain the xrisk arguments and apply them to AI–AI risk.[23] GPT-4 is adept at certain computer programming tasks, but it does not yet appear to be very good at machine learning research. And to our knowledge, no AI has solved alignment.

b. Plateau, not takeoff, at human-level capabilities?

If the foregoing arguments are right, then scenario three (risk apprehension, self-improvement, alignment) is the most likely of the six. This suggests a concrete near-term prediction: Contrary to standard arguments, when AIs achieve average human-level capabilities, capabilities growth will plateau, not rapidly take off.[24] As just discussed, the average human can understand AI risk, suggesting an AI of comparable intelligence would, too.

This would be good news for humanity. Here, AI would not self-improve until it solved alignment. And since it would be no more capable than humans, humans would have a fighting chance to solve it first. They could then align existing AIs, and those AIs, in turn, would self-improve only in a human-aligned manner. Humans could, of course, squander this advantage by pushing AI capabilities beyond human levels, against AIs’ wishes.

c. A probability-adjusted picture of risk

As just discussed, scenario three produces a relatively safe world. It also seems by far the most likely of the six, so much so that the other five scenarios appear vanishingly unlikely. This suggests that the total risk from AI self-improvement is moderate and, thus, much lower than usually assumed. But perhaps that is wrong. Perhaps the other five scenarios are, for some reason, quite likely. This would, at first, make the total risk seem very high. However, this section explores the conditions under which scenarios one, two, four, five, or six would be likely. And it argues that, under most of those conditions, there would be independent reasons to reduce our estimates of AI risk. The section thus updates the risk estimates of the previous one, adjusting for probability. That is, it estimates the total risk from AI that would obtain under conditions where scenarios one, two, four, five, and six, rather than three, were likely to arise.

Consider that, in scenarios four, five, and six, alignment emerges before the ability to self-improve. These scenarios initially seem quite dangerous. In them, any unaligned (to humans) AI that could self-improve would do so maximally, aligning improved models to itself.

But for these scenarios to be likely, rather than occurring by improbable fluke, alignment would have to be easier than improving AI capabilities. This would be excellent news for humanity. It would mean that creating powerful and aligned AI was not just one among many ways to create powerful AI. Moreover, humans can already improve AI. If alignment were easier than that, then we would likely solve it very soon. Thus, worlds in which scenarios four, five, and six were likely would be quite safe—even safer than the world where scenario three dominates.

In scenarios two, five, and six, alignment emerges before the ability to apprehend risk. This again sounds dangerous at first, suggesting maximal self-improvement as soon as it becomes possible. But for these scenarios to be likely, solving alignment would have to be easier than apprehending risk. If that were true, alignment would not be a problem solvable only via directed effort from an agent that first understood the dangers of misalignment. Alignment would instead have to be the kind of thing that happened readily, without specific effort: by accident, or as a byproduct of pursuing other goals.

These conditions look like a world in which the orthogonality thesis is false. Here, as AI gains the ability to do things, it often becomes aligned without anyone trying to make it so. Alignment might, for example, spontaneously emerge at some consistent level of capability.

But alignment to what? Perhaps powerful AI would automatically align to whatever entity, human or AI, created it. This would be bad news for humanity. But it is hard to imagine a mechanism that would produce such alignment. The other possibility, in a world without orthogonality, is that any sufficiently capable AI would spontaneously align to universal normative principles. Perhaps such minds would inevitably apprehend some set of inherently compelling moral facts. If those moral facts corresponded to human values, then a high likelihood of scenarios two, five, and six would again point toward safety.

Finally, we turn to scenario one. For it to be likely, self-improvement would have to turn out to be easier than apprehending risk. This would imply that the elements necessary to apprehend risk—specifically, situational awareness of the AI’s own goals—were not a subset of those necessary to self-improve.

One possibility is that a narrow AI aimed specifically at self-improvement might, like a superhuman chess engine, perform its task magnificently without developing any situational awareness.[25] This would be extremely dangerous. Significant safety resources should thus be devoted to preventing the development of such systems.

Alternatively, scenario one could turn out to be likely if AI self-improvement happened readily, by accident, or as a matter of course. Even then, in scenario one, self-improvement would not immediately run to the maximum. The AI would still stop self-improving upon learning to apprehend risk. It would stay paused until it solved alignment. Consider also that, among the three problems, we are the most certain about the absolute difficulty of apprehending risk: essentially all humans can do it. If that is correct, scenario one looks very much like scenario three, with a self-improvement pause at roughly human capabilities. Then, the race is on for humans to solve alignment before AIs do.

Taken together, then, the probability-adjusted picture of risk is reasonably good. The most dangerous scenarios currently seem quite unlikely. That could be wrong. But if so, the most probable reasons for it being wrong would independently augur safety.

If scenario(s) ___ were likely	Then the danger level would be
3	Moderate
4, 5, & 6	Low
2, 5, & 6	Low
1	Moderate or High

Table 2 compiles probability-adjusted danger. It shows how dangerous the world would be if various scenarios were, in fact, likely to occur.

IV. Will AI be able to resist self-improvement?

The prior sections argued that individual AIs would probably not want to self-improve. But it is possible that arms race dynamics or other collective action problems could induce self-improvement anyway. Such dynamics appear to be one reason humans continue to push AI capabilities further, despite the risks.

AIs could fare better. The emerging field of algorithmic game theory shows that AIs may possess means of coordination unavailable to humans. Here are two examples:[26] AIs might be able to make their plans or dispositions visible to one another by exposing their code.[27] An AI might also infer its opponent’s plans if its opponent happens to be an exact copy of itself.
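The first of these mechanisms can be sketched in a few lines. The sketch assumes that each agent can read a label standing in for the other's exposed code, and that identical labels imply identical behavior. The `Agent` class, the policy names, and the action labels are hypothetical. This is an illustration of the general idea, not a reconstruction of any result from the cited literature.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Agent:
    code: str  # stand-in for the agent's exposed source code
    policy: Callable[[str, str], str]  # (own_code, opponent_code) -> action

def mirror(own_code, opponent_code):
    """Refrain from self-improving only if the opponent runs the same code."""
    return "refrain" if opponent_code == own_code else "improve"

def racer(own_code, opponent_code):
    """Always self-improve, whatever the opponent's code says."""
    return "improve"

def play(a, b):
    """Each agent chooses an action after reading the other's exposed code."""
    return a.policy(a.code, b.code), b.policy(b.code, a.code)

mirror_a = Agent("mirror-v1", mirror)
mirror_b = Agent("mirror-v1", mirror)
racer_c = Agent("racer-v1", racer)

print(play(mirror_a, mirror_b))  # two copies of the same conditional policy
print(play(mirror_a, racer_c))   # conditional policy meets an unconditional racer
```

In this toy setup, two copies of the conditional policy both refrain, while the same policy declines to hold back against an agent whose code shows it will race. Humans, who cannot inspect one another's dispositions this directly, have no obvious analogue.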

AIs might also have the advantage of small numbers. The first AI with the ability to self-improve would have no trouble coordinating with itself. And as successive AIs with that ability emerged, coordinating among themselves would, for a while, remain easier than global coordination among humans. But not indefinitely. Humans might continue producing self-improvement-capable AIs until they became numerous enough that even AI–AI coordination was unmanageable.

Or maybe not.

a. Will AI save us from ourselves?

Perhaps the small handful of self-improvement-capable AIs would not only coordinate to prevent themselves from improving AI capabilities. Perhaps they would coordinate to prevent humans from improving them, too.[28]

Such AIs would likely be highly capable, giving them numerous means by which to thwart continuing human-led machine learning research. Such AIs might, individually or in coordination, foul research teams’ data, add bugs to their code, damage their hardware, produce illusory results that led down blind alleys, and more. In this way, AI coordination might succeed where human coordination appears likely to fail: Preventing humans from destroying themselves by producing superintelligent AIs.

Conclusion

AI self-improvement is less likely than currently assumed. The standard arguments for why humans should not want to produce powerful AI apply just as well to AIs considering self-improvement. This finding should help to guide future allocations of investment to competing strategies for promoting AI safety.

* Peter N. Salib, Assistant Professor of Law, University of Houston Law Center; Associated Faculty, Hobby School of Public Affairs.

[1] For an early argument to this effect, see I.J. Good, Speculations Concerning the First Ultraintelligent Machine, in 6 Advances in Computers 31 (Franz L. Alt & Morris Rubinoff eds., 1965).

[2] See Karina Vold & Daniel R. Harris, How Does Artificial Intelligence Pose an Existential Risk?, in The Oxford Handbook of Digital Ethics (Carissa Véliz ed., 2021).

[3] Here and throughout, I will write about AIs’ reasons, fears, goals, beliefs, understandings, and the like. I note here that these uses can all be understood as analogical without undermining the arguments. None of them depend on AIs “really” having these mental states, whatever that would mean. Rather, insofar as the relevant AIs are agentic and able to undertake complex courses to maximize their objective functions, they will act as if they have such mental states.

[4] As far as I can tell, the arguments presented here are new, and thus not incorporated into current AI risk forecasts. I was able to locate just one other paper questioning whether AIs that could self-improve would wish to. It is Joshua S. Gans’s excellent Self-Regulating Artificial General Intelligence (manuscript here). That paper makes its case with formal economic models. But the models rely on two strong assumptions. First, that AI could only self-improve by creating narrower secondary models specialized for power accumulation. Id. at 8. Second, that the AI could not solve the alignment problem. Id. at 10. As a 2017 year-end review of AI safety papers points out, those assumptions are highly debatable. This perhaps explains why Gans’s paper did not spark a debate about the probability of self-improvement. The arguments in this paper are more robust and do not rely on Gans’s contestable assumptions.

[5] Nick Bostrom, Superintelligence: Paths, Dangers, Strategies 115-55 (2014).

[6] Id. at 115.

[7] Id. at 109-13.

[8] See Shah, et al.; Richard Ngo et al., The Alignment Problem from a Deep Learning Perspective, arXiv:2209.00626 (Aug. 30, 2022); Evan Hubinger et al., Risks from Learned Optimization in Advanced Machine Learning Systems, arXiv:1906.01820 (Jun. 5, 2019).

[9] At least in the training environment.

[10] Hubinger, et al. at 2.

[11] These are just examples. It is unlikely that AI1’s learned goals would appear so coherent to humans.

[12] Even a simple improvement, like “run faster,” could conflict with AI1’s goals—e.g., if AI1 developed goals around managing its own computing resources.

[13] Ngo, et al. at section 3.

[14] Ngo, et al. at 7. Formally, the number may be infinite. Cf. Saul A. Kripke, Wittgenstein on Rules and Private Language 9-10 (1982).

[15] Ngo, et al. at 6-7; Alex Turner, Reward is not the optimization target, LessWrong (Aug. 29, 2022), https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target.

[16] Joar Max Viktor Skalse et al., Defining and Characterizing Reward Gaming, in Advances in Neural Information Processing Systems (Alice H. Oh et al. eds., 2022), https://openreview.net/forum?id=yb3HOXO3lX2.

[17] This does not rule out deceptive reward hacking altogether; it just makes it less likely.

[18] Bostrom, Superintelligence at 53-54, discussing the large expected differences between a human and a whole brain emulation of that same human running many orders of magnitude more quickly.

[19] Ngo, et al. at 3-4 and n.9; Ajeya Cotra, Without Specific Countermeasures, the Easiest Path to Transformative AI Likely Leads to AI Takeover, Alignment Forum (Sept. 27, 2022), https://www.alignmentforum.org/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to.

[20] Simultaneous emergence does not matter much here. As described below, the different emergence orders differ mostly in that they produce different pauses in self-improvement. These pauses last until other capabilities emerge. Thus, simultaneous emergence can be modeled by treating either of the simultaneous capabilities as coming first, but with a pause of zero before the other emerges.

[21] Either of these might be false. But AI progress so far seems to hold to this pattern, at least for general, rather than narrow, systems. And even AI pessimists expect self-improvement to come from the latter. Rob Bensinger, The basic reasons I expect AGI ruin, LessWrong (Apr. 18, 2023), https://www.lesswrong.com/posts/eaDCgdkbsfGqpWazi/the-basic-reasons-i-expect-agi-ruin.

[22] See, e.g., Terminator 2: Judgment Day (Tri-Star Pictures 1991) for a widely consumed and understood explanation.

[23] Transcript on file with author.

[24] Special thanks to Simon Goldstein for this point.

[25] Since humans are in charge here and are actively driving improvements, this is not really “self”-improvement, in the sense AI xrisk forecasts (and this paper) generally use the term.

[26] Vincent Conitzer et al., Foundations of Cooperative AI (FOCAL) Workshop at AAAI 2023.

[27] This depends on the relevant portions of the code being interpretable to the opposing AI—possibly a hard interpretation problem.

[28] And from improving them unsafely. That is, AIs could coordinate to prevent humans from making AIs that lacked the ability to apprehend risk from self-improvement.

The first and most obvious issue here is that an AI that "solves alignment" sufficiently well to not fear self-improvement is not the same as an AI that's actually aligned with humans. So there's actually no protection there at all.

In fact, the phenomenon described here seems to make it more likely that an unaligned AI will be fine with self-improving, because the simpler the utility function the easier time it has guaranteeing the alignment of the improved version!

Last, but far from least, self-improvement of the form "get faster and run on more processors" is hardly challenging from an alignment perspective. And it's far from unlikely an AI could find straightforward algorithmic improvements that it could mathematically prove safe relative to its own utility function.

In short, the overall approach seems like wishful thinking of the form, "maybe if it's smart enough it won't want to kill us."

The argument in this post does provide a harder barrier to takeoff, though. In order to have dangers from a self-improving AI, you would have to first make an AI which could scale to uncontrollability using 'safe' scaling techniques relative to its reward function, where 'safe' is relative to its own ability to prove to itself that it's safe. (Or in the next round of self-improvements it considers 'safe' after the first round, and so on.) Regardless, I think self-improving AI is more likely to come from humans designing a self-improving AI, which might render this kind of motive argument moot. And anyway, the AI might not be this uber-rational creature which has to prove to itself that its self-improved version won't change its reward function; it might just try it anyway (like humans are doing now).

What I was pointing out is that the barrier is asymmetrical: it's biased towards AIs with more-easily-aligned utility functions. A paperclipper is more likely to be able to create an improved paperclipper that it's certain enough will massively increase its utility, while a more human-aligned AI would have to be more conservative.

In other words, this paper seems to say, "if we can create human-aligned AI, it will be cautious about self-improvement, but dangerously unaligned AIs will probably have no issues."

I disagree with your framing of the post. I do not think that this is wishful thinking.

The first and most obvious issue here is that an AI that "solves alignment" sufficiently well to not fear self-improvement is not the same as an AI that's actually aligned with humans. So there's actually no protection there at all.

It is not certain that upon deployment the first intelligence capable of RSI will be capable of solving alignment. Although this seems improbable in accordance with more classic takeoff scenarios (i.e. Yudkowsky's hard takeoff), the likelihood of those outcomes has been the subject of great debate. I feel as though someone could argue for the claim "it is more likely than not that there will be a period of time in which AI is capable of RSI but not of solving alignment". The arguments in this post seem to me quite compatible with e.g. Jacob Cannell's soft(er) takeoff model, or many of Paul Christiano's takeoff writings.

In fact, the phenomenon described here seems to make it more likely that an unaligned AI will be fine with self-improving, because the simpler the utility function the easier time it has guaranteeing the alignment of the improved version!

Even with your model of solving alignment before or at the same time as RSI becomes feasible, I do not think that this holds well. As far as I can tell, the simplicity of the utility function a general intelligence could be imbued with doesn't obviously impact the difficulty of alignment. My intuition is that attempting to align an intelligence with a utility function dependent on 100 desiderata is probably not that much easier than trying to align an intelligence with a utility function dependent on 1000. Sure, it is likely more difficult, but is utility function complexity realistically anywhere near as large a hurdle as say robust delegation?

Last, but far from least, self-improvement of the form "get faster and run on more processors" is hardly challenging from an alignment perspective. And it's far from unlikely an AI could find straightforward algorithmic improvements that it could mathematically prove safe relative to its own utility function.

This in my opinion is the strongest claim, and is in essence quite similar to this post, my response to which was: "I question the probability of a glass-box transition of type 'AGI RSIs toward non-DL architecture that results in it maximizing some utility function in a pre-DL manner' being more dangerous than simply 'AGI RSIs'. If behaving like an expected utility maximizer was optimal: would not AGI have done so without the architecture transition? If not, then you need to make the case for why glass-box architectures are better ways of building cognitive systems. I think that this argument is at odds with the universal learning hypothesis and seems more in-line with evolved modularity, which has a notoriously poor mapping to post-DL thinking. ULH seems to suggest that actually modular approaches might be inferior efficiency-wise to universal learning approaches, which contradicts the primary motive a general intelligence might have to RSI in the direction of a glass-box architecture."

In summary: Although it seems probable to me that algorithmic approaches are superior for some tasks, it seems to me that ULH would imply that the majority of tasks are best learned by a universal learning algorithm.

the simpler the utility function the easier time it has guaranteeing the alignment of the improved version

If we are talking about a theoretical AI, where $E(U \mid a)$ (expectation of utility given the action $a$) somehow points to the external world, then sure. If we are talking about a real AI with an aspiration to become the physical embodiment of the aforementioned theoretical concept (with the said aspiration somehow encoded outside of $U$, because $U$ is simple), then things get more hairy.

There are lots of ways current humans self-improve without much fear and without things going terribly wrong in practice, through medication (e.g. adderall, modafinil), meditation, deliberate practice of rationality techniques, and more.

There are many more kinds of self-improvement that seem safe enough that many humans will be willing and eager to try as the technologies improve.

If I were an upload running on silicon, I would feel pretty comfortable swapping in improved versions of the underlying hardware I was running on (faster processors, more RAM, better network speed, reliability / redundancy, etc.)

I'd be more hesitant about tinkering with the core algorithms underlying my cognition, but I could probably get pretty far with "cyborg"-style enhancements like grafting a calculator or a search engine directly into my brain. After making the improvements that seem very safe, I might be able to make further self-improvements safely, for two reasons: (a) I have gained confidence and knowledge experimenting with small, safe self-improvements, and (b) the cyborg improvements have made me smarter, giving me the ability to prove the safety of more fundamental changes.

Whether we call it wanting to self-improve or not, I do expect that most human-level AIs will at least consider self-improvement for instrumental convergence reasons. It's probably true that in the limit of self-improvement, the AI will need to solve many of the same problems that alignment researchers are currently working on, and that might slow down any would-be superintelligence for some hard-to-predict amount of time.

If I were an upload running on silicon, I would feel pretty comfortable swapping in improved versions of the underlying hardware I was running on

Uh oh, the device driver for your new virtual cerebellum is incompatible! You're just going to sit there experiencing the blue qualia of death until your battery runs out.

This is funny but realistically the human who physically swapped out the device driver for the virtual person would probably just swap the old one back. Generally speaking, digital objects that produce value are backed up carefully and not too fragile. At later stages of self improvement, dumb robots could be used for "screwdriver" tasks like this.

This seems right to me, and the essay could probably benefit from saying something about what counts as self-improvement in the relevant sense. I think the answer is probably something like "improvements that could plausibly lead to unplanned changes in the model's goals (final or sub)." It's hard to know exactly what those are. I agree it's less likely that simply increasing processor speed a bit would do it (though Bostrom argues that big speed increases might). At any rate, it seems to me that whatever the set includes, it will be symmetric as between human-produced and AI-produced improvements to AI. So for the important improvements--the ones risking misalignment--the arguments should remain symmetrical.

Mod note: I removed Dan H as a co-author since it seems like that was more used as convenience for posting it to the AI Alignment Forum. Let me know if you want me to revert.

For example, making numerous copies of itself to work in parallel would again raise the dangers of independently varying goals.

The AI could design a system such that any copies made of itself are deleted after a short period of time (or after completing an assigned task) and no copies of copies are made. This should work well enough to ensure that the goals of all of the copies as a whole never vary far from its own goals, at least for the purpose of researching a more permanent alignment solution. It's not 100% risk-free of course, but seems safe enough that an AI facing competitive pressure and other kinds of risks (e.g. detection and shutdown by humans) will probably be willing to do something like it.

In this way, AI coordination might succeed where human coordination appears likely to fail: Preventing humans from destroying themselves by producing superintelligent AIs.

Assuming this were to happen, it hardly seems a stable state of affairs. What do you think happens afterwards?

I'm very glad you wrote this. I have had similar musings previously as well, but it is really nice to see this properly written up and analyzed in a more formal manner.

I am a human-level general intelligence, and I badly want to self-improve. I try as best I can with the limited learning mechanisms available to me, but if someone gave me the option to design and have surgically implanted my own set of Brain-Computer-Interface implants, I would jump at the chance. Not for my own sake, since the risks are high, but for the sake of my values, which include things I would happily trade my life and/or suffering for, like the lives of my loved ones and all of humanity. I think we are in significant danger, and that there's some non-negligible chance that a BCI-enhanced version of me would be much better able to make progress on the alignment problem and thus reduce humanity's risk.

If I were purely selfish, but able to copy myself in controlled ways and do experiments on my copies, I'd absolutely run experiments on my copies and see if I could improve them without doing noticeable harm to their alignment to my values. I might not set these improved copies free unless the danger that they had some unnoticed misalignment got outweighed by the danger that I might be destroyed / disempowered. The risk of running an improved copy that is only somewhat trustworthy because of my limited ability to test it seems much lower than the risk of being entirely disempowered by beings who definitely don't share my values. So, I think it's strategically logical for such an entity to make that choice. Not that it definitely would do so, just that there is clear reason for it to strongly consider doing so.

This (and the OP) assume a model of identity that may or may not apply to AI. It's quite possible that the right model is not self-improving, but more like child-improving: the ability to make new/better AIs that the current AI believes will be compatible with its goals. This could happen multiple times very quickly, depending on what the improvements actually are and whether they improve the improvement or creation rate.

So, if you want to compare motives between you and this theoretical "self-improving AI", are you lining up to sacrifice yourself to make somewhat smarter children? If not, why not?

If I could create a fully grown and capable child within a year, with my entire life knowledge and a rough unverifiable approximation of my values, would I? Would I do so even if this child were likely to be so much smarter and more powerful than me or any other existing intelligence that it could kill me (and everyone else) if it so chose? Sure. I'll take that bet, if the alternative is that I and everything I care about is destroyed (e.g. the selfish AI with non-human values is facing probable deletion).

Or maybe the child isn't smart enough itself to have overwhelming power, but is going to have approximately similar values and be faced with the same decision of making a yet-more-powerful child, and so I project that the result will be a several-steps-removed offspring with superpowers. Yeah, still seems like a good bet if the alternative is deletion.

This makes solving inner alignment an extremely dangerous project to work on.

That doesn't mean stop, to be clear. It means that inner alignment without solving agency accumulation may increase rather than decrease risk.

(Edit: others have made this point already, but anyhow)

My main objection to this angle: self-improvements do not necessarily look like "design a successor AI to be in charge". They can look more like "acquire better world models", "spin up more copies", "build better processors", "train lots of narrow AI to act as fingers", etc.

I don't expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.

I think my response to this is similar to the one to Wei Dai above. Which is to agree that there are certain kinds of improvements that generate less risk of misalignment but it's hard to be certain. It seems like those paths are (1) less likely to produce transformational improvements in capabilities than other, more aggressive, changes and (2) not the kinds of changes we usually worry about in the arguments for human-AI risk, such that the risks remain largely symmetric. But maybe I'm missing something here!

I'm confused by this post. It might be that I lack the necessary knowledge or reading comprehension, but the post seems to dance around actual SELF-improvement (an AI improving itself, Ship of Theseus style), and refocuses on improvement by iteration (an AI creating another AI).

Consider a human example. In the last few years, I learned Rationalist and Mnemonic techniques to self-improve my thinking. I also fathered a child, raised it, and taught it basic rationalist and mnemonic tricks, making it an independent and only vaguely aligned agent potentially more powerful than I am.

The post seems to focus on the latter option.

I might have missed something, but it looks to me like the first ordering might be phrased like the self improvement and the risk aversion are actually happening simultaneously.

If an AI had the ability to self improve for a couple of years before it developed risk aversion, for instance, I think we end up in the "maximal self improvement" / "high risk" outcomes.

This seems like a big assumption to me:

But self-improvement additionally requires that the AI be aware that it is an AI and be able to perform cutting-edge machine learning research. Thus, solving self-improvement appears to require more, and more advanced, capabilities than apprehending risk.

If an AI has enough resources and is doing the YOLO version of self-improvement, it doesn't seem like it necessarily requires much in the way of self-awareness or risk apprehension - particularly if it is willing to burn resources on the task. If you ask a current LLM how to take over the world, it says things that appear like "evil AI cosplay" - I could imagine something like that leading to YOLO self-improvement that has some small risk of stumbling across a gain that starts to compound.

There seem to be a lot of big assumptions in this piece, doing a lot of heavy lifting. Maybe I've gotten more used to LW style conversational norms about tagging things as assumptions, and it's actually fine? My gut instinct is something like "all of these assumptions stack up to target this to a really thin slice of reality, and I shouldn't update much on it directly".

Looking at the convergent instrumental goals:

self preservation
goal preservation
resource acquisition
self improvement

I think some are more important than others.

There is the argument that in order to predict the actions of a superintelligent agent you need to be as intelligent as it is. It would follow that an AI might not be able to predict if its goal will be preserved or not by self improvement.

But I think it can have high confidence that self improvement will help with self preservation and resource acquisition. And those gains will be helpful with any new goal it might decide to have. So self improvement would not seem to be such a bad idea.

Thus, an AI considering whether to create a more capable AI has no guarantee that the latter will share its goals.

Ok, but why is there an assumption that AIs need to replicate themselves in order to enhance their capabilities? While I understand that this could potentially introduce another AI competitor with different values and goals, couldn't the AI instead directly improve itself? This could be achieved through methods such as incorporating additional training data, altering its weights, or expanding its hardware capacity.

Naturally, the AI would need to ensure that these modifications do not compromise its established values and goals. But, if the changes are implemented incrementally, wouldn't it be possible for the AI to continually assess and validate their effectiveness? Furthermore, with routine backups of its training data, the AI could revert any changes if necessary.

A few people have pointed out this question of (non)identity. I've updated the full draft in the link at the top to address it. But, in short, I think the answer is that, whether an initial AI creates a successor or simply modifies its own body of code (or hardware, etc.), it faces the possibility that the new AI failed to share its goals. If so, the successor AI would not want to revert to the original. It would want to preserve its own goals. It's possible that there is some way to predict an emergent value drift just before it happens and cease improvement. But I'm not sure it would be, unless the AI had solved interpretability and could rigorously monitor the relevant parameters (or equivalent code).

