The goal of this post is to start a collaborative analysis. Please post papers, links, or other evidence I have missed in the comments. I want to get some of LessWrong's collective mind focused on this question.


A core component of the classical case for AI risk is the potential for AGI models to recursively self-improve (RSI) and hence dramatically increase in capabilities once some threshold of intelligence is passed. Specifically, it is argued that an only somewhat superhuman AGI will rapidly be able to bootstrap itself into an exceptionally powerful superintelligence able to take over the world and impose its values on us against all opposition. 

A lot of the debates on fast vs slow takeoffs hinge on the feasibility and dynamics of the process of RSI, as do many potential counters to AI risk. If strong takeoff is inevitable, then strategies like boxing, impact regularization, myopia, human-in-the-loop auditing, interpretability and so on are intrinsically doomed, since the AGI will simply become too powerful too quickly and can thus break out of any box or outwit any human operators. In such a world, iterative development of AGI safety techniques is doomed since the first AGI we build will immediately explode in capabilities. To survive in such a world, we need to design a foolproof solution to alignment before building the first AGIs.

On the other hand, if RSI is difficult and slow, then there is a chance that these techniques will succeed at containing any proto-AGIs that we build and prevent them from bootstrapping themselves to superintelligence. This doesn't necessarily mean that AI safety is solved, just that the world will likely look much more like a slow takeoff where various AGIs start participating in the economy and slowly accruing power, or giving it to various human actors if alignment is at least partially solved. It also means that humanity as a whole has a lot of shots at alignment, so that we can use iterative techniques to discover failures empirically and then correct them.

There is also a world where strong RSI is possible and relatively easy, but containment techniques like boxing are powerful enough to have a good chance of containing near-future proto-AGIs. In this case, we end up in a few-shot alignment world where we can contain and experiment upon AGIs for a small number of experiments or for some length of time, but eventually either an AGI will escape its boxed containment or some other unaligned AGI will be built by some group that is careless or does not care much about safety. This means that we still have to solve the strong case of alignment, but can make a few experiments on a time limit.

Understanding which world we are in has significant implications both for our p(doom) and for the strategies we can use to get to alignment. For instance, if the moment we reach AGI it starts recursively rewriting its own source code and retraining itself so as to rapidly become a superintelligence, then both boxing techniques and interpretability will likely be fruitless. On the other hand, research here could be very valuable if RSI is slow or near-term AGIs can be prevented from FOOMing.


Probably the best current writeup about the possibilities of RSI remains Intelligence Explosion Microeconomics, which uses a simple multiplicative model borrowed from nuclear fission to model intelligence. In this model, everything depends on the critical value of expansion k, which is how many neutrons each neutron collision results in. If k > 1, then each collision absorbs one neutron and produces more than 1, leading to an exponential explosion in neutron count. If we directly analogize this process to an intelligence explosion, we observe that the critical parameter k in this context is simply the cognitive returns on intelligence -- i.e. how much more intelligence you can produce by gaining an additional unit of intelligence.

If the returns k are k > 1 -- i.e. obtaining one unit of 'intelligence' makes you better able to increase your intelligence by more than one unit -- then there is an exponential growth in intelligence and an intelligence explosion. If returns are k < 1, then each unit of intelligence becomes harder and harder to obtain and there is a smooth, exponentially slow convergence to some maximum intelligence level -- an intelligence fizzle. If returns are exactly 1 then there is a steady linear growth in intelligence -- an intelligence combustion. Of course sustained returns of exactly 1 are vanishingly unlikely in practice, but it is possible that the average returns are 1 over some relevant range[1].
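The three regimes can be illustrated with a toy simulation. This is only a sketch under a simplifying assumption that is mine rather than the report's: each improvement step yields a gain equal to k times the previous gain, with k held constant. The function name and default values are hypothetical.

```python
def simulate_returns(k, steps=20, initial=1.0, first_gain=1.0):
    """Toy criticality model: each gain is k times the previous gain."""
    level = initial
    gain = first_gain
    trajectory = [level]
    for _ in range(steps):
        level += gain   # apply the current improvement
        gain *= k       # the next improvement is scaled by the returns k
        trajectory.append(level)
    return trajectory


# Compare a fizzle (k < 1), a combustion (k = 1) and an explosion (k > 1).
for k in (0.5, 1.0, 1.5):
    final_level = simulate_returns(k)[-1]
    print(f"k={k}: level after 20 steps = {final_level:.2f}")
```

With k < 1 the gains form a convergent geometric series, so the level approaches initial + first_gain / (1 - k); with k = 1 it grows by a fixed amount per step; with k > 1 the gains themselves grow geometrically.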

In the original report, there are a number of arguments which try to estimate or bound k using various considerations. However, these are almost all outside-view considerations based either on human evolution, or general algorithms from computer science. This is because at the time (2013), modern ML was in its infancy. However, nearly a decade later, and after stunning successes in ML, I suspect there is a lot more evidence around now that we can use to create a more bounded inside-view model of near-future AGI if it can be built with current ML techniques. Updating the RSI model and our estimates of k seems extremely important given the centrality of RSI to AI risk models as well as to alignment strategy.

Here are some potential areas where we now have more information than before thanks to ML research. I would love to hear about others or get more analysis on these.
 

Scaling with compute and data: This is perhaps where we have made the most progress since 2013. We now have well-established scaling laws for how ML performance scales with data and compute. Different scaling laws have been proposed -- first Kaplan et al, now Chinchilla -- and we will likely see further refinements over the coming years. However, all the scaling laws agree on a power-law shape which is strongly sublinear, thus implying that k<1. That is, it requires ever more compute and data for a linear increase in performance. The primary uncertainty here is to what degree lower loss is associated with greater ability to access more compute. If this relationship is weaker than a power law, however, then we have k<1. This means it appears unlikely we will see rapid takeoff solely by scaling existing models. It is possible, though, that scaling might cause a model to cross some capabilities threshold that enables more powerful RSI to occur.
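To make the sublinearity concrete, here is a minimal sketch assuming loss follows a pure power law in compute, L(C) = a * C^(-alpha), with no irreducible-loss term. The constants a = 10 and alpha = 0.05 are illustrative placeholders, not values taken from any particular paper.

```python
def loss(compute, a=10.0, alpha=0.05):
    """Assumed pure power law: loss = a * compute ** (-alpha)."""
    return a * compute ** (-alpha)


def compute_multiplier(loss_ratio, alpha=0.05):
    """Factor by which compute must grow to scale the loss by loss_ratio (< 1)."""
    # From L(m * C) / L(C) = m ** (-alpha) = loss_ratio, solve for m.
    return loss_ratio ** (-1.0 / alpha)


# Under these assumed constants, how much more compute does each cut in loss cost?
for ratio in (0.9, 0.8, 0.5):
    print(f"loss x{ratio}: compute x{compute_multiplier(ratio):.3g}")
```

The point of the sketch is that under a small exponent, each further proportional reduction in loss multiplies the compute bill by a large factor, which is the sense in which pure scaling looks like k<1.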
 

Another important point is that the scaling laws we have are almost all for self-supervised language model training, and how other ML models and paradigms perform is far more uncertain. We have a fair bit of early work on scaling laws in RL, as well as RLHF, and the earliest scaling laws paper found power laws in vision and audio recognition as well as text. However, there are many regimes, including ones especially relevant for alignment, such as model-based planning and RL self-play, for which we have no extant scaling laws. Given, however, that DeepMind and OpenAI both started out with and then appear to have dropped scaling RL in favour of scaling language models, this is weak evidence to me that the scaling there is worse, not better, than in self-supervised learning. I would be very happy to be corrected here, though, by someone with more information about why these orgs made these strategic decisions.

An important point about the scaling laws we have found so far is that their scaling coefficients appear very close to constant across orders of magnitude of scale, which means we can make pretty accurate extrapolative predictions about new models. This is a great property, if it holds, since it means that k is likely approximately constant around the human intelligence level and makes it seem unlikely (but not impossible) that we will suddenly veer into a realm of sudden increases of k just above human level. If this scaling property holds for more direct forms of RSI then we can get good estimates of k in smaller non-dangerous models and use that to predict the RSI-capacity of near-term AGI.
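The extrapolation idea can be sketched as a straight-line fit in log-log space. The data below is synthetic, generated from an assumed law y = 5 * x^(-0.1) purely for illustration, and the helper names are hypothetical; real measurements would be noisy and might include an irreducible-loss floor that this simple fit ignores.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                       # slope = power-law exponent
    a = math.exp(mean_y - b * mean_x)   # intercept = prefactor
    return a, b


def predict(a, b, x):
    """Evaluate the fitted power law at scale x."""
    return a * x ** b


# Synthetic "small model" measurements (made up, not real results).
small_scales = [1e3, 1e4, 1e5, 1e6]
small_losses = [5.0 * s ** -0.1 for s in small_scales]

a, b = fit_power_law(small_scales, small_losses)
extrapolated = predict(a, b, 1e9)  # prediction three orders of magnitude beyond the data
print(f"fitted exponent b = {b:.3f}, extrapolated loss at 1e9 = {extrapolated:.3f}")
```

The extrapolation is only as trustworthy as the assumption that the exponent stays constant beyond the fitted range, which is exactly the property in question.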

Finetuning on self-generated data: This year a bunch of papers came out demonstrating self-improvement in large language models. The typical way this works is that you use a language model to produce generations in some enhanced way -- for instance with chain-of-thought prompting, or by asking it to critique its own responses -- and then fine-tune it on the generations. This is a rudimentary form of self-improvement and there is no reason it cannot be done iteratively -- i.e. generate, then finetune, then re-generate, then re-finetune, and so on. I don't know of any work that has looked at the returns per iteration you get if you do this -- my guess (65% confidence) is that you get diminishing returns in general, but there could be clever ways around this. The performance of iterative finetuning methods in general seems likely to provide us important evidence about RSI, as it is perhaps one of the simplest self-improvement techniques.
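One crude way to read returns per iteration off such an experiment is to compare successive gains in a benchmark score. The sketch below assumes you have one score per generate-then-finetune round; the function name is hypothetical and the scores are invented for illustration, not taken from any paper.

```python
def per_iteration_returns(scores):
    """Ratio of each gain to the previous gain: a rough empirical stand-in for k."""
    gains = [later - earlier for earlier, later in zip(scores, scores[1:])]
    ratios = []
    for prev_gain, next_gain in zip(gains, gains[1:]):
        if prev_gain <= 0:
            ratios.append(None)  # ratio is undefined once improvement has stalled
        else:
            ratios.append(next_gain / prev_gain)
    return ratios


# Hypothetical benchmark accuracies after 0, 1, 2 and 3 rounds (not real results).
hypothetical_scores = [50.0, 58.0, 62.0, 64.0]
print(per_iteration_returns(hypothetical_scores))
```

Ratios that stay below 1 across rounds would point towards diminishing returns, while ratios above 1 would suggest compounding gains. Benchmark scores saturate at 100%, so a loss-like or unbounded metric may be a better input in practice.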

Algorithmic self-improvement: This is perhaps the case where, as far as I know, current ML has had the least success and which we know the least about. There have been various attempts at automated architecture search as well as meta-learning better optimizers, but as far as I know none of them have been super successful -- i.e. there are no automatically designed ML methods in widespread current usage. Current ML tools are getting pretty good at coding now, however, so perhaps it is not far off that we will be using ML tools to code better ML tools. In this case we need to try to figure out the scaling laws ASAP to estimate k.

Does anybody know of any more evidence from current ML about this? Things I would be especially interested in include:

1.) Scaling laws for other ML tasks and/or architectures other than language modelling and transformers beyond what is cited here.

2.) More examples of scaling laws or guarantees for RL agents, especially those including model-based planners. How well does self-play, as in AlphaStar, scale?

3.) Scaling laws for LLM self-improvement behaviour?

4.) Any kind of scaling for things like neural architecture search or meta-learning in general?

5.) Recent evidence on any of the arguments in the original microeconomics paper. For instance, there are now more arguments and evidence that the difference between Homo sapiens and other species is driven primarily by increased compute/hardware rather than algorithmic innovations. Also, that returns on cognitive investment might still be large, since brain size could be bottlenecked by things like cooling rather than pure ATP.

6.) Better theoretical models for RSI than just simple criticality.
 

[1] Heterogeneity of returns to intelligence. Technically, in this model, we assume that the returns on intelligence are fixed -- i.e. k is a constant. It is more likely that the returns on intelligence are themselves a function of the intelligence level I itself, as well as the environment E that the agent is embedded in, and its goals G -- i.e. k = k(I, E, G). Clearly, if the environment can be easily 'solved' with simple intelligence -- such as a game of tic-tac-toe -- then the returns on greater intelligence are 0. Similarly, if the goals of the agent are trivial to achieve at some intelligence level then the returns on further intelligence improvements are also very low. For the moment, let's take the goals and environment out of consideration by assuming that the environment is the maximally complex one of the whole universe and the agent's goals are sufficiently challenging -- i.e. take over the universe to build paperclips -- that additional intelligence will be clearly helpful for the foreseeable future. Further, let's assume there is also some maximum valuable intelligence level, even if it might be a Kardashev 3-level literal galaxy-brain. This implies that at some point, even if it is very high, the returns on intelligence decrease below 1. What we really care about are the returns on intelligence in the near-human to slightly superhuman level likely to be reached by current AI techniques in the next 20 years or so, since these are the levels relevant to our success at alignment and the course of the future.
     

12 comments

Misaligned wrapper-mind optimization was a popular early worry because it's possible in principle and seems convergently useful for all sorts of things, a plausible selection outcome. It became less relevant recently because it seems more likely to happen some time after much more anthropomorphic, human-imitating language model AGIs have decisive influence over the world, so it's what they would need to worry about instead.

Something similar seems to be the case for recursive self-improvement. Language models already seem capable enough in principle, but insufficiently sane/agentic to act coherently in an autonomous manner. So any AI risk relevant self-improvement is not about increase in straightforwardly definable capability, it's about tuning models towards sanity. Algorithmic self-improvement is something that happens automatically after that point, and doesn't seem either plausible or necessary before.

Nitpick: the article seems to suggest that if RSI is possible, then strong takeoff is inevitable, and boxing would not work - but isn't boxing a potential approach for slowing down the RSI (e.g. each iteration of RSI is only executed once unboxed by a human - at least until/unless boxing fails), and therefore might still work?

Yes, this is the few-shot alignment world described in the post. I agree that in principle it would be fantastic if boxing could completely halt RSI, but with each iteration of RSI there is some probability that the box will fail, and we would then get unbounded RSI. This means we would effectively get a few 'shots' to align our boxed AGI before we die.

I have thoughts about RSI, but mostly unsubstantiated hunches. I am doing some research to try to test my hypotheses, but I don't wish to discuss my specific experiments for socio-hazard reasons. My hunch is that we are in a world where:

  1. RSI is rapid and easy above a certain threshold. Foom not fizzle.
  2. RSI is preventable by preemptive safety precautions and testing, like boxing.
  3. We are already in a situation of compute and data overhang, and algorithmic breakthroughs can unlock sudden jumps in capabilities.

Personally, I am broadly in agreement with most of these points and especially 2, which seems very understudied given its likely importance to our survival. Would love to chat privately about your thoughts and hunches if you'd be up for it. 

In the original report, there are a number of arguments which try to estimate or bound k using various considerations. However, these are almost all outside-view considerations based either on human evolution, or general algorithms from computer science. This is because at the time (2013), modern ML was in its infancy. However, nearly a decade later, and after stunning successes in ML, I suspect there is a lot more evidence around now that we can use to create a more bounded inside-view model of near-future AGI if it can be built with current ML techniques. Updating the RSI model and our estimates of k seems extremely important given the centrality of RSI to AI risk models as well as to alignment strategy.

I don't think it's particularly important, let alone "extremely" so. 

 

I'm sceptical that RSI is particularly relevant to the deep learning paradigm.

Strongly upvoted. I have multiple disagreements with this post (expressed in my other comments) but nonetheless found it very valuable as a contribution to the LW discussion about AI takeoff dynamics.

I was particularly delighted by the mention of the empirical evidence we have from scaling laws as they bear on marginal returns to cognitive capabilities from increased investment of computational resources.

If the returns k are k > 1 -- i.e. obtaining one unit of 'intelligence' makes you better able to increase your intelligence by more than one unit -- then there is an exponential growth in intelligence and an intelligence explosion. If returns are k < 1, then each unit of intelligence becomes harder and harder to obtain and there is a smooth, exponentially slow convergence to some maximum intelligence level -- an intelligence fizzle. If returns are exactly 1 then there is a steady linear growth in intelligence -- an intelligence combustion. Of course sustained returns of exactly 1 are vanishingly unlikely in practice, but it is possible that the average returns are 1 over some relevant range[1].

Another problem with this model is that it's highly likely that returns to intelligence vary across different cognitive domains. To a first approximation, the cognitive domains relevant to us are:

  • Very subhuman
  • Subhuman
  • Near human
  • Par human
  • Peak human
  • Superhuman
  • Strongly superhuman

 

I see no compelling reason to expect a priori that returns to intelligence behave smoothly across the aforementioned domains, instead of being described by different curves in different domains. At least I expect that will be true for some task/problem domain of interest.

Altogether, I'm very dissatisfied with "Intelligence Explosion Microeconomics", and it seems very spherical-cow-esque.

Yes, definitely. Pretty much the main regions of interest to us are from par-human up. Returns are almost definitely not consistent across scales. But what really matters for x-risk is whether they are positive or negative around current or near-future ML models -- i.e. can existing models, or AGIs we create in the next few decades, self-improve to superintelligence or not?