I want to try to go over some of the objections I imagine people are having. I don't think I fully understand the opposing viewpoint here, so hopefully it will be clarified for me in the comments.


1. LLMs are not truly generating novel ideas, they are just interpolating between existing ideas via memorized statistical patterns.
       - I think this is true to some degree. I don't think that this prevents those interpolations from being testable hypotheses which turn out to be useful. I think there's enough scientific literature and enough relevant open-source code available on the internet that remixing and integrating will be sufficient for the first few cycles of improvement. And after that, perhaps the resulting LLM++ will be better able to devise truly novel insights.
2. It is too hard to come up with any improvements on LLMs. Lots of people have tried and failed.
       - In Sam Altman's interview with Lex Fridman, Sam says that there have been lots of small improvements between GPT-3 and GPT-4. The improvements don't have to each be huge leaps in order to make a noticeable difference as they accumulate.

3. LLMs hallucinate too much. They are unreliable and much of the code they produce doesn't work.
       - My observation has been that when I ask GPT-4 for code relating to an easy and common problem, the sort of thing covered in a lot of online beginner coding tutorials, it does great: something like a 95% success rate. When I ask GPT-4 for something novel and challenging, about a specific edge case for which I've already tried and failed to find an answer by internet search, it does poorly: around a 5% success rate.
       - I expect that coming up with a set of code suggestions which result in enough successes to constitute a cycle of self-improvement will be even harder. I'm guessing something like a 0.1% success rate. I think this is sufficient for success if you have automated the process and can afford to run the process enough to generate and test millions of possibilities. This is a largely parallelizable process, so it doesn't necessarily take much wall clock time.

4. LLMs are currently insufficiently agentic to spontaneously do this.
       - I agree. I'm discussing a large well-funded effort by the owners of a SotA LLM who have access to the source code and weights. I expect the process itself to need lots of human development to get started.

5. Initial attempts from API-users putting LLMs into agentic wrappers (e.g. AutoGPT, BabyAGI) don't seem to have made any progress.
       - I would not expect those attempts to work, so their failures so far update me almost not at all against the possibility of RSI. The process I would expect to work would look more like a substantial engineering effort by the controllers of the LLM, with costs of millions of dollars. This effort would involve generating a wide variety of prompt templates that get applied in turn to every potentially relevant academic paper and/or open source code repository ever published. There would be prompts about summarizing and extracting potentially relevant information. Then the next step would be prompts about generating code. The wrapper system then checks whether the code compiles and seems to run without error. If there are errors, it feeds those back in and asks for debugging. If there are no errors, it tests the change by training small toy models on a wide variety of small test datasets to see the effects on small-scale training runs. If the effects seem at least a little promising, it tests at medium scale. If the effects seem promising there, it tests at large scale. All this testing of ideas that are 99.9% bad would require a lot of compute. The compute costs, plus the prompt engineering and the building of the automation process, are where I expect the millions of dollars of costs to come from. I would expect this upfront cost to amortise over time, since the prompt engineering and automation work needs to be done mostly just for the beginning of the process. The testing process could itself be improved over time to be less wasteful, catching more of the bad ideas early on before proceeding to the expensive medium- or large-scale testing.
       - The process would also have prompt engineering done to attempt to improve the process itself. Prompts asking meta-questions intended to improve or diversify the original prompts. Prompts pointed at improving the testing process to filter out more bad ideas during the cheap phase of testing. More such approaches I haven't thought of yet.
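As a rough illustration of the staged testing described in point 5, here is a minimal Python sketch of a small/medium/large funnel. Everything in it is a hypothetical stand-in: the `Candidate` class, the per-stage costs, the noise levels, and the thresholds are made-up numbers chosen only to show the shape of the idea, and `noisy_score` replaces what would really be an actual training run.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    ident: int
    is_real_improvement: bool  # hidden ground truth, only known in this toy simulation


def make_candidates(n, success_rate, rng):
    """Create n candidate code changes; about `success_rate` of them truly help."""
    return [Candidate(i, rng.random() < success_rate) for i in range(n)]


def noisy_score(candidate, noise, rng):
    """Stand-in for a training run: a true effect of 1.0 or 0.0 plus Gaussian noise."""
    true_effect = 1.0 if candidate.is_real_improvement else 0.0
    return true_effect + rng.gauss(0.0, noise)


def run_funnel(candidates, stages, rng):
    """Filter candidates through (name, cost_per_test, noise, threshold) stages.

    Returns the survivors of the last stage and the total compute spent.
    """
    survivors = list(candidates)
    total_cost = 0.0
    for name, cost_per_test, noise, threshold in stages:
        total_cost += cost_per_test * len(survivors)
        survivors = [c for c in survivors if noisy_score(c, noise, rng) > threshold]
        print(f"{name}: {len(survivors)} candidates pass")
    return survivors, total_cost


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up costs (arbitrary units) and noise levels: larger runs cost more but measure better.
    stages = [
        ("small", 1.0, 0.5, 0.5),
        ("medium", 100.0, 0.3, 0.5),
        ("large", 10_000.0, 0.1, 0.5),
    ]
    pool = make_candidates(100_000, 0.001, rng)
    survivors, cost = run_funnel(pool, stages, rng)
    naive_cost = 10_000.0 * len(pool)
    print(f"funnel cost {cost:,.0f} vs. testing everything at large scale {naive_cost:,.0f}")
```

In this toy setting the cheap stage pays for itself only if it removes most bad candidates before the expensive stages; how well that holds in practice depends on how noisy small-scale runs really are, which the sketch simply assumes.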

I'm not sure if this process would work yet with GPT-4, but it might. I have not seen evidence of a coordinated effort of the scale that would lead me to believe that this has been ruled out. My guess is that GPT-4 is very close to being able to function as part of the described system, but not quite good enough. That is why I am betting that GPT-5 will be good enough.

Please let me know in the comments what I'm missing or getting wrong about the beliefs of people who think that GPT-5 won't be sufficiently capable to be of use in this described process.

[Note: in my Manifold market I describe a scenario using less engineering effort and less compute, because I wanted to make it more of a balanced probability. In this post I describe an easier case, and I feel about 95% confident it would work for at least a few rounds of improvement. Would it taper off after that or accelerate? Unclear.]



I’m guessing something like a 0.1% success rate. I think this is sufficient for success if you have automated the process and can afford to run the process enough to generate and test millions of possibilities. This is a largely parallelizable process, so it doesn’t necessarily take much wall clock time.

How much compute would it take to test a million of these in parallel? I assume you're imagining something less compute-intensive than retraining a million GPTs from scratch, but I'm unclear how much less compute-intensive.

How much evidence does it need per instance to figure out whether the change is an improvement? With a 0.1% success rate, it doesn't take much imperfection in the evaluations for most apparent improvements to be false positives.
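The false-positive concern can be made concrete with a short base-rate calculation. This is only a sketch: the 0.1% base rate comes from the post, while the 90% sensitivity and the list of false-positive rates are assumptions picked for illustration.

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """Fraction of candidates flagged as improvements that really are improvements."""
    true_positives = base_rate * sensitivity
    false_positives = (1.0 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    base_rate = 0.001   # 0.1% of candidates are real improvements (figure from the post)
    sensitivity = 0.9   # assumed: the evaluation catches 90% of real improvements
    for fpr in (0.05, 0.01, 0.001, 0.0001):  # assumed evaluation error rates
        ppv = positive_predictive_value(base_rate, sensitivity, fpr)
        print(f"false-positive rate {fpr:.2%}: {ppv:.1%} of apparent improvements are real")
```

Under these assumptions, an evaluation that wrongly passes only 0.1% of bad candidates still yields fewer real improvements than false ones among everything it flags.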

We have no idea whether or not GPT-4 is self-improving. From the outside, it's possible that GPT-4 already produces training data that are used for fine-tuning GPT-4.

Given how little OpenAI has said about the steps that led to training GPT-4, I doubt they would tell us if they use GPT-5 in a way where it self-improves.

Your bet seems to assume that there will be public knowledge about whether or not GPT-5 is self-improving when it gets released. It's unclear to me why we would expect that to be public knowledge.

Yes, that is true, it could be difficult to resolve. One way it might resolve is if the improvement is sufficiently rapid and dramatic, such that it is obvious even from the outside. Another way is if they do decide to come out and say that they have done this. Or if ARC tests for this and declares it to be the case. Or perhaps some other as yet unforeseen way. But yeah, there's a chance the market has to resolve N/A for being undeterminable.

I predict yes, in the following narrow sense: I think a system backed by GPT-5 and granted access to the right APIs will be capable of fully automatically making changes to a repository of that system's code, and then deploying those changes to instantiate a new instance of itself.

I think GPT-4 is slightly too weak to do this, though I think it's not out of the question that people eventually get it working for somewhat trivial / restrictive cases.

GitHub is currently working on or already testing things like Copilot for Pull Requests and Copilot for your codebase. It's not that much of a stretch to imagine hooking these together into a fully automated pull-request authoring system.

An even harder task is then pointing such a system at the codebase(s) and infrastructure used to train the underlying transformer model, making improvements, and then kicking off a new training run. I think GPT-5 and maybe even GPT-4 (suitably connected) could make trivial improvements and minor bugfixes to such a repo. Fully autonomously supervising a multi-million dollar training run might be a stretch, but I'm definitely not confident that it is ruled out.

Both you and Peter have pointed out that one of the cruxes here is how much compute is needed for testing. I agree that if the process could come up with algorithmic improvements so weak and subtle that the advantage could only be clearly distinguished at the scale of a full multimillion dollar training run, then RSI would likely not take off. I expect though that the process I describe would find strong improvements, which would be obvious at a 100k param run, and continue showing clear advantage at 1 million, 10 million, 100 million, 1 billion, 10 billion, etc. In that case, the extrapolation becomes a safe bet, and the compute needed for parallel testing is much lower since you only need to test the small models to figure out what is worth scaling.

  1. Initial attempts from API-users putting LLMs into agentic wrappers (e.g. AutoGPT, BabyAGI) don't seem to have made any progress.
  • I would not expect those attempts to work, and their failures don't update me at all against the possibility of RSI.

If the failures of those things to work don't update you against RSI, then if they succeed that can't update you towards the possibility of RSI.

I personally would not be that surprised, even taking into account the failures of the first month or two, if someone manages to throw together something vaguely semi-functional in that direction, and if the vaguely semi-functional version can suggest improvements to itself that sometimes help. Does your model of the world exclude that possibility?

A glider flies, but without self-propulsion it doesn't go very far. Would seeing a glider land before traveling a long distance update you against the possibility of fixed-wing flight working? It might, but it needn't. Someone comes along and adds an engine and propeller and all of a sudden the thing can really fly. With the addition of one extra component you update all the way to "fixed-wing flight works".

It's the same thing here. Maybe these current systems are relatively good analogues of what will later be RSI-ing AGI and all they're missing right now is an engine and propeller. If someone comes along and adds a propeller and engine and gets them really flying in some basic way, then it's perfectly reasonable to update toward that possibility.

(Someone please correct me if my logic is wrong here.) 

If I had never seen a glider before, I would think there was a nonzero chance that it could travel a long distance without self-propulsion. So if someone runs the experiment of "see if you can travel a long distance with a fixed wing glider and no other innovations", I could either observe that it works, or observe that it doesn't.

If you can travel a long distance without propulsion, that obviously updates me very far in the direction of "fixed-wing flight works".

So by conservation of expected evidence, observing that a glider with no propulsion doesn't make it very far has to update me at least slightly in the direction of "fixed-wing flight does not work". Because otherwise I would expect to update in the direction of "fixed-wing flight works" no matter what observation I made.

Note that OP said "does not update me at all" not "does not update me very much" -- and the use of the language "update me" implies the strong "in a bayesian evidence sense" meaning of the words -- this is not a nit I would have picked if OP had said "I don't find the failures of autogpt and friends to self-improve to be at all convincing that RSI is impossible".


I agree with @awg. I think that the failure of a clumsy, incomplete attempt at RSI, which my model of the situation already expected to fail, is very little evidence that a strong attempt would fail. I think that seeing the clumsy, incomplete attempt succeed would be strong evidence that the problem is easier than I thought it was, and that RSI is definitely possible. It would be me realizing I made a mistake, but the mistake is

 "oh, my complicated ideas weren't even needed, RSI was even easier than I thought. Now I can be totally confident that RSI is near-term possible instead of just pretty sure."

not

 "Huh, didn't work exactly in the way I predicted, guess I know nothing at all about the world now, thus I can't say if RSI is possible."

Also, the statement that I'm making is that the current attempts at RSI via AutoGPT/BabyAGI are weak and incomplete. Obviously a bunch of people are putting in work to improve them. I don't know what those attempts will look like a year from now. I slightly suspect that there is secret RSI work going on in the major labs, and those highly-competent well-resourced well-coordinated teams will beat the enthusiast amateurs to the punch. I'm not highly confident in that prediction though.

If you had said "very little evidence" I would not have objected. But if there are several possible observations which update you towards RSI being plausible, and no observations that update you against RSI being plausible, something has gone wrong.

Oh, there are lots of observations which update me against RSI being plausible. I have a list in fact, of specific experiments I would like to see done which would convince me that RSI is much harder than I expect and not a near-term worry. I'm not going to discuss that list however, because I don't have a safe way to do so. So there absolutely are pieces of evidence which would sway me, they just aren't 'evidence that RSI is easier than I expect'. Such evidence would convince me that RSI is easy, not that it is impossible.

Hm, I think I'm still failing to communicate this clearly.

RSI might be practical, or it might not be practical. If it is practical, it might be trivial, or it might be non-trivial.

If, prior to AutoGPT and friends, you had assigned 10% to "RSI is trivial", and you make an observation of whether RSI is trivial, you should expect that

  • 10% of the time, you observe that RSI is trivial. You update to 100% to "RSI is trivial", 0% "RSI is practical but not trivial", 0% "RSI is impractical".
  • 90% of the time, you observe that RSI is not trivial. You update to 0% "RSI is trivial", 67% "RSI is practical but not trivial", 33% "RSI is impractical".
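The update in the two bullets above can be checked numerically. In this sketch the 10% prior on "trivial" is taken from the comment; the 60%/30% split of the remaining mass is not stated there and is inferred from the 67%/33% posterior. The function names are illustrative only.

```python
def condition(prior, consistent):
    """Return the posterior after ruling out every hypothesis not in `consistent`."""
    mass = sum(p for h, p in prior.items() if h in consistent)
    return {h: (p / mass if h in consistent else 0.0) for h, p in prior.items()}


def expected_posterior(prior, hypothesis):
    """Average the posteriors over both outcomes of observing whether `hypothesis` holds."""
    p = prior[hypothesis]
    if_true = condition(prior, {hypothesis})
    if_false = condition(prior, set(prior) - {hypothesis})
    return {h: p * if_true[h] + (1.0 - p) * if_false[h] for h in prior}


# Assumed prior: 10% is from the comment; 60%/30% is inferred from its 67%/33% posterior.
PRIOR = {"trivial": 0.10, "practical but not trivial": 0.60, "impractical": 0.30}

if __name__ == "__main__":
    after_failure = condition(PRIOR, set(PRIOR) - {"trivial"})
    for h, p in after_failure.items():
        print(f"{h}: {PRIOR[h]:.0%} -> {p:.0%}")
    practical_before = PRIOR["trivial"] + PRIOR["practical but not trivial"]
    practical_after = after_failure["trivial"] + after_failure["practical but not trivial"]
    print(f"P(RSI is practical): {practical_before:.0%} -> {practical_after:.0%}")
```

With these numbers, observing "not trivial" moves the probability that RSI is practical at all from 70% down to about 67%, and `expected_posterior` returns the prior unchanged, which is what conservation of expected evidence requires. A smaller prior on "trivial" makes the downward shift correspondingly smaller.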

By "does your model exclude the possibility of RSI-through-hacking-an-agent-together-out-of-LLMs", I mean the following: prior to someone first hacking together AutoGPT, you thought that there was less than a 10% chance that something like that would work to do the task of "make and test changes to its own architecture, and keep the ones that worked" well enough to be able to do that task better.

Assigning 10% seems like a lot in the context of this question, even for purposes of an example.

What if you had assigned less than 0.01% to "RSI is so trivial that the first kludged loop to GPT-4 by an external user without access to the code or weights would successfully self-improve"? It would have been at least that surprising to me if it had worked.

Failure to achieve it was not surprising at all, in the sense that any update I made from this would be completely swamped by the noise in such an estimate, and definitely not worth the cognitive effort to consciously carry it through to any future estimates of RSI plausibility in general.

What if you had assigned less than 0.01% to "RSI is so trivial that the first kludged loop to GPT-4 by an external user without access to the code or weights would successfully self-improve"?

I would think you were massively overconfident in that. I don't think you could make 10,000 predictions like that and only be wrong once (for a sense of intuition, that's like making one prediction per hour, 8 hours per day, 5 days a week for 5 years, and being wrong once).

Unless you mean "recursively self-improve all the way to godhood" instead of "recursively self-improve to the point where it would discover things as hard as the first improvement it found in like 10% as much time as it took originally".

As for why I did give at least 10% to "the dumbest possible approach will work to get meaningful improvement" -- humans spent many thousands of years not developing much technology at all, and then, a few thousand years ago, suddenly started doing agriculture and building cities and inventing tools. The difference between "humans who do agriculture" and "humans who don't" isn't pure genetics -- humans came to the Americas over 20,000 years ago, agriculture has only been around for about 10,000 of those 20,000 years, and yet there were fairly advanced agricultural civilizations in the Americas thousands of years ago. Which says to me that, for humans at least, most of our ability to do impressive things comes from our ability to accumulate a bunch of tricks that work over time, and communicate those tricks to others.

So if it turned out that "the core of effectiveness for a language model is to make a dumb wrapper script and the ability to invoke copies of itself with a different wrapper script, that's enough for it to close the gap between the capabilities of the base language model and the capabilities of something as smart as the base language model but as coherent as a human", I would have been slightly surprised, but not surprised enough that I could have made 10 predictions like that and only been wrong about one of them. Certainly not 100 or 10,000 predictions like that.

Edit: Keep in mind that the dumbest possible approach of "define a JSON file that describes the tool and ensure that that JSON file has a link to detailed API docs" does work for teaching GPT-4 how to use tools.

My estimate is based on the structure of the problem and the entity trying to solve it. I'm not treating it as some black-box instance of "the dumbest thing can work". I agree that the latter types of problem should be assigned more than 0.01%.

I already knew quite a lot about GPT-4's strengths and weaknesses, and about the problem domain it needs to operate in for self-improvement to take place. If I were a completely uneducated layman from 1900 (or even from 2000, probably) then a probability of 10% or more might be reasonable.

Thanks @JBlack, your comment describes my point of view as well.

Ah yes, I see what you mean. This seems like trivial semantic nitpicking to me, but I will go ahead and update the wording of the sentence to allow for the fact that I had some tiny amount of belief that a very crude AutoGPT approach would work and thus seeing it not immediately work means that my overall beliefs were infinitesimally altered by this.

Yeah. I had thought that you used the wording "don't update me at all" instead of "aren't at all convincing to me" because you meant something precise that was not captured by the fuzzier language. But on reflection it's probably just that language like "updating" is part of the vernacular here now.

Sorry, I had meant that to be a one-off side note, not a whole thing.

The bit I actually was surprised by was that you seem to think there was very little chance that the crude approach could have worked. In my model of the world, "the simplest thing that could possibly work" ends up working a substantial amount of the time. If your model of the world says the approach of "just piling more hacks and heuristics on top of AutoGPT-on-top-of-GPT4 will get it to the point where it can come up with additional helpful hacks and heuristics that further improve its capabilities" almost certainly won't work, that's a bold and interesting advance prediction in my book.

My guess at whether GPT-4 can self-improve at all with a lot of carefully engineered external systems and access to its own source code and weights is a great deal higher than my guess that AutoGPT would self-improve. The failure of AutoGPT says nothing[1] to me about that.

  1. ^ In the usual sense of not being anywhere near worth the effort to include it in any future credences.

LLMs are truly generating novel ideas, they are just interpolating between existing ideas via memorized statistical patterns.

Is this intended to be "LLMs are not truly generating novel ideas"?