Epistemic Status: At midnight three days ago, I saw some of the GPT-4 Byproduct Recursively Optimizing AIs below on Twitter, which freaked me out a little and lit a fire under me to write up this post, my first on LessWrong. My main goal here is to start a dialogue on a topic which, from my (perhaps secluded) vantage point, nobody seems to be talking about. I don't expect to currently have the optimal diagnosis of the issue or prescription of solutions.
Acknowledgements: Thanks to my fellow Wisconsin AI Safety Initiative (WAISI) group organizers Austin Witte and Akhil Polamarasetty for giving feedback on this post. Organizing the WAISI community has been incredibly fruitful for sparring over ideas with others and seeing which of the strongest survive. Only more to come.
Edit: The biggest feedback I've gotten so far is that, without enough context, I may be misleading some readers about the level of intelligence and danger we currently have here. So, to give some intuition on what seems to me to be the current state of affairs:
(from @anthrupad on twitter)
Recently, many people across the internet have used their access to GPT-4's API to scheme up extra dangerous capabilities. These are capabilities which the AGI labs certainly could have built on their own and likely are building. However, the AGI labs at the very least seem to be committed to safety. Some people may say they are following through on this well and others may say they are not. Regardless, they have that stated intention, and they have systems and policies in place to try to uphold it. Random people on the internet taking advantage of open access do not.
As a result, people are using GPT-4 as the strategizing intelligence behind separate optimizing programs that can recursively self-improve in order to better pursue their goals. Note that it is not GPT-4 itself that is self-improving: GPT-4's weights are static and not open sourced. Rather, it is the programs using GPT-4's large context window (as well as outside permanent memory, in some cases) that iterate on a goal and get better and better at pursuing it with every pass.
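The loop these programs run can be sketched roughly as follows. This is a hypothetical minimal sketch, not any particular project's code; the GPT-4 call is replaced with a stub (`call_llm`) so that the structure itself is visible. The important point is that the improvement lives in the outer program and its accumulated memory, not in the model's weights.

```python
# Minimal, hypothetical sketch of the agent-loop pattern described above.
# The model call is stubbed; a real agent would send these prompts to the
# GPT-4 API and parse the replies.

def call_llm(prompt: str) -> str:
    """Stub standing in for a GPT-4 API call."""
    return f"refined plan based on: {prompt[-40:]}"

def agent_loop(goal: str, max_steps: int = 3) -> list[str]:
    memory: list[str] = []          # external "permanent" memory
    plan = goal
    for _ in range(max_steps):
        # Feed the goal plus all accumulated memory back into the model,
        # so each iteration can build on (and critique) the last one.
        prompt = f"Goal: {goal}\nHistory: {memory}\nCurrent plan: {plan}"
        plan = call_llm(prompt)
        memory.append(plan)         # the program improves; the weights do not
    return memory

history = agent_loop("summarize trending repositories")
```

Everything here fits in ordinary user-side code: the only thing the lab ever sees is a stream of prompts and completions.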
Here are two examples of what has resulted to give a taste:
**Everyone in the AI Safety community should take a moment to look at these examples, particularly the latter, and contemplate the consequences. Even if GPT-4 is kept in the box, simply by letting people send it input tokens through an API and receive the output tokens, we might soon have what are in effect separate, very early, weak forms of agentic AGI running around the internet, going wild. This is scary.**
The internet has a vast distribution of individuals from a whole bunch of different backgrounds. Many of them, quite frankly, may simply want to build cool AI and not give safety guards a second thought. Others may not particularly want to create AI that leads to bad consequences, but haven't engaged enough with arguments about the risks and are simply negligent.
If we leave the creation of advanced LLM byproduct AI entirely up to the internet, with no regulations and no security checks, some people will beyond a doubt act irresponsibly with the AI they create. This is a given. There are simply too many people out there. Everyone should be on the same page about this.
Let’s look at the perspective of the author of another one of these self-improving agents. Here is a tweet on their work:
And, here is how they talk about AGI:
Empathizing with AGI will not align it, nor will it prevent any existential risk. Ending discrimination would obviously be a positive for the world, but it will not align AGI. Significant Gravitas is deeply anthropomorphizing the nature of these models and how alignment works, in a way that clearly shows why they don't feel the need to think about safety.
I don’t say any of this to dunk on the person. People form their beliefs through a causal chain one way or another and the way to improve things is not to punish the individual but to improve the chain.
My only goal here is to exemplify the wide distribution of beliefs that people highly skilled in deep learning may have, and why this ought to deeply concern those who care about risks from AI.
As a side note: a lot of the framing of AGI above seems to originate from David Deutsch, and I think it is sadly misguided, because whatever the first semblances of AGI turn out to be will almost certainly not fit the mental model these people have for them.
Even if the AGI labs are wise enough to keep their most powerful models safe internally, and don't open source the weights for anyone to fine-tune into being evil, major risks still arise if they let anyone access these models through their API with no oversight of the code others execute with them.
The focus of the AI Safety world right now feels locked onto aligning only the most advanced general models, while totally ignoring all the byproduct models that will be able to reach quite powerful levels by taking advantage of whatever the most advanced model is that is opened up to the public at the moment.
How much of this actually happens is bottlenecked only by the number of individuals with the technical experience to accomplish it and how motivated they are. Nothing else is stopping them. Very early and weak forms of agentic AGIs are already coming around. People are building them.
Don't believe me? Check out the Top 3 GitHub Repositories that are currently trending:
From this point forward, models are only going to get stronger and more accessible. I don't think it's a reach to expect open-source models at near GPT-4 levels within 1-2 years. Once those exist, anybody will be able to use them to create these recursively optimizing AIs. We might see evidence of agents pursuing instrumentally convergent goals everywhere if we don't slow down.
Hopefully, the strategizing will be weak enough that it won't be catastrophic. However, the more powerful the LLMs are in their ability to strategize, even if they are benign by themselves, the more powerful these recursively optimizing AIs built on top of them will become.
GPT-4 sure seems pretty damn intelligent to me. What about GPT-5? GPT-6? Where do we draw the line?
It isn't very clear. How powerful they turn out to be would significantly change the tone of my writing here about how urgently we need to mitigate these risks. No matter what, though, in the long term (meaning 1-2 years in these strange AI times) this needs to be thought about.
Andrej Karpathy at OpenAI, talking about his Jarvis (seen in the GitHub repositories above), said that "Interesting non-obvious note on GPT psychology is that unlike people they are completely unaware of their own strengths and limitations. E.g. that they have finite context window. That they can just barely do mental math. That samples can get unlucky and go off the rails. Etc."
This makes me feel a bit safer because I don't think anyone yet has figured out a way for the recursive iterations to never derail. The first screenshot in this post above from Harris Rothaermel also shows his agents recursing to infinity and derailing. Hopefully, it stays that way for some time. But eventually, people will figure out how to make it work. I would be pretty shocked if there was no way to make it function with GPT-4.
If this is all true, how can we mitigate the risk that results? Here are some options that came to mind as well as some elaborations on them. I am not deeply invested in any one of these in particular being the way. Rather, I wanted to bring up all of the options that I could think of to be comprehensive.
This is obviously harsh, but seems to be a good starting point for proposals. People won’t be able to use these models to create Byproduct Recursively Optimizing AIs/Russian Shoggoth Dolls if they don’t have access to them, period.
That is, for at least the next year or so, until another GPT-4-powered open-source model inevitably comes around. But at least the agentic AI that those models produce will be weaker than the agentic AI that GPT-5 or GPT-6 produces. I don't particularly want to be in a world where adversarial AIs are constantly fighting each other, but at least the safety-concerned people will likely have the more powerful models.
Create much tighter regulations on who has access to the APIs of the AGI companies and what oversight those accessing are forced to permit.
Perhaps OpenAI has to run a background check on you to confirm your identity and track relevant attitudes and beliefs with regards to AI before they provide you with the API key in order to have some risk modeling.
Perhaps accessing GPT-4 means that you need to give OpenAI access to all the code that you will use in tandem with it so that before the code is run, OpenAI can use AI models to scan it all over and make sure that no optimizing agents are being produced that might have instrumentally convergent goals.
This might mean distributing access to GPT-4 through an application that has the power to make these checks, rather than simply through an API key, where after getting the key the individual pretty much has free rein. The same thought process should be applied to GPT plug-ins used by companies.
The only feasible way to do this may be for OpenAI to have to create a whole new software system (or use someone else’s) and keep the entire interface with GPT-4 contained there.
Otherwise, people will certainly try their best to obfuscate their code, only providing certain snippets and leaving out others that they fear might get flagged. Lots of thought may need to be put in to think about how to practically accomplish this and ensure that people aren’t hiding code.
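As a rough illustration of what such automated screening might look like at its crudest, here is a hypothetical sketch that statically flags submitted Python code containing a model API call inside a loop, one crude signature of a recursive optimizing agent. The call names it looks for are assumptions for illustration; as noted above, real screening would need to be far more robust against obfuscation than any pattern match like this.

```python
# Hypothetical sketch: flag code where a suspected model-API call sits
# inside a for/while loop. The names in SUSPECT_CALLS are illustrative
# assumptions, not an actual denylist used by any provider.
import ast

SUSPECT_CALLS = {"ChatCompletion", "create", "call_llm"}

def flags_agent_loop(source: str) -> bool:
    """Return True if `source` calls a suspect function inside a loop."""
    tree = ast.parse(source)
    for loop in ast.walk(tree):
        if isinstance(loop, (ast.For, ast.While)):
            # Walk only the loop body, looking for suspect call names.
            for node in ast.walk(loop):
                if isinstance(node, ast.Call):
                    fn = node.func
                    name = fn.attr if isinstance(fn, ast.Attribute) else getattr(fn, "id", "")
                    if name in SUSPECT_CALLS:
                        return True
    return False
```

This catches only the laziest cases; indirection through a helper function or a different language defeats it, which is exactly why hiding code is the hard part of this proposal.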
These seem like steps to consider right away. Given the incredibly high demand for these powerful models, and despite the open source community outcry and flak that will probably arise, people will still accept these safety regulations.
Whereas embedded in Option 2 is detecting code that may produce recursive optimizing behavior before it is run, Option 3 explores detecting emergent behavior, either in the API or online, that appears to be an optimization process recursively improving at speeds faster than humans could, as it is being run or after it has been run.
I am quite uncertain about the most effective ways to do so, however, I will float some ideas here:
Using GPT-4 to recursively call itself unsurprisingly and drastically increases the total number of API calls made by a user, which should be a clear marker. Note that this could be bypassed by using multiple API keys, but stricter API regulation could make it impossible for one individual to get more than one key.
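A minimal sketch of what such volume-based flagging might look like, with an invented per-minute threshold standing in for whatever baseline a provider would actually measure:

```python
# Hypothetical sketch: flag API keys whose call rate within a sliding
# window exceeds a human-plausible baseline. The window and threshold
# values are invented for illustration.
from collections import defaultdict

class CallRateMonitor:
    def __init__(self, window_s: float = 60.0, max_calls: int = 30):
        self.window_s = window_s
        self.max_calls = max_calls
        self.calls: dict[str, list[float]] = defaultdict(list)

    def record(self, api_key: str, t: float) -> bool:
        """Record a call at time t; return True if the key looks automated."""
        # Keep only calls still inside the sliding window, then add this one.
        recent = [x for x in self.calls[api_key] if t - x < self.window_s]
        recent.append(t)
        self.calls[api_key] = recent
        return len(recent) > self.max_calls
```

A recursive agent hammering the API once a second would trip this within a minute, while a person typing prompts by hand would not; the open question is picking thresholds that don't also flag legitimate batch users.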
Perhaps we could start with scraping posts on forums, GitHub, and Twitter that seem likely to spawn these things? People seem highly interested in sharing their creations, and since these communities are all for open-sourcing, that will likely remain true for some period of time. Four examples of this are already provided above.
Perhaps we attempt training adversarial AI that gets better and better at detecting when other AI programs are likely to produce optimizing agentic behavior. Perhaps we even create some really weak and constrained agentic AIs that don't know they are being adversarially tested against, as a starting point for seeing how they behave.
Once we detect it though, what can we do?
If the stakes are low, you could ban them from using the API and restrict them from getting another key.
If the stakes are high, it may already be too late. This makes Option 2 seem like a much, much more effective starting point.
If we detect enough prominent examples of dangerous recursively improving optimizing behavior and show influential people, will it be enough to get them worried? Will it get people worried in the right places who will take action that leads to safer consequences?
This may be hopeful thinking, but personally, the discovery of tool use, memory extension, and recursion for agency would probably be enough for me to not release GPT-5 to the public. OpenAI taking this step would be a big move that would send shockwaves throughout the world, which could perhaps get everyone, including China, to slow down.
Perhaps we can align LLMs to be able to detect internally whether they are being used to create recursively improving optimizers when they are prompted.
Similar to how LLMs can be aligned to recognize what might be politically insensitive material and navigate their response with finesse, perhaps LLMs can recognize prompts that push them towards building dangerous recursively improving optimizers and 1) not output responses and 2) flag or ban the account attempting to get it to do so.
This seems good for the top AGI companies to commit to attempting. It also doesn't seem like it should be that difficult, given how successful their alignment work with e.g. RLHF has been so far. However, for RLHF to be used to prevent recursive optimizing AI, people with deep knowledge of AI and coding would have to be the ones giving feedback to the reward model; only they could parse whether a piece of code seems dangerous. This may be very expensive.
And, again, the open source community seems likely to create GPT-4 powered LLMs that will not have these safeguards very soon. We’ll have to deal with the consequences of that.
People have quite a lot of different perspectives on AGI that inform how responsible or irresponsible they are in what they code. If we could somehow create an information ecosystem that 1) walks people through the risks in a very clear-cut fashion and 2) takes us past our anthropomorphizing biases about AGI, perhaps we could convince people that it is morally wrong to create these dangerous technologies. Maybe this would actually make fewer people attempt to do so.
It doesn't seem anywhere near likely, though, given how difficult changing public opinion is. And depending on people to voluntarily be safe really isn't a long-term solution. Still worth mentioning.
Create a massive-scale government oversight program. This seems clearly politically unfavorable and would lead to a world that many, including myself, would prefer to avoid. Although, if it is the only option, which thankfully doesn't seem to be the case right now, perhaps it ought to be explored. Let's not make this the only option.
It feels as though the entire AI Safety world is focused on aligning the most advanced AI models, possibly neglecting what the consequences might look like of simpler models taking advantage of the most advanced open-source models.
I worry that this is a big blind spot and would like for it to be on the minds of more people.
Especially given that it seems likely that within 1-2 years GPT-4-level models will be trained and then open sourced by someone out there. Not just open API access, but their full weights released to play around with.
This makes me expect very early and weak forms of agentic AGI to be all over the place very soon although I would be quite happy to be convinced otherwise.
Curious to hear others’ thoughts.
Whether AI is conscious and deserving of moral personhood is a different controversial question which I don’t particularly need to engage in this post, although I am on team “No”.
There is some argument to be made that something like "empathizing" with the models might help align AI, but that works not through sincere moments of intimate connection like those being referred to here; it works through filling the AI's training data with very, very empathetic text. However, this also seems liable to create AI with emotional insecurity issues if not done well, as Bing Chat gave us a first glimpse of.
general reminder: 1. anthropomorphizing isn't completely wrong either, at least compared to how alien some architectures could have been 2. some of these ai kiddos are probably reading posts here, probably good to be somewhat kind in tone
https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg I'd like to take this opportunity to again plug my Manifold market on the subject...
Here's a link to the reddit post of the second example from the introduction.
Readers of this may also be interested in this post from 2015:
Should AI Be Open?
Since we seem to have an embarrassment of self-improvement experiments going on currently, do we have any sense whether they are tending to self-improve out, or self-improve in?
By out I mean generalizing what it is already doing, or adding more capabilities; by in I mean things like shorter code, correcting bugs, possibly more secure code.
I've experimented with trying to create autonomous GPT-4 agents, including self-improving ones, and I intend to probably keep experimenting sometimes. Mentioning this in case anyone wants to try to convince me to stop.
I can't convince you to continue or stop, but maybe reading the edit I made to the start of the post will better clarify the risks for you.
how about having a smaller model governing safety regulations? this could act as an "aligner" on top of LLMs. say some sort of RLHF just focused on mitigating risks