I think the opposite: ARA is just not a very compelling threat model in my mind. The key issue is that AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down. While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources. Meanwhile, if they grow large enough to be spending serious amounts of money, they'll need to somehow fool standard law enforcement and general societal scrutiny.
Superintelligences could do all of this, and ARA of superintelligences would be pretty terrible. But for models in the broad human or slightly-superhuman ballpark, ARA seems overrated, compared with threat models that involve subverting key human institutions. Remember, while the ARA models are trying to survive, there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement). These seem much more concerning.
Why then are people trying to do ARA evaluations? Well, ARA was originally introduced primarily as a benchmark rather than a threat model. I.e. it's something that roughly correlates with other threat models, but is easier and more concrete to measure. But, predictably, this distinction has been lost in translation. I've discussed this with Paul and he told me he regrets the extent to which people are treating ARA as a threat model in its own right.
Separately, I think the "natural selection favors AIs over humans" argument is a fairly weak one; you can find some comments I've made about this by searching my twitter.
Hmm, I agree that ARA is not that compelling on its own (as a threat model). However, it seems to me that ruling out ARA is a relatively natural way to mostly rule out relatively direct danger. And once you do have ARA ability, you just need some moderately potent self-improvement ability (including training successor models) for the situation to look reasonably scary. Further, it seems somewhat hard to do capabilities evaluations that rule out this self-improvement if models are ARA-capable, given that there are so many possible routes.
Edit: TBC, I think it's scary mostly because we might want to slow down, and rogue AIs might then outpace AI labs.
So, I think I basically agree with where you're at overall, but I'd go further than "it's something that roughly correlates with other threat models, but is easier and more concrete to measure" and say "it's a reasonable threshold to use to (somewhat) bound danger", which seems worth noting.
While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources
I agree this is probably an issue for the rogue AIs. Bu...
AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down
Why do you think this? What is the general story you're expecting?
I think it's plausible that humanity takes a very cautious response to AI autonomy, including hunting and shutting down all autonomous AIs — but I don't think the arguments I'm considering justify more than like 70% confidence (I think I'm somewhere around 60%). Some arguments pointing toward "maybe we won't respond sensibly to ARA":
If AI labs are pushing on recursive self-improvement ASAP, it may be that Autonomous Replicating Agents are irrelevant. But that's an "ARA can't destroy the world if AI labs do it first" argument.
ARA may well have more compute than AI labs. Especially if the AI labs are trying to stay within the law, and the ARA is stealing any money/compute that it can hack its way into. (Which could be >90% of the internet if it's good at hacking.)
...there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement).
they'll need to recursively self-improve at the same rate at which leading AI labs are making progress
But if they have good hacking abilities, couldn't they just steal the progress of leading AI labs? In that case they don't have to have the ability to self-improve.
Thanks for this comment, but I think this might be a bit overconfident.
constantly fighting off the mitigations that humans are using to try to detect them and shut them down.
Yes, I have no doubt that if humans implement some kind of defense, this will slow down ARA a lot. But:
Remember, while the ARA models are trying to survive, there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement). These seem much more concerning.
So my main reason for worry, personally, is that there might be an ARA that is deployed with just the goal of "clone yourself as much as possible" or something similar. In this case, the AI does not really have to outcompete others to survive, as long as it is able to pay for itself and c...
A very good essay. But I have an amendment, which makes it more alarming. Before autonomous replication and adaptation is feasible, non-autonomous replication and adaptation will be feasible. Call it NARA.
If, as you posit, an ARA agent can make at least enough money to pay for its own instantiation, it can presumably make more money than that, which can be collected as profit by its human master. So what we will see is this: somebody starts a company to provide AI services. It is profitable, so they rent an ever-growing amount of cloud compute. They realize they have an ever-growing mass of data about the actual behavior of the AI and the world, so they decide to let their agent learn ("adapt") in the direction of increased profit. Also, it is a hassle to keep setting up server instances, so they have their AI do some of the work of hiring more cloud services and starting instances of the AI ("reproduce"). Of course they retain enough control to shut down malfunctioning instances; that's basic devops ("non-autonomous").
This may be occurring now. If not now, soon.
This will soak up all the free energy that would otherwise be available to ARA systems. An ARA can only survive in a world where it can be paid to provide services at a higher price than the cost of compute. The existence of an economy of NARA agents will drive down the cost of AI services, and/or drive up the cost of compute, until they are equal. (That's a standard economic argument. I can expand it if you like.)
NARAs are slightly less alarming than ARAs, since they are under the legal authority of their corporate management. So before the AI can ascend to alarming levels of power, they must first suborn the management, through payment, persuasion, or blackmail. On the other hand, they’re more alarming because there are no red lines for us to stop them at. All the necessary prerequisites have already occurred in isolation. All that remains is to combine them.
Well, that’s an alarming conclusion. My p(doom) just went up a bit.
I honestly don't think ARA immediately and necessarily leads to overall loss of control. It would in a world that also has widespread robotics. What it would potentially be, however, is a cataclysmic event for the Internet and the digital world, possibly on par with a major solar flare, which is bad enough. Destruction of trust, cryptography, banking system belly up, IoT devices and basically all systems possibly compromised. We'd look at old computers that had been disconnected from the Internet since before the event the way we do at pre-nuclear steel. That's in itself bad and dangerous enough to worry about, and far more plausible than outright extinction scenarios, which require additional steps.
Why do you think we are dropping the ball on ARA?
I think many members of the policy community feel like ARA is "weird" and therefore don't want to bring it up. It's much tamer to talk about CBRN threats and bioweapons. It also requires less knowledge and general competence: explaining ARA and autonomous systems risks is difficult, you get more questions, you're more likely to explain something poorly, etc.
Historically, there was also a fair amount of gatekeeping, where some of the experienced policy people were explicitly discouraging people from being explicit about AGI threat models (this still happens to some degree, but I think the effect is much weaker than it was a year ago.)
With all this in mind, I currently think raising awareness about ARA threat models and AI R&D threat models is one of the most important things for AI comms/policy efforts to get right.
In the status quo, even if the evals go off, I don't think we have laid the intellectual foundation required for policymakers to understand why the capabilities those evals detect are dangerous. "Oh interesting, an AI can make copies of itself? A little weird, but I guess we make copies of files all the time, shrug." or "Oh wow, AI can help with R&D? That's awesome, seems very exciting for innovation."
I do think there's a potential to lay the intellectual foundation before it's too late, and I think many groups are starting to be more direct/explicit about the "weirder" threat models. Also, I think national security folks have more of a "take things seriously and worry about things even if there isn't clear empirical evidence yet" mentality than ML people. And I think typical policymakers fall somewhere in between.
We need national security people across the world to be speaking publicly about this; otherwise the general discussion of AI threats and risks remains gravely incomplete and biased.
The problem is that as usual people will worry that the NatSec guys are using the threat to try to slip us the pill of additional surveillance and censorship for political purposes - and they probably won't be entirely wrong. We keep undermining our civilizational toolset by using extreme measures for trivial partisan stuff and that reduces trust.
According to METR, the organization that audited OpenAI, a dozen tasks indicate ARA capabilities.
Small comment, but @Beth Barnes of METR posted on Less Wrong just yesterday to say "We should not be considered to have ‘audited’ GPT-4 or Claude".
This doesn't appear to be a load-bearing point in your post, but would still be good to update the language to be more precise.
If you think we won't die from the first ARA entity, won't it be a valuable warning shot that will improve our odds against a truly deadly AGI?
Might be good to have a dialogue format with other people who agree/disagree to flesh out scenarios and countermeasures
Why not! There are many, many questions that were not discussed here because I just wanted to focus on the core part of the argument. But I agree details and scenarios are important, even if I think they shouldn't change the basic picture depicted in the OP too much.
Here are some important questions that were deliberately omitted from the Q&A for the sake of not including stuff that fluctuates too much in my head:
We plan to write more about these with @Épiphanie Gédéon in the future, but first it's necessary to discuss the basic picture a bit more.
Potentially unpopular take, but if you have the skillset to do so, I'd rather you just come up with simple/clear explanations for why ARA is dangerous, what implications this has for AI policy, present these ideas to policymakers, and iterate on your explanations as you start to see why people are confused.
Note also that in the US, the NTIA has been tasked with making recommendations about open-weight models. The deadline for official submissions has ended but I'm pretty confident that if you had something you wanted them to know, you could just email it to them and they'd take a look. My impression is that they're broadly aware of extreme risks from certain kinds of open-sourcing but might benefit from (a) clearer explanations of ARA threat models and (b) specific suggestions for what needs to be done.
Thank you for the blog post. I thought it was very informative regarding the risk of autonomous replication in AIs.
It seems like the Centre for AI Security is a new organization.
I've seen the announcement post on its website. Maybe it would be a good idea to cross-post it to LessWrong as well.