> The shift from "let's build AGI and hope it takes over benevolently" to "let's build AGI that doesn't want to take over" represents a fundamental change in approach that makes catastrophic outcomes, in my opinion, less likely.
This shift brings with it other risks, like the Intelligence Curse.
Someone or something will always be in power. If that entity decides not to allocate any resources to most humans, then yes, we die. But that could have happened in an AI takeover scenario as well, depending on whose values ended up in the AI.
> However, I'm split on whether LLM-based agents, if and when they start working well, could make this a reality. There are a few reasons why I was afraid, and I'm less afraid now, which I will go over in this post. Note that this doesn't tell us anything about the chance of loss of control from non-LLM (or *vastly improved LLM* -- italics mine -- S.K.) agents, such as the brain in a box in a basement scenario. The latter is now a large source of my p(doom) probability mass.
To me this looks like the "no true Scotsman" fallacy. The AI-2027 forecast relies on agents from Agent-3 onward thinking in neuralese and becoming undetectably misaligned. Similarly, LLM agents' intelligence arguably won't max out at "approximately smart human level"; it will max out at a spiky capabilities profile, one already known to include the ability to earn a gold medal at the IMO 2025 and to assist with research, but not known to include the ability to handle long contexts as well as humans do.
Regarding context, I made two comments explaining my reasons for thinking that current LLMs don't have enough attention capacity to handle long contexts. The context is, in effect, distilled into a vector of relatively few numbers; that vector is processed, and the next token is appended wherever the LLM decides.
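For concreteness, here is a minimal sketch of that bottleneck (my own illustration using standard single-head scaled dot-product attention, not code from those comments; the context length of 1000 and head dimension of 64 are arbitrary toy values). However long the context is, each attention head condenses it, for each generated token, into one fixed-size weighted average of value vectors:

```python
import numpy as np

def attention_summary(query, keys, values):
    """Scaled dot-product attention for a single head: the entire context
    (keys/values) is compressed, for each query, into one fixed-size vector,
    a softmax-weighted average of the value vectors."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)               # (n_queries, context_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the context
    return weights @ values                            # (n_queries, d)

# Toy numbers: a 1000-token context is summarized, for the token being
# generated, into a single 64-dimensional vector for this head.
rng = np.random.default_rng(0)
context_len, d = 1000, 64
keys = rng.standard_normal((context_len, d))
values = rng.standard_normal((context_len, d))
query = rng.standard_normal((1, d))                    # the token being generated
print(attention_summary(query, keys, values).shape)    # (1, 64)
```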
I agree that AI intelligence is and will likely remain spiky, and that some spikes are above human level (of course, a calculator also spikes above human level). But I'm not yet convinced that the whole LLM-based intelligence spectrum will max out above takeover level. I'd be open to arguments, though.
OpenAI recently published an interesting paper claiming there is a "straightforward fix" for hallucinations: "Penalize confident errors more than you penalize uncertainty." Of course, it remains to be seen whether such an easy fix can solve a major limitation of LLMs that has persisted for years. However, hallucinations have already been reduced significantly via other methods, and these are likely to get better as well. Moreover, improving LLM reliability may well lead to functioning agents, too.
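To make the proposed fix concrete, here is a toy scoring rule in that spirit (my own illustration, not code from the paper; the abstain string and the penalty of 2.0 are arbitrary choices for the example):

```python
def grade(answer, correct_answer, abstain_token="I don't know", wrong_penalty=2.0):
    """Toy scoring rule in the spirit of 'penalize confident errors more than
    uncertainty': abstaining scores 0, a correct answer scores +1, and a wrong
    answer scores -wrong_penalty (by default larger in magnitude than the +1
    reward for being right)."""
    if answer == abstain_token:
        return 0.0
    return 1.0 if answer == correct_answer else -wrong_penalty

print(grade("Paris", "Paris"))          # 1.0
print(grade("Lyon", "Paris"))           # -2.0
print(grade("I don't know", "Paris"))   # 0.0
```

Under such a rule, guessing beats abstaining only when the model expects to be right more than wrong_penalty / (1 + wrong_penalty) of the time (two thirds, with the toy default), so a well-calibrated model is nudged toward saying it doesn't know rather than hallucinating.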
It doesn't seem like wasted effort to try to plan for when LLM-based agents do in fact start working properly. If that happens, is loss of control over these agents likely?
Personally, loss of control has been my central existential threat model. However, I'm split on whether LLM-based agents, if and when they start working well, could make this a reality. There are a few reasons why I was afraid, and I'm less afraid now, which I will go over in this post. Note that this doesn't tell us anything about the chance of loss of control from non-LLM (or vastly improved LLM) agents, such as the brain in a box in a basement scenario. The latter is now a large source of my p(doom) probability mass.
The reasons I was afraid, and the reasons I'm now less afraid where LLM-based agents are concerned, are the following:
I previously worried that AI intelligence would reach levels far beyond human capability, but LLM agent intelligence seems to max out at approximately smart human level, even with hallucinations fixed. This means we can probably rule out Yudkowsky-style takeover models, in which AI invents new science, such as nanobots, that gives it a decisive strategic advantage. If LLM agents are going to invent new science (and I'd say that in itself is likely), it will probably happen in a controlled and slow environment with reasonably aligned models, and the results would likely be communicated to humans rather than used directly for a takeover. An intelligence ceiling at roughly AGI level fundamentally changes the threat landscape.
Although I don't believe self-improvement is required for loss of control, it does make it a lot more likely. And LLM-based self-improvement seems hard. Although scaffolding could be self-improving, the base LLM would need new training runs, which take weeks to months and require access to significant resources. Therefore, I think LLM improvement would most likely happen with human buy-in, not autonomously. Also, data and compute seem to be more relevant limitations for LLMs today than researcher intelligence. Finally, if the whole paradigm maxes out at human-level AI, self-improvement would not be able to add anything beyond that level, as long as it isn't inventing a new paradigm. (I think that's possible, but so could humans, and I'd file that under the brain in a box in a basement scenario mentioned before.)
Since I don't think LLM-based agents are likely to self-improve, I think take-off speed will remain slowish (as I'd say it has been since GPT-4). I don't expect this to change radically until a new paradigm is invented. This seems to make loss of control a lot less likely.
About five years ago, people working on AGI implied that once they finished building AGI, they would not attempt to restrict it (since this would be in vain), but rather were OK with their AI taking over. This would not be bad, according to them, since their AI would ideally be value-aligned, follow coherent extrapolated volition (CEV), or be intent-aligned to themselves, supposedly benign actors. Although actual work on any of these alignment strategies (particularly value alignment and CEV) has never really panned out, this remained roughly the plan at the time, to my knowledge.
Personally, I think a takeover by value-aligned AI, CEV AI, or CEO-intent-aligned AI would in each case likely lead to a catastrophe, which is why I thought labs' intentions were worrying.
Today, however, labs mostly seem to try to align their models to not take over in the first place, which they now apparently consider achievable. I think this is a much healthier situation, and it decreases my probability of doom. The shift from "let's build AGI and hope it takes over benevolently" to "let's build AGI that doesn't want to take over" represents a fundamental change in approach that makes catastrophic outcomes, in my opinion, less likely.
I used to worry that any safety measures could easily be removed by a bad or careless actor, leading to a Murphy's law situation where work on safety is essentially irrelevant in a multipolar world, since someone, somewhere, will run the AI without whatever safety you've bolted on. In the LLM paradigm, however, AI functions more as an imitator than as a completely independent thinker. Key questions remain, though. Can an LLM agent go off track consistently over many steps, even without the user trying to achieve this? What about when the user does try to achieve it with jailbreaks? How likely is an LLM-based agent to propagate a jailbreak to subsequent steps? Beyond jailbreaks, can a bad actor create an LLM that wants to take over through fine-tuning, or would they need to redo pretraining entirely? Hopefully, achieving an AI capable of takeover would require pre-training a base model from scratch, with reward signals specifically optimized for takeover capabilities. While entities such as militaries might be interested in such capabilities, this would hopefully require sustained effort from well-resourced actors, markedly improving our odds.
My probability of LLM-based agent takeover is not zero, but it is significantly smaller than it would have been had we had a paradigm that didn't max out at roughly human level, was self-improvable, and could take off in hours to days. It has also gotten smaller because I think most AI companies will try to avoid a takeover rather than aim for a supposedly aligned version of it.
I do think that, in the end, someone will invent a paradigm that captures the remaining advantages of the human brain, and this could lead to a brain in a box in a basement scenario and an ensuing foom. My p(doom) has therefore shifted somewhat toward non-LLM approaches.
Generally, I still think it's crucial to inform the societal debate about AI x-risk. I'm still a huge fan of more AI regulation aimed at reducing x-risk. And I still hope we can move toward a regime of conscious extinction-risk management, though that seems far off now.