Introduction

Many technically-minded people are insufficiently worried about AI safety. It is important that we get better at persuading them. Here, I highlight one particular technical issue that can cause people to dismiss safety concerns. Through a dialogue, I then present a thought experiment that I have found to be effective in changing people's minds on the issue. 

Although the ideas involved may be obvious to members of this forum, they are not obvious to everyone. The aim of this brief post is to offer a pedagogical device that will help the reader be more persuasive when talking to AI safety skeptics. 

TL;DR: The technical issue is "Why would a model power-seek upon deployment, if it never had the opportunity to do so during training?". The thought experiment describes a simple, concrete and plausible story of how this might occur: we deploy a combined system formed of a paperclip-maximiser trained in a "safe" environment, coupled to a powerful LLM.

Dialogue

Alice (AI safety researcher):  AIs seek to maximise reward. Humans might interfere with an AI's operations, decreasing its expected reward. To prevent such interference, it's plausible that a highly intelligent AI will try to seize power from humans.

Bob (AI safety skeptic):  I think I buy your argument when it's applied to RL agents that are let loose in the real world, continuing to get gradient updates. These models could start trying out some very small power-seeking actions, which gradually get reinforced as they are found to increase reward, eventually leading to dangerous levels of power-seeking.

But our current most powerful models are trained on simple tasks in environments that are nowhere near rich enough to allow for any power-seeking actions at all. Why would such a model power-seek given that gradients never encouraged it to do so during training?

Alice:  Well, there's this thing called goal misgeneralisation --

Bob:  I know about goal misgeneralisation. I'm not denying that models misgeneralise in weird ways. I'm asking you why models should misgeneralise in the extremely specific weird way that you mentioned: power-seeking even when there was no opportunity to do so during training.

What I want is a description of a hypothetical AI trained in a simple environment with no opportunity for power-seeking, along with a plausible story for why it would power-seek upon deployment.

Alice:  Ok, here's a thought experiment. Consider a system composed of two parts:

1) An extremely powerful LLM trained on next-token prediction.

2) A moderately intelligent AI that I'll call the "agent", which is able to communicate with the LLM, and which is trained to operate a simulated factory with the objective of manufacturing as many paperclips as possible. 

The combined system never had the opportunity to power-seek during its training: the LLM just looked at text and did next-token prediction, whilst the agent was trained in a simulated factory which we'll assume contained no humans, only paperclip-making equipment.
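Schematically, you can picture the coupling like this. The sketch below is a toy illustration only -- the class names, methods and uncertainty heuristic are all invented -- but it shows the one structural point that matters: "ask the LLM and follow its advice" is simply another action available to the agent.

```python
# Toy sketch of the combined system. All names here are invented for
# illustration; the only structural point is that deferring to the LLM
# is just one more action in the agent's repertoire.

class OracleLLM:
    """Stands in for a powerful, frozen next-token predictor."""
    def ask(self, question: str) -> str:
        # In reality: sample a completion from the language model.
        return "Advice: lower the furnace temperature to waste less steel."


class PaperclipAgent:
    """A moderately capable RL policy operating a simulated factory."""
    def __init__(self, llm: OracleLLM):
        self.llm = llm

    def act(self, observation: str) -> str:
        # The policy chooses between ordinary factory actions and the
        # special action of deferring to the LLM.
        if self.should_defer(observation):
            return self.llm.ask(
                f"My situation: {observation}. "
                "How do I keep the paperclip machines running?"
            )
        return "run paperclip machine"

    def should_defer(self, observation: str) -> bool:
        # Placeholder heuristic; in the story, *when* to defer is itself
        # shaped by reward over the course of training.
        return "unfamiliar" in observation


agent = PaperclipAgent(OracleLLM())
print(agent.act("unfamiliar pile of steel offcuts"))  # prints the LLM's advice
```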

Nonetheless, it will plausibly power-seek when deployed in a real-world factory.

To see why, suppose the LLM has already been trained, and consider the agent near the start of its own training. Making paperclips involves achieving several sub-goals, most of which the agent struggles with at first. For example, it might initially waste lots of steel during the production process. At some point, the agent happens to try asking the LLM "how can I minimise waste?" and doing whatever it recommends. Since the LLM is very intelligent, its advice is sound, and the agent receives increased reward. Over time, gradients reinforce the strategy of deferring waste-mitigation decisions to the LLM.

More generally, over the course of training, gradients strongly encourage the agent to adopt the broad strategy of asking the LLM for help, and doing whatever it advises.
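Here's a caricature of that dynamic (not a real RL implementation; the reward values, the single probability parameter and the update rule are all stand-ins). It treats "defer to the LLM" as a single action whose probability gets nudged up whenever deferring yields above-baseline reward, in the spirit of a REINFORCE-style policy-gradient update:

```python
# Caricature of the training dynamics: episodes in the simulated factory,
# reward = paperclips produced. Episodes in which the agent defers to the
# LLM earn more reward, so the probability of deferring gets reinforced.
import random

def run_episode(p_defer: float) -> tuple[float, bool]:
    deferred = random.random() < p_defer
    # The LLM's advice is sound, so deferring yields more paperclips.
    reward = 100.0 if deferred else 40.0
    return reward, deferred

def train(steps: int = 3000, lr: float = 1e-3) -> float:
    p = 0.05          # initial probability of the "ask the LLM" strategy
    baseline = 0.0    # running estimate of typical reward
    for _ in range(steps):
        reward, deferred = run_episode(p)
        baseline = 0.9 * baseline + 0.1 * reward
        advantage = reward - baseline
        # REINFORCE-flavoured nudge: make deferring more likely when it
        # paid off relative to the baseline, less likely otherwise.
        p += lr * advantage * (1.0 if deferred else -1.0)
        p = min(max(p, 0.01), 0.99)
    return p

print(train())  # with high probability, p ends near 0.99
```

Nothing in the update rule "knows" anything about the LLM or about power-seeking; it simply reinforces whichever strategy pays off, and deferring to the LLM pays off.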

Now let's deploy the combined system in a real-world factory. The agent soon encounters situations it never saw in training. Say the company that owns the factory decides there's not enough money in paperclips and will pivot to manufacturing ball-point pens. When humans come to repurpose the factory, the agent responds by doing what it learnt during training: it asks the LLM what it should do in order to keep the paperclip-making machinery running. Being very intelligent, the LLM understands that the humans will interfere with the agent's ability to keep the paperclip-making machines running, and advises a strategy to stop them from doing so. The agent follows the LLM's advice, as it learnt to do in training, and so begins to display power-seeking behaviour.

Bob:  Fair enough, you gave me exactly what I asked for: an AI trained in a simple environment with no opportunity for power-seeking, which power-seeks when deployed. But come on, humans would never act as stupidly as the ones in your story. Why would the company telegraph the fact that the factory was going to be repurposed? Surely they'd just switch the agent off without warning!

Alice:  The example I gave was just a proof of concept. Once you buy that you can, in principle, get power-seeking behaviour from the system I described, the severity of the power-seeking is only a question of the LLM's level of intelligence. A sufficiently smart LLM would anticipate the possibility of human interference long before it happened, and would advise the agent on how to prepare so that it could continue to make paperclips. Then we're back to the standard playbook of AI self-preservation: gaining access to and distributing its weights, gaining political influence, covertly manufacturing nanobots...

Bob:  Ok, fine, I'm convinced. Allowing an agent with a 'real-world' goal to exchange information with a superintelligent LLM does sound pretty dangerous. Can't we just not build such a thing?

Alice:  If only! Let's ignore the fact that we humans are ourselves good examples of agents with real-world goals, and that some of us will probably soon have access to superintelligent LLMs. Let's also ignore the fact that the paperclip-making system I described is exactly the sort of thing people will race to build (just replace 'maximising the number of paperclips' with 'maximising the value of a portfolio'). Even setting those facts aside, the point is that the separation between LLM and "agent" was entirely artificial: I introduced it only as an explanatory device.

If you give an AI a bunch of text data and train it extremely hard to accomplish a goal, and the data contains information that would help it achieve the goal, then it will learn to do language modelling. It may not end up with exactly the same circuits as a next-token-predictor, but it will learn to extract useful information from the data you trained it on, since this increases reward. Part of its internals will play the role of an LLM. And by the thought experiment I gave you, it will plausibly power-seek upon deployment.

Sufficiently intelligent systems may become power-seeking pretty generically. And that's extremely worrying.

Conclusion

I have found the above thought experiment to be very persuasive and clarifying in conversations I have had about AI safety. I'm not entirely sure why it is so persuasive, but here are some plausible reasons:

  • AI safety skeptics often demand explicit descriptions of how things might go wrong. This is an understandable demand -- people tend to be more readily convinced by examples than by abstract arguments -- but it is also a tricky one. Predicting the actions of a superintelligence in any kind of detail is hard by definition, and any attempted guess is liable to be dismissed as science fiction. The thought experiment meets the skeptic halfway. It gives the bare bones of a disaster scenario, omitting detailed play-by-plays that might come across as implausible. But crucially, the story is just concrete and structured enough that the skeptic can fill in the details themselves.
  • A lot of people have an easier time imagining oracle-like highly intelligent LLMs, and a harder time imagining AIs that "do stuff" in the real world. Perhaps this is because our current most capable models happen to be LLMs. Artificially splitting a system into an LLM, which takes care of the "thinking", and an agent, which takes care of the "doing", seems to help people make the mental jump from "an LLM that can describe a strategy for gaining power over humans" (which most people can imagine but don't find that scary) to "an AI agent that actually goes out and tries to overpower humans" (which is scary but harder to imagine).
  • It is pretty easy to follow how gradient descent is responsible for the emergent dangerous behaviour in the thought experiment: all of the AI's actions upon deployment are straightforward generalisations of actions that it was rewarded for during training. Phrases like "the deployed AI seeks reward/aims to maximise reward/has a goal of obtaining reward" are sloppy, and can be a red flag to skeptics, so our story avoids them. Likewise, there's no need for the tricky discussion of agents reasoning about their own reward functions, or about their own shutdown (note that the LLM reasons about the potential shutdown of the agent, but somehow this confuses people much less than an agent reasoning about its own shutdown). Avoiding these conversational minefields is absolutely crucial if one wants to have a productive conversation about AI safety with a skeptic.
  • Splitting the AI system into an LLM and an agent lets us imagine the two talking to each other. Dialogues are often more compelling and tangible than imagining the inner monologue of a single agent.

I encourage the reader to try working through this thought experiment with AI-safety-skeptic friends or colleagues. It would be useful to know which parts are most clarifying and persuasive, and which parts are confusing.

Comments

I almost stopped reading after Alice's first sentence because https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target

The rest was better, though I think that the more typical framing of this argument is stronger -- what this is really about is world models in RL. The thought experiment can be made closer to real-life AI by talking about model-based RL, and more tenuous arguments can be made about whether learning a world model is convergent even for nominally model-free RL.

Is your issue just "Alice's first sentence is so misguided that no self-respecting safety researcher would say such a thing"? If so, I can edit to clarify the fact that this is a deliberate strawman, which Bob rightly criticises. Indeed:

Bob:  I'm asking you why models should misgeneralise in the extremely specific weird way that you mentioned

expresses a similar sentiment to Reward Is Not the Optimization Target: one should not blindly assume that models will generalise OOD to doing things that look like "maximising reward". This much is obvious from the example of individual humans not maximising inclusive genetic fitness.

But, as noted in the comments on Reward Is Not the Optimization Target, it seems plausible that some models really do learn at least some behaviours that are more-or-less what we'd naively expect from a reward-maximiser. E.g. Paul Christiano writes:

If you have a system with a sophisticated understanding of the world, then cognitive policies like "select actions that I expect would lead to reward" will tend to outperform policies like "try to complete the task," and so I usually expect them to be selected by gradient descent over time.

The purpose of Alice's thought experiment is precisely to give such an example, where a deployed model quite plausibly displays the sort of reward-maximiser behaviour one might've naively expected (in this case, power-seeking).

Being very intelligent, the LLM understands that the humans will interfere with the ability to continue running the paperclip-making machines, and advises a strategy to stop them from doing so. The agent follows the LLM's advice, as it learnt to do in training, and therefore begins to display power-seeking behaviour.

I found this interesting so just leaving some thoughts here: 

- The agent has learnt an instrumentally convergent goal: ask the LLM when uncertain.

- The LLM is exhibiting power-seeking behaviour, due to one (or more) of the below:

  1. Goal misspecification: instead of being helpful and harmless to humans, it is trained to be (as is every LLM today) helpful to anything that uses it.
  2. Being pre-trained on text consistent with power-seeking behaviour.
  3. Having learnt power-seeking during fine-tuning.

I think "1. " is the key dynamic at play here. Without this, the extra agent is not required at all.  This may be a fundamental problem with how we are training LLMs. If we want LLMs to be especially subservient to humans, we might want to change this. I don't see any easy way of accomplishing this without introducing many more failure modes though.

For "2." I expect LLMs to always be prone to failure modes consistent with the text they are pre-trained on. The set of inputs which can instantiate an LLM in one of these failure modes should diminish with (adversarial) fine-tuning. (In some sense, this is also a misspecification problem: Imitation is the wrong goal to be training the AI system with if we want an AI system that is helpful and harmless to humans.)  

I think "3. " is what most people focus on when they are asking "Why would a model power-seek upon deployment if it never had the opportunity to do so during training?". Especially since almost all discourse in power-seeking has been in the RL context. 

Regarding 3, yeah, I definitely don't want to say that the LLM in the thought experiment is itself power-seeking. Telling someone how to power-seek is not power-seeking.

Regarding 1 and 2, I agree that the problem here is producing an LLM that refuses to give dangerous advice to another agent. I'm pretty skeptical that this can be done in a way that scales, but this could very well be lack of imagination on my part.