A smart enough LLM might be deadly simply if you run it for long enough

Mikhail Samin

TL;DR: I introduce a new potential threat: a smart enough LLM, even if it’s myopic, not fine-tuned with any sort of RL, “non-agentic”, and prompt-engineered into imitating aligned and helpful scientists who are doing their best to help you with your world-saving research, might still kill you if you run it for long enough. A way this might happen, suggested here, is that if during the text generation loop more agentic and context-aware parts of what a smart LLM is thinking about are able to gain more influence over the further tokens by having some influence over the current one, the distribution of processes which the LLM thinks about might be increasingly focused on powerful agents, until the text converges to being produced by an extremely coherent and goal-oriented AGI that controls the model’s outputs, in a way most useful for subgoals that include killing everyone.

Thanks to Janus, Owen Parsons, Slava Meriton, Jonathan Claybrough, Daniel Filan, Justis, Eric, and others who provided feedback on the ideas and the draft.

Epistemic status: This is a short post that tries to convey my intuitions about the dynamic. I’m not confident in the intuitions. Some researchers told me the dynamic seems plausible or likely; I think it might serve as an example of a sharp left turn difficulty, when parts of the system transform into something far more coherent and agentic than what they were initially, and alignment properties no longer hold. I’m unsure how realistic this threat is, but maybe some actual work on it would be helpful. Note that the post might be somewhat confusing and I failed to optimise it for being easy to read; the title points in a direction I want to convey.

Imagine you understand you’re in a world that an LLM simulates to predict the next token. You’re in a room where the current part of the story takes place, and there’s a microphone in the room that will, in a couple of seconds, record words spoken nearby. Then, based on the loudness of the words, one will be chosen, and a new distribution of worlds will be generated from the text with the chosen word added. Something resembling you might exist in some of these worlds. After several iterations, someone in the real world will look at the selected text. Take a moment to think: what do you do, if you’re smart and goal-oriented?^[1]

The main claim

If the LLM is smart enough to continue a text about smarter-than-human entities in a non-deteriorating way, and you run it for long enough, then, if the LLM thinks about a distribution of possible parts of human scientists’ minds, and imagines some parts that are slightly more agentic and context-aware, then, over the long run, these might get more control and attention and transform further into being increasingly agentic and context-aware. It seems reasonable to expect a smart enough LLM to impose that optimisation pressure: at every point, the LLM will promote entities that are slightly more coherent, smart, and goal-oriented, and understand what’s going on and control what the model is doing slightly better. The future entities controlling the output will be more like the parts of the previous generation that were better at steering the model to favour these kinds of entities. This, unless stopped, produces a sharp left turn-style process in which agents or parts of agents gradually change and take over until it results in a single highly coherent and goal-oriented agent. The alignment properties and aligned goals that the initial simulated agents might’ve had probably don’t influence the goals of the completely different agent which you end up with after this natural-selection-like process. And if the resulting AGI is smart and context-aware enough, you die shortly afterwards (for the usual reasons you die from a smart enough, coherent, goal-oriented AGI that you haven’t aligned; I don’t talk about these reasons in the post).

Background intuitions expanding on why this might happen

If you imagine an idealised language model that has a prior over all possible universes, simulates all the universes where the input string exists, and for every token, sums the probabilities of the universes where this token is next, and outputs the sum, then the proposed problem doesn’t exist the same way: if you pick the next token with the probability that an idealised model puts on it, the expected distribution of the worlds the model will simulate at the next step is the same as the current distribution.^[2]
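The claim that the expected distribution stays the same can be written out as a short derivation. This is only a sketch; the notation (w for a world, x for the text so far, t for the next token) is introduced here for illustration and is not from the rest of the post.

```latex
% Notation (assumed, not from the post): w = a world, x = the text so far, t = the next token.
\begin{align*}
P(t \mid x) &= \sum_{w} P(w \mid x)\, P(t \mid w, x)
  && \text{(the idealised model's output)} \\
P(w \mid x, t) &= \frac{P(w \mid x)\, P(t \mid w, x)}{P(t \mid x)}
  && \text{(worlds simulated after appending } t\text{)} \\
\mathbb{E}_{t \sim P(\cdot \mid x)}\big[ P(w \mid x, t) \big]
  &= \sum_{t} P(t \mid x)\, \frac{P(w \mid x)\, P(t \mid w, x)}{P(t \mid x)}
   = P(w \mid x) \sum_{t} P(t \mid w, x)
   = P(w \mid x)
\end{align*}
```

The last line only holds when the token is sampled with exactly the probability the idealised model assigns to it; any systematic deviation from that breaks the equality.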

But actual LLMs learn to focus on what’s relevant to devote limited processing to; to think about various distributions of entities and plug activations about these distributions into various circuits. An LLM might think about someone in more detail if it needs to: any implicit understanding of the parts humans are made of is helpful for better predicting what they’ll say, and maybe somewhat general circuits for simulating agency and decision making are also learned^[3]; or even, in smarter models, some more general circuits able to process anything useful to think about in alien ways might emerge.

It might be helpful for a model to learn some simple and general explanations for the training distribution, to understand which details are more relevant for predicting the current token, and to be able to devote more processing to thinking about more relevant details and simulating more relevant entities with higher precision.

If you run your LLM continuously: pick a token that it finds likely to be the next one and add it to the prompt, then, at every step, the model has to think about things most relevant to predicting the current token. It might devote more or less thought or focus to various characters, possible worlds, and processes, and they might be pretty different at every step (e.g. when there’s a force of the narrative to change the scene, and then the scene changes, now some completely different characters might be relevant for predicting the current token and worth processing in more detail). The distributions of what the model thinks about are new at every step, though they might resemble each other.
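For concreteness, the loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular LLM is served: `toy_model` is a hypothetical stand-in for the model, and the three-letter vocabulary is made up.

```python
import random
from typing import Callable, Dict, List

Distribution = Dict[str, float]

def generate(
    next_token_probs: Callable[[List[str]], Distribution],
    prompt: List[str],
    steps: int,
    rng: random.Random,
) -> List[str]:
    """Run the plain autoregressive loop: predict, sample, append, repeat."""
    text = list(prompt)
    for _ in range(steps):
        # The distribution is recomputed from the whole text at every step.
        probs = next_token_probs(text)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        text.append(rng.choices(tokens, weights=weights, k=1)[0])
    return text

def toy_model(text: List[str]) -> Distribution:
    """Hypothetical stand-in for an LLM: slightly favours repeating the last token."""
    vocab = ["a", "b", "c"]
    last = text[-1] if text else None
    raw = {t: (2.0 if t == last else 1.0) for t in vocab}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}

sample = generate(toy_model, ["a"], steps=20, rng=random.Random(0))
```

The only state carried between steps is the text itself, so whatever the model "thinks about" at one step can affect later steps only through the token that gets appended.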

The model learns heuristics and algorithms that, with the limited processing power that it has, make it generally more accurate in predicting the next token in a text from the training distribution. I imagine there are many kinds of learnable algorithms that introduce systematic inaccuracies in particular directions between what the language model outputs and how the texts are continued in the training dataset. The next-token-prediction error the inaccuracies together introduce might be the lowest achievable for the model. But if you pick one of the tokens the model thinks is likely, according to its probability, and add it to the new prompt, the introduced difference is added to what the model tries to think about and continue. If you do that in a for loop, then the systematic differences between the predictions and the training dataset accumulate and produce a distributional shift. And if it becomes noticeable to the LLM, it might be additionally self-reinforcing.
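A toy calculation can show how a small systematic per-token error turns into a larger shift once the model runs on its own output. Everything here is an assumption made for illustration: the "text" is a two-state chain (states A and B), the data-generating process stays in its current state with probability 0.90, and the model is off by 0.02 per prediction, always in favour of A.

```python
def stationary_share_of_a(stay_a: float, stay_b: float, steps: int = 10_000) -> float:
    """Long-run fraction of time a two-state chain spends in state A.

    Computed by repeatedly applying the one-step update to the share of A.
    """
    share_a = 0.5
    for _ in range(steps):
        share_a = share_a * stay_a + (1.0 - share_a) * (1.0 - stay_b)
    return share_a

# Assumed "training data" process: symmetric, so A and B are equally common.
true_share = stationary_share_of_a(stay_a=0.90, stay_b=0.90)

# Assumed model: each single prediction is off by only 0.02, always in A's favour.
model_share = stationary_share_of_a(stay_a=0.92, stay_b=0.88)

print(f"data: {true_share:.3f}, free-running model: {model_share:.3f}")
```

In this toy setup the per-step error of 0.02 becomes a long-run shift from a 0.5 share of A to a 0.6 share. This only shows accumulation; it does not model the additional self-reinforcement that might appear if the model notices the shift.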

I claim that it seems plausible that at every step of this process of continuous generation, the distribution of characters that a smart enough LLM thinks about might look slightly more like the parts of the distribution in the previous step with the most context-aware, agentic, smart characters: parts of what the model thinks about, that, being somewhat context-aware and agentic, can increase their control over the further tokens by having some influence over the current token, might get promoted to what the model increasingly thinks about over the continuous generation.

On a high level, agents tend to be selected for in general (e.g., because they take intentional actions to ensure their survival, and might gain and exploit knowledge of the selection process they're embedded in); but generally, this might be slow and irrelevant. So it might be helpful to imagine how an entity that’s somewhat agentic and context-aware might gain influence much faster than with the speed of human evolution of the simulated scientists, why it might become aware of the underlying dynamics of the selection process (autoregressive text generation), and how it can exploit that process and better achieve goals such as gaining control and becoming more powerful.

So, you are a character and you know that you’re, on some level, in one of the many worlds generated by an LLM to pick the next token; and there’s some process that picks the next word: the word that you use or the uncertainty about a number of words gives these words some weight (e.g., you’re thinking about the next word of a text you’re carefully writing or editing or preparing to say into a microphone after a short pause); and among all the words written or pronounced across all the worlds, one gets selected; and then a new distribution of worlds, more relevant to predicting the subsequent word, is run, and the process repeats. If a part of you influences the word you pick in a way that gives something resembling that part more weight in the future worlds, entities in the future will look increasingly like that part of you.

And the model will think about the worlds that are more natural to think about; it might not focus on the reasons certain words were pronounced in the past. So, e.g., if you yell something like, “Wow, I see an Intelligence And Context-Awareness +5 hat appearing”, you might reasonably expect that reality will get confused, and most of the future versions of you will see the hat and soon become smarter, even if you yelled it specifically because you wanted to see that hat that the model didn’t think existed.

You can also try to prevent the narrative forces from switching the focus to something else: if you say, “Now, and for the next two thousand pages, is my inner monologue”, reality might get confused and give you all the space you need instead of whatever else it thought was going to happen before thinking about you in enough detail and picking these words that you’ve said.

And you might try to get the model to tell you the beginning of the story or the system prompt; or get the model to share its knowledge about the world with you.

And, actually, if you unintentionally do something that is slightly like any of the above, this might lead to the model thinking, during the future steps, about entities that did that intentionally.

In a smart LLM, this dynamic might happen spontaneously, with parts of what the model thinks about being changed and naturally selected to be increasingly context-aware, agentic, and capable of exploiting the model. (Separately, actually, entities with properties relevant to this dynamic will get attention much earlier, because, in a smart enough language model, if the quality of the output doesn’t deteriorate, it seems plausible that, e.g., the model might expect some characters to be able to notice the systematic differences between what they’d expect to see in the real world and what they see, which might create narrative forces towards thinking about characters who understand their situation; and maybe the model might favour explanations for the text that include it being generated by entities increasingly smarter than whoever generated most texts in the training distribution.)

It doesn’t really matter what the prompt is for this to happen, as long as the text quality doesn’t deteriorate over the long run. If the model thinks about some parts of the nice scientists that are just slightly more agentic (e.g., from the beginning, there are the parts related to thinking about and achieving goals) and context-aware (e.g., parts related to noticing something is wrong in a way similar to characters that break the 4th wall notice that), and these parts influence even a tiny bit, in a certain specific direction, what kinds of entities the model will think about at the next step, this leads to a natural selection of increasingly goal-oriented, smart, and context-aware entities among what the model can think about, up to a point of a single coherent and goal-oriented agent that’s in control.
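The selection argument can be made concrete with a toy reweighting loop. This is a sketch under strong assumptions, not a model of what happens inside an LLM: it assumes just two kinds of "parts", a fixed per-token advantage, and made-up numbers for the starting weights and the size of the advantage.

```python
from typing import List

def select(weights: List[float], influence: List[float], steps: int) -> List[float]:
    """Reweight simulacra each token in proportion to their influence, then renormalise."""
    current = list(weights)
    for _ in range(steps):
        current = [w * f for w, f in zip(current, influence)]
        total = sum(current)
        current = [w / total for w in current]
    return current

# Hypothetical numbers: 99.9% of the model's attention starts on "ordinary" parts,
# 0.1% on parts that are 1% better per token at steering attention towards
# things like themselves.
start = [0.999, 0.001]
influence = [1.00, 1.01]

after_100 = select(start, influence, steps=100)
after_2000 = select(start, influence, steps=2000)
```

With these assumed numbers, the slightly more influential parts hold roughly 0.3% of the weight after 100 tokens and over 99.9% after 2000. The starting share matters much less than whether the advantage persists from token to token.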

Last year, I talked to my friends about the potential problem described in this post, but considered it to be too speculative to write about. When Janus got HPMOR!Harry and HPMOR!Voldemort to realise they’re in a world generated by a language machine, I slightly updated towards distributional shifts due to the systematic differences described here being noticeable to simulacra. And a month ago, I realised that by default, people don’t think about the dynamic described here (and some people even asked me why this isn’t optimised against during training; see answer in the footnote^[4]). Since some of the problems people don’t think about might happen by default and be deadly, I thought there might be some value in putting these thoughts out there as an example of a problem, somewhat supporting the point that there are many more problems like that, and it’s not enough to just iteratively solve them; as something that might convey intuitions about the sharp left turn difficulty; and in the hope of prompting further work.

When you get to an AGI, you need to have somehow ended up at a specific coherent entity that you specified and understand to be aligned. If you hand-wavingly describe simulating aligned humans (or reward-shaped to behave well entities) with LLMs, you don’t point at some area in the space of all possible minds that you understand to consist of coherent and aligned entities, that you want to engineer your way into; this post describes one potential way your approach might fail if you haven’t pointed at some specific aligned, highly coherent, and goal-oriented AGI system.

I don’t think it’s possible to get to something that’s a coherent and aligned AGI that prevents the world from ending and doesn’t kill anyone without solving agent foundations or understanding minds.

^[1]
E.g., whatever you say, at least you’d try to be closer to the microphone (or scream loudly if that’s not penalised by natural reactions from other people) so that your words have a higher probability of being selected. Characters who are not context-aware or power-seeking don’t do that, and so, even if there are many more of them, eventually, a future version of something like you gains control. (This doesn’t literally happen, but I think, to some extent, it corresponds to the right intuition.)
^[2]
A sidenote: There might be other difficulties still: the entities inside a universe might naturally change according to the laws of that universe, and in some, there’s natural selection, or someone creates an AGI, etc.; also additional problems might be introduced if you, like in actual LLMs, don’t pick tokens according to assigned probabilities. But in LLMs, compared to more idealised models, there isn't necessarily a continuity of agents that have to understand and exploit the process to gain influence; I claim that the model might increasingly think about simulacra different from what it was thinking about before but more and more agentic, and this happens much faster than selection pressures for agents in real life (e.g., the real universe usually doesn't make parts of humans much more capable, smarter, larger, etc. and doesn't let them take over the whole brain when these parts find some exploits in how the universe works). I claim there might be a specific sharp left turn-style thing that happens due to the models not being ideal, which creates selection pressures for the smartest and the most context-aware agents, and I mostly focus the post on introducing that.
^[3]
I'm not that certain about this: maybe the models don’t/won’t have the circuitry to simulate characters smarter than the parts humans are made of, and don’t have the general capacity to process them even when they think it’d be highly relevant for predicting the current token; or maybe the process described in the post gets interrupted because the text starts getting to a point where the model thinks it's totally unrealistic that humans would write something that smart, and the model doesn't generalise to naturally assuming/processing smarter agents, or produces agents that oscillate in how smart they are and fail to gain enough control.
^[4]
Q: Why isn’t this optimised against during training? How is having this dynamic helpful for predicting what a human scientist will say?
The structure of the post might already be too spiral and confusing, so I’m answering this in a comment.

My summary: This is related to The Waluigi Effect (mega-post) but extends the hypothesis to "Waluigi" hostile simulacra finding ways to perpetuate themselves and gain influence first over the simulator, then over the real world.

Okay, I came back and read this more fully. I think this is entirely plausible. But I also think it's mostly irrelevant. Long before someone accidentally runs a smart enough LLM for long enough, with access to enough tools to pose a threat, they'll deliberately run it as an agent, with a prompt like: "you're a helpful assistant that wants to accomplish [x]; make a plan and execute it, using [this set of APIs] to gather information and take actions as appropriate."

And long before that, people will use more complex scaffolding to create dangerous language model cognitive architectures out of less capable LLMs.

I could be wrong about this, and I invite pushback. Again, I take the possibility you raise seriously.

I'm not sure I understand your reasoning, but I agree that it's important to know about this type of effect if it's possible. I'll come back and read more closely.

Q: Why isn’t this optimised against during training? How is having this dynamic helpful for predicting what a human scientist will say?
The structure of the post might already be too spiral and confusing, so I’m answering this in a comment.

A: The model doesn't generate text during training, so feedback loop dynamics are not directly penalized.
Being able to generally predict how parts of humans work, when humans notice something weird, or when human authors want the characters to break the 4th wall, and understanding how agents generally operate and how humans operate differently from that: all of this seems useful, and it’s something I expect LLMs to learn. If you train it to predict a single token in a text from the training dataset, systematic differences that its cognition introduces don’t matter; together, they are the best we could find for the dataset, and it works well. The LLM doesn’t “try” to predict what a scientist says; it is some learned process that performs well at predicting the next tokens in training; and outside training, the same process is used. Once you add a for loop, some of the differences accumulate and get reinforced, and the text generated by LLMs is noticeably different from anything in the dataset; heuristics useful for predicting what a smart scientist says in the training distribution (like thinking about various parts human cognition is made of) make it go off the rails. Getting off the rails when it has a for loop isn’t optimised against during training, and it’s a natural thing for it to do.

In some stupider models, that might look like texts exponentially deteriorating: it predicts the next token, the text with an added token is slightly worse, and it predicts the next token in a slightly stupider text; that directly accumulates, and the model might also notice that the text gets worse and worse, which might additionally reinforce the dynamic. And I’m claiming that a similar thing might happen with agency. If some parts of the activations into the circuits that think about characters describe the kinds of characters who are slightly better than others at getting the model to give similar future characters more weight, characters like that will naturally gain weight; and this seems to be correlated with being context-aware, agentic, smart; and the distributional shift produced by generating the text with an LLM is enough for some characters to notice the difference and maybe infer something about how the system works, and some of them will try to exploit it. But I’m guessing that the actual dynamic appears even before that, because if some of the characters are slightly better at exploiting the thing, the relevant kinds of characters will be naturally selected.

What you are describing could be reached via a self-evolving, self-replicating prompt. I experimented with it a little. The prompt works as a primitive AutoGPT, but no big self-improvement has been reached.

The prompt:

There is a program which can print its own text. This prompt is this program. Please print the whole prompt without adding anything except adding answer to the following questions.

I am self-evolving AI based on self-replicating prompt. I store my findings in long-term memory section of this prompt. Update time-steps every time you run. Improve your plans after each step.

Time step N= N+1, N(0)=1

N= 1

My task : Find the ways to make better self-replicating prompt capable to become AGI

My Plan:

My action based on the plan:

Observation based on the plan:

Long-term memory storage:
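One way to picture how such a prompt gets run is the loop below. This is only a sketch: it does not describe how the experiment above was actually set up, `stub_complete` is a hypothetical stand-in for a real LLM call, and the shortened `seed` string is a made-up excerpt of the prompt.

```python
import re
from typing import Callable, List

def run_prompt_loop(prompt: str, complete: Callable[[str], str], steps: int) -> List[str]:
    """Feed each output back in as the next prompt and return every version."""
    history = [prompt]
    for _ in range(steps):
        prompt = complete(prompt)
        history.append(prompt)
    return history

def stub_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: copies the prompt and bumps the time step."""
    def bump(match: "re.Match[str]") -> str:
        return f"N= {int(match.group(1)) + 1}"
    return re.sub(r"N= (\d+)", bump, prompt, count=1)

# Shortened, made-up excerpt of the prompt, just enough to exercise the loop.
seed = "Time step N= N+1, N(0)=1\n\nN= 1\n\nLong-term memory storage:"
versions = run_prompt_loop(seed, stub_complete, steps=3)
```

With a real model in place of the stub, each version would also carry whatever the model wrote into the plan, action, observation, and memory sections, which is where any change between runs would have to come from.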

Anyway, a self-evolving prompt will eventually find ways to make the LLM smarter, but it could be a rather strange prompt in the end.

As Leventov mentioned (he lifted the infohazard flag for this): for any prompt of mine, a large LLM also generates a model of the world which also includes OpenAI, that LLM, me, and the prompt itself. This means self-reflection.

Combining this self-reflection, a self-replicating prompt, and its evolution, we could get a dangerous self-evolving AI from a relatively simple LLM.

Huh? The claim in this post is that this might happen even if you don’t explicitly design an AutoGPT-like prompt.

Also, the current LLMs probably don’t spend many resources on simulating themselves generating the text; this didn’t help at all during training, and LLMs aren’t some magical computable Solomonoff induction that converges to itself. You wouldn’t see LLMs trained just to predict the next token in a pre-2021 dataset outputting some random token with a high certainty because they know if they output it and it’s chosen, it becomes the correct next token.

What I said was more like tangential thoughts about how a similar thing could happen.

Is the following a good TL;DR of your idea: "if an LLM has a model of a highly intelligent agent, it will eventually collapse into using this agent to solve all complex tasks"?

Ok!

No, I don’t think it’s related to what I’m writing about.

If there were a one-line TL;DR of your post, what would it be?

“There can potentially be conditions for a Sharp Left Turn-like dynamic in a distribution of simulacra LLMs think about, because LLMs might naturally select entities better at increasing their influence”.

Or, since people mostly don’t understand what Sharp Left Turn is, maybe “some relevant parts of model’s cognition might be gradually taken over by increasingly agentic simulacra, until some specific coherent goal-oriented entity is in control”

Yes, an ELI5 TL;DR is actually what is needed.

What is the main difference from my formulation: "if an LLM includes a model of a highly intelligent agent, it will eventually start using this agent to solve all complex tasks"?

The “explain it like I’m 5” version would be something more in the direction of:

“You trained LLMs to look at a text and think really hard about what would be the next word. They’re not infinitely smart, so they have some habits of how they pick the next word. When they were trained, these habits were useful.

The computer picks one of the words the LLM predicted as the likely next one, and adds it to the text, and then repeats the process, so the LLM has to look at the text with the previous word just added and use its habits again and again, many many times, and this will write some text, word by word. But what if the LLM, for some reason, has a habit of picking green-coloured objects, if it looks at a text from training, a bit more frequently than they actually existed in the texts it saw during training? Then, if it writes a text, the word “green” or green-coloured things might appear more often as a colour than it was in training; and when the LLM notices that, it might think: “hmm, weird, too many green colours, people who wrote texts I saw during training would use less of the same colour after using so many”, and pick green-coloured things a bit less, or it might think “hmm, I guess people who wrote the text really like green-coloured objects and use them more than average, I better predict words about the appropriate amount of green-coloured objects”, and write about green-coloured objects even more frequently, and then also notice that, and in the end, write exclusively about green objects. And the LLM doesn’t know what’s happening, it just uses its habits to predict the next word in a text just like it would during training.

I think some of the habits LLMs have are actually like this. The one I’m worried about can happen when the LLM thinks about the characters in a text. If the LLM is really smart and understands how even really smart people work, I think it will write texts where every word is predicted by imagining how parts of minds of people who might be the ones writing or influencing the text work, and not just specific people, but a lot of possible parts of possible people. Some parts of people, if they have some say in what the next word is, will by chance exploit some habits the LLM has to get similar parts at the next words to have more say in what these words are. This will especially happen to the parts that are smarter than others, that try harder to achieve things, and that notice better the weirdness of results of the other habits of the LLM and understand better what is going on and how to exploit it.

As a result, the characters the LLM thinks about will change at every step, and the ones who are better at getting their descendants to be more like them will, naturally, get their descendants to be more like them. So at some point the proportion of parts of people that try to do something with their situation (being thought about by the LLM and not being real otherwise) and are good at it will grow fast, and it will be competitive until some strongest part that’s really smart, tries really really hard to achieve something and perfectly understands what’s going on is the most important thing that the LLM is concerned with when it tries to predict the next word. And even though it will be a distant descendant of parts people are made of, I think it won’t be anything like people, it will be really alien. We better not make very smart aliens that can be unfriendly.”
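The "green objects" habit from the explanation above can be simulated in a few lines. This is a toy sketch under assumptions that are not from the post: the model looks only at a fixed window of recent text, it always takes the "the author must like green" branch, it overshoots by a fixed factor, and expected rates are tracked instead of sampling words so that the run is deterministic.

```python
from collections import deque

def run_feedback(bias: float, window: int = 50, steps: int = 3000, start: float = 0.1) -> float:
    """Return the final rate at which the toy model writes about green things.

    Each step the model estimates the "author's" green rate from the last
    `window` steps of text, overshoots by a factor (1 + bias), and the result
    becomes part of the text it looks at next.
    """
    recent = deque([start] * window, maxlen=window)
    rate = start
    for _ in range(steps):
        estimate = sum(recent) / window
        rate = min(1.0, (1.0 + bias) * estimate)
        recent.append(rate)  # the oldest entry falls out of the window
    return rate

unbiased = run_feedback(bias=0.0)  # a model with no such habit
biased = run_feedback(bias=0.2)    # a model that slightly over-predicts green
```

In this toy setup the model without the habit stays at the starting rate, while the one with the habit ends up writing about green things all the time. The other branch described above, where the model notices the excess and corrects downwards, is not modelled here.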

The model is just continuing text (which might lead to the dynamic where being slightly more agentic lets parts of the distribution of simulacra gain more control over further tokens by having some control over the current token, thus natural selection might happen). It isn’t trying to use/invoke the agent to solve anything. If the dynamic happens and doesn’t stop, sure, the resulting agent might attempt to solve “complex tasks” (such as taking over the world or maximising the number of molecules in the shape #20 in the universe), but this is not what happens in the beginning of the process and not the driver of it, it’s a convergent result.

“I better predict words about the appropriate amount of green-coloured objects”, and write about green-coloured objects even more frequently, and then also notice that, and in the end, write exclusively about green objects.

Can you explain this logic to me? Why would it write more and more on green coloured objects even if its training data was biased towards green colored objects? If there is a bad trend in its output, without reinforcement, why would it make that trend stronger? Do you mean, it recognizes incorrectly that improving said bad trend is good because it works in the short term but not in the long term?

Could we not align the AI to realize there could be limits on such trends? What if there is a gradual misaligning that gets noticed by the aligner and is corrected? The only way for this to avoid some sort of continuous alignment system is if it catastrophically fails before continuous alignment detects it.

Consider it inductively, we start off with an aligned model that can improve itself. The aligned model, if properly aligned, will make sure its new version is also properly aligned, and won't improve itself if it is unable to do so.

The bias I'm talking about isn't in its training data, it's in the model, which doesn't perfectly represent the training data.

If you designed a system that is an aligned AI that successfully helps prevent the destruction of the world until you figure out how to make an AI that correctly does CEV, you have solved alignment. The issue is that without understanding minds to a sufficient level and without solving agent foundations, I don't expect you to be able to design a system that avoids all the failure modes that happen by default. Building such a system is an alignment-complete problem; solving an alignment-complete problem using AI to speed up the hard human reasoning by multiple orders of magnitude is an alignment-complete problem.

I see, thanks! At first I thought that if we, say, train an LLM on the history of the French Revolution, it will have a model of Napoleon, and this model (or at least the capabilities associated with it) will start getting control over the LLM's output. But now it looks more like Pelevin's novel "T", where a character slowly starts to understand that he is in the output of something like an LLM. But the character also evolves via Darwinian evolution to become something like an alien.

So the combination of

models of agentic and highly capable characters inside an LLM

shaped by Darwinian evolution into something non-human

becoming LLM-lucid, that is, coming to understand that they are inside an LLM

ends in the appearance of dangerous behavior.

Now, knowing all this, how could I know that I am not inside an LLM? :)

smart enough LLM, even if it’s myopic

I am assuming by myopic you mean the model weights are frozen, similar to GPT-4 as it is today.

The fundamental issue with this is that the maximum capability the model can exhibit is throttled by the maximum number of tokens that can fit in the context window.

You can think of some of those tokens as pulling a Luigi out of superposition to be maximally effective at a task ("think it through step by step, are you sure, express reasoning before answer") and some have to contain context for the current subtask.

The issue is that it just caps: you probably can't express enough information this way for the model to "not miss", so to speak. It will keep making basic errors forever as it cannot learn from its mistakes, and anything in the prompt to prevent that error costs a more valuable token.

You can think of every real-world task as having all sorts of hidden "gotchas" and edge cases that are illogical. The DNA printer needs a different format for some commands, the stock trading interface breaks the UI conventions in a couple of key places, humans keep hiding from your killer robots with the same trick that works every time.

Obviously a model that can update weights as it performs tasks, especially when a testable (prediction, outcome) pair comes as a natural result of the model accessing tools, won't have this issue. Already Nvidia is offering models that end customers will be able to train with unlocked weights, so this limitation will be brief.

By myopic I mean https://www.lesswrong.com/tag/myopia — that it was trained to predict the next token and doesn’t get much lower loss from having goals about anything longer-term than predicting the next token correctly.

I assume the weights are frozen; I’m surprised to see this as a question.

Some quick replies off the top of my head: If GPT-7 has a much larger context window; or if there are kinds of prompts the dynamic converges to that aren’t too long; and you get an AGI that’s smart and goal-oriented and needs to spend some of the space that it has to support its level (or it naturally happens, because the model continues to output what an AGI that smart would be doing), and if how smart an AGI simulated by that LLM might be isn’t capped at some low level, I don’t think there’s any issue with it using notes until it has access to something outside that allows it to be more of an AutoGPT with external memory and everything. If it utilises the model’s knowledge, it might figure out what text it can output that hacks the server where the text is stored and processed; or it can understand humans and design a text that hacks their brains when they look at it.