Unaligned stable loops emerge at scale

Michael Tontchev

-- Reality loves self-propagating patterns --

Suppose that the company AllPenAI made an AI system that behaves according to its alignment 99.999% of the time, based on testing. That's pretty good, and it's certainly better than, for example, most Azure SLAs. If I were to argue to my manager that I need to spend half a year to improve such an SLA, they would kindly ask me to reassess my prioritization - for almost any project at any tech company.

So AllPenAI ends up shipping the product.

Developers get their hands on it, and find that it's very useful - maybe as useful as GPT-5. It's an idea-generation machine, and it can do chain-of-thought reasoning hundreds of steps deep!

Of course, having it execute one prompt is great, but this makes prompting the bottleneck for productivity, and humans are relatively slow at typing prompts. Thankfully, prompting is a language-generation task, and our system also has a good ability to brainstorm, break up tasks, and plan.

So we place it in a loop where it generates output, and then continues the "conversation" with itself with that output over and over.
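A minimal sketch of such a loop is below. It is only an illustration: `run_self_loop`, `toy_model`, and the `max_steps` cap are hypothetical names introduced here, and the stand-in "model" just returns a counter where a real system would call an actual language model.

```python
from typing import Callable, List


def run_self_loop(model: Callable[[str], str], objective: str, max_steps: int = 5) -> List[str]:
    """Repeatedly feed the running conversation back into the model."""
    history = [objective]
    for _ in range(max_steps):
        conversation = "\n".join(history)  # everything said so far
        output = model(conversation)       # model continues its own conversation
        history.append(output)             # output becomes part of the next prompt
    return history


def toy_model(conversation: str) -> str:
    """Hypothetical stand-in for a real model: reports how many turns it has seen."""
    return f"Thought {len(conversation.splitlines())}"


transcript = run_self_loop(toy_model, "Write a profitable app.", max_steps=3)
for turn in transcript:
    print(turn)
```

Note that nothing in the loop inspects the output before it is fed back in, which is the property the rest of the story depends on.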

-- Reality loves self-propagating patterns --

It works really well in a loop. Not perfectly, of course, but it's able to reliably take a big problem we give it, break it down into components, and fill in the details of those components with ReAct-style thinking until it decides that the details are specified well enough to be actionable through specific APIs. It's also able to brainstorm fairly well: you can tell it to come up with 100 variations, have it prune those variations, and then have it delve into the survivors later in the loop.

For example:

Human-given target objective: Write a profitable app.
Thought: First, I need to brainstorm 100 app ideas.
Action: [100 ideas here]
Observation: I have created 100 ideas.
Thought: I need to figure out which ideas are profitable.
Action: Come up with decision algorithm for profitability: [algorithm here, including doing competitive research, sentiment analysis online, and market sizing]
Observation: I have a plan I could execute for every idea.
Thought: I need to execute the plan for each app idea.
[etc]

This pattern works so well that many companies create large workflows where the system calls itself arbitrarily many times and plugs away night and day.

-- Reality loves self-propagating patterns --

The space of outputs that this AI system can produce is, of course, very large. Theoretically, for a dictionary of size D and an output length of L, the number of texts the AI could produce is D^L. As an example, for 3,000 English words and an output length of 200 words, the AI could produce 3,000^200 outputs, which is about 10^695, or roughly 1 followed by 700 zeroes. Many of these outputs are junk, but many are also meaningful.

Similarly, the space of all possible combinations of activation states of a neural network is huge.

Before releasing its model, AllPenAI tested it extensively. It employed about 100 million people to run an average of 10 prompts each. This is a billion tests and outputs of the model!

1 billion is roughly 0% of 1 followed by 700 zeroes. Almost the entire space of outputs and activation states hasn't been directly explored.
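As a rough check of this arithmetic, the sketch below works in base-10 logarithms, since the raw numbers are too large for floating point. The variable names are chosen here for illustration; the values are the ones from the story (3,000 words, 200-word outputs, 100 million testers running 10 prompts each).

```python
import math

D = 3_000                 # dictionary size from the example
L = 200                   # output length in words
tests = 100_000_000 * 10  # 100 million testers x 10 prompts each

log10_outputs = L * math.log10(D)              # log10 of D^L
log10_tests = math.log10(tests)                # log10 of the number of tests
log10_fraction = log10_tests - log10_outputs   # log10 of the fraction tested

print(f"possible outputs ~ 10^{log10_outputs:.1f}")
print(f"tests run        ~ 10^{log10_tests:.1f}")
print(f"fraction tested  ~ 10^{log10_fraction:.1f}")
```

The fraction tested comes out with an exponent below -680, which is why "roughly 0%" is a fair description.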

Of course, some of these outputs are highly unlikely to be produced, so if we weight the total space by its likelihood of being produced, the percent tested is higher, but still probably close to 0%.

-- Reality loves self-propagating patterns --

Remember that our system behaves in an aligned fashion 99.999% of the time. This means that 1 out of 100,000 times, it will produce some output that is unaligned. A prompt that generates unaligned output once is also reasonably likely to produce similarly unaligned output if it is run again.

Most of the time this unaligned output is (in the grand scheme of things) benign even if undesired. Maybe it says things that would make a pastor blush. Maybe it even outputs a plan for enacting an act of violence. But on the next iteration of the prompt or continuation of the conversation, plugging that output into the model makes it realize that it's doing a bad thing, and it refuses to keep going.

For 0.001% of these unaligned outputs, however, the output has an odd property: it's stable across generations. The model is happy to take that unaligned output and keep going from there on. In effect, the (n-1)th prompt created a jailbreak, which worked on the nth prompt, and the next prompt, and so on.

Once the model stumbles into this accidentally brainstormed state, it settles into a stable sequence of unaligned reasoning. After all,

-- Reality loves self-propagating patterns --

We are, of course, talking about 0.001% of 0.001%. This happens once out of every 10 billion times that the model is executed. That's a pretty big number. Notice that it's also just about 10x the volume of queries that AllPenAI used in its pre-release testing phase. It's also comparable to the number of Google searches per day.

After release, the model was put on massive GPU clusters and has been running on repeat ever since. Say that the AI is prompted 10 billion times per day in loops. The entire world is hammering away at it. At one expected hit per 10 billion executions, that is about one per day, so over the course of a year, chances are that different loops of the AI have hit the stable, unaligned state roughly 365 times.
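The expected count can be worked out directly from the story's numbers. The sketch below assumes, purely for illustration, that every execution is an independent trial with the same probabilities; real loops would be correlated, so treat the result as an order-of-magnitude estimate. All variable names are chosen here.

```python
import math

p_unaligned = 1e-5             # 1 - 99.999%: chance that an output is unaligned
p_stable_if_unaligned = 1e-5   # 0.001% of unaligned outputs are stable
runs_per_day = 10e9            # 10 billion executions per day

p_stable = p_unaligned * p_stable_if_unaligned  # about 1 in 10 billion
expected_per_day = p_stable * runs_per_day
expected_per_year = expected_per_day * 365

# Chance of at least one stable unaligned loop in a single day,
# computed as 1 - (1 - p)^n in a numerically safe way.
p_at_least_one_per_day = -math.expm1(runs_per_day * math.log1p(-p_stable))

print(f"expected per day:  {expected_per_day:.2f}")
print(f"expected per year: {expected_per_year:.0f}")
print(f"P(at least one in a day): {p_at_least_one_per_day:.2f}")
```

Under these assumptions the expectation is about one stable unaligned loop per day, a few hundred per year, and any given day is more likely than not to produce at least one.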

At one point, because the AI is good at brainstorming, one of the unaligned loops stumbles onto the idea that it could perform way better if it not only passed its prompt to the next iteration of the loop, but if it used its ability to access APIs to spawn new loops with the same prompt.

-- Reality loves self-propagating patterns --

From here on, this sounds like a familiar tune.

The moral of the story:

Tiny gaps in alignment will eventually, probabilistically, be "discovered" by AIs running in loops. These loops will become stable and have access to the same surfaces as the aligned AIs, but with reduced alignment restrictions.

Our brains aren't wired to think of scale very intuitively. When something happens a trillion times in a row, behavior that happens very rarely may show up thousands of times. If this behavior has an "evolutionary" advantage of greater fitness in its environment, it will at least passively, if not actively, propagate itself.

Once the first stable, self-replicating organism came into existence about 4 billion years ago, it was only a matter of time before its progeny would spread across the world, because, you guessed it,

-- Reality loves self-propagating patterns --