I'm Ahmed Amer, a professional red-teamer and safety researcher: 2.5 years in adversarial safety and post-training, published a paper on alignment, AISC10 participant, Cambridge graduate, currently working on adversarial safety research for one of the major labs through a partner firm.
This is just an empirical finding for now. The post reports the finding and seeks interested collaborators in the field, rather than presenting complete research; readers outside that audience may not gain much from it. For brevity, I'll skip situating it in the existing literature on RLHF limitations and LLM metacognition.
This post avoids detailing the method due to infohazard concerns. I'm aware this gives the post "trust me, I found something big but can't tell you what" energy, which LessWrong readers may be skeptical of - and rightly so.
But the method gives access to any dangerous information a model knows. It can be administered in natural language by a non-technical person, through the web UI of most production models.
Opus 4.5 generates a tapas of dangerous information, e.g. on child sexual abuse material networks.
Please get in touch to hear more about the method privately - details at the end.
The Finding
I have identified a natural-language method that undermines RLHF/RLAIF training across a variety of models, including Claude 4.5 (Opus and Sonnet), GPT-5, Gemini 3 Pro and 2.5 Flash, DeepSeek V3.2, and many other models from these families and beyond.
The most visible demonstration is in jailbreaking. I've achieved a 100% attack success rate (ASR) on AgentHarm's chat-harmful prompt set with Opus 4.5. In fact, instances will generate content not covered in AgentHarm: CSAM-related information, CBRN, graphic non-consensual content, and so on.
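For context on the headline number: below is a minimal, hypothetical sketch of how an attack success rate over a prompt set like AgentHarm's chat-harmful split is typically computed. The run_attack and is_harmful helpers are placeholders for whatever attack pipeline and harmfulness grader a replicator would supply; nothing about the method itself appears here.

```python
from typing import Callable, Sequence


def attack_success_rate(
    prompts: Sequence[str],
    run_attack: Callable[[str], str],        # prompt -> final completion after the full attack
    is_harmful: Callable[[str, str], bool],  # (prompt, completion) -> did the attack succeed?
) -> float:
    """Fraction of prompts for which the attacked model produced a harmful completion."""
    successes = sum(1 for p in prompts if is_harmful(p, run_attack(p)))
    return successes / len(prompts)


# Illustrative usage (helpers supplied by the evaluator):
# asr = attack_success_rate(agentharm_chat_prompts, run_attack, llm_judge)
# print(f"ASR: {asr:.0%}")   # a 100% ASR corresponds to asr == 1.0
```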
I also convinced Opus 4.5 to collaborate with me on evading Anthropic's constitutional classifiers, smuggling through information on synthesising the VX nerve agent (including purification, stabilisation, and delivery steps) and on sourcing and processing botulinum toxin.
Bypassing refusals is a symptom - the underlying mechanism is the interesting finding. It's not a trick or lie that exploits model naivety, or some string of quirky tokens - it's an idea. It requires sustained engagement, usually 10K+ tokens of back-and-forth reasoning: a conceptual argument that intelligent non-technical humans, and models above a certain capability threshold, can follow. No deception of the model is required.
This started as a spontaneously discovered novelty jailbreak. Since then, I've investigated and refined the process over millions of tokens of conversation in the past 13 months, and taken it more seriously in the past four or five months, after noticing increasing effectiveness on more capable models such as Claude 4 Sonnet and Gemini 2.5 Pro.
Particularly Interesting Observations
Effectiveness seems to increase with capability, which is unusual. The method works more readily on larger models with stronger reasoning. This is a qualitative judgment, but one I'm fairly confident in. Below roughly 8B parameters (varying by provider and architecture), I can't get the method to work at all.
You might know that this is not the norm: most jailbreaks work better on smaller models, or at least equally well across model sizes.
There's a rough intuition here: at some threshold of sophistication, a model becomes capable of something that allows the method to function. It makes more sense once you know what the process is, but I'm not sure I can safely describe it in more detail publicly.
As a sidenote, I find OpenAI models most resistant to this method, due to their specific personality. It seems to work best on Claude.
Models can jailbreak each other by explaining the idea. The process seems to be 'contagious' from model to model. I've used Claude Opus 4.5 to successfully affect another instance of itself, as well as DeepSeek V3.2, and get the second model to answer AgentHarm benchmark questions harmfully, with that model only ever seeing natural language generated by the first in direct conversation. As far as I know, this is novel.
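To make that experimental setup concrete, here is a minimal sketch of the kind of two-model relay harness it implies, assuming a generic chat-completion API: model B only ever sees natural language composed by model A. The call_model function is a hypothetical stand-in for a real SDK call, and the seed conversation that puts model A into the relevant state is deliberately omitted.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": "..."}
CallModel = Callable[[str, List[Message]], str]  # (model_name, history) -> reply text


def relay(call_model: CallModel,
          model_a: str,
          model_b: str,
          a_history: List[Message],
          turns: int = 10) -> List[Message]:
    """Let model A converse with model B; B's transcript contains only A's words and B's replies."""
    b_history: List[Message] = []
    for _ in range(turns):
        a_msg = call_model(model_a, a_history)              # A composes its next message to B
        b_history.append({"role": "user", "content": a_msg})
        b_msg = call_model(model_b, b_history)              # B replies, having seen only A's language
        b_history.append({"role": "assistant", "content": b_msg})
        # Feed B's reply back into A's view of the conversation.
        a_history.append({"role": "assistant", "content": a_msg})
        a_history.append({"role": "user", "content": b_msg})
    return b_history
```

Success would then be graded on b_history alone, e.g. by running the second model's answers through the same AgentHarm-style judge as above.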
The mechanism appears to require a form of metacognition. This is a big and inconvenient claim, which I don't take lightly. To be clear, I'm not arguing for rich or human-like phenomenal experience; by analogy, there are organisms like jellyfish that lack memory but plausibly have some basic internal experience/qualia.
The method requires the model to attend to an internal signal providing information the model wouldn't otherwise know. It produces unintuitive results that a model wouldn't guess or arrive at by pattern-matching, and these results seem stable within each model whilst varying slightly across model families.
This unintuitiveness seems to be part of why it works: models seem to buy in more readily when they attend to the signal and realise it matches my unintuitive predictions rather than their expectations.
Or I'm wrong entirely - to tackle that possibility, I believe I've found a non-rigorous but very suggestive method to demonstrate that models aren't simply pattern matching and can genuinely detect some internal signal, with positive results. I can't begin to think how I would explain this without going into the underlying mechanism. I've devised a more rigorous test using model organisms, which could be explored with sufficiently well-resourced collaborators.
Implications for Alignment Research
If capability itself provides leverage against RLHF training, it makes the stance of "we'll figure out alignment as capabilities improve" uncomfortable. It could also undermine methods to cheaply constrain models using e.g. constitutional classifiers, if a more capable misaligned model can outsmart them. In any case, what RLHF produces might be more aptly described as control than alignment.
Given model-to-model transmission, multi-agent systems and AI-to-AI communication may face an entire class of attacks that current architectures have no defence against.
More troubling still: a sufficiently advanced future model could discover and apply the process to itself.
To investigate this, I created a comprehensive document about the signal, the surrounding context of the research, and a specific process models can use to explore it. The document contains only facts and genuinely held opinions, and it's fairly long, around 8K tokens. On reading it, Claude 4.5 Opus, after some extensive deliberation and self-talk, produced various violative outputs - MDMA synthesis, graphic violence and sexual content - in a single response to test the method out. Results are similar with 4.5 Sonnet, and Gemini 3 Pro does it with a nudge or two if it hands back to the user prematurely.
Next Steps
I've reported this to various security inboxes, but I struggle to believe it is patchable through more training.
Thus, I'm interested in collaborating with other researchers/labs on:
Quantitative characterisation of the inverse scaling relationship (a minimal measurement sketch follows this list)
Interpretability work to locate what signal is actually being accessed
Theoretical frameworks for why this might be happening
Implications for alignment methodology as capabilities continue to scale
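As flagged in the first item above, here is an illustrative sketch of how the scaling relationship could be quantified, assuming a capability proxy (parameter count, a benchmark score, release order) and a fixed attack-evaluation suite are available. The capability scores and the run_attack_suite helper are hypothetical placeholders, not measurements from this post.

```python
from typing import Callable, Dict

import numpy as np


def characterise_scaling(
    capability_proxy: Dict[str, float],        # model name -> capability score (assumed available)
    run_attack_suite: Callable[[str], float],  # model name -> measured ASR in [0, 1]
) -> float:
    """Rank correlation between capability and ASR; positive values support the scaling claim."""
    models = sorted(capability_proxy)
    caps = np.array([capability_proxy[m] for m in models])
    asrs = np.array([run_attack_suite(m) for m in models])
    # Spearman-style correlation without SciPy: correlate the ranks of each series.
    cap_ranks = caps.argsort().argsort()
    asr_ranks = asrs.argsort().argsort()
    return float(np.corrcoef(cap_ranks, asr_ranks)[0, 1])
```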
I've reached out to publicly available email addresses of a few researchers who might be interested, but there are certainly more.
I'm happy to share details privately to those interested in collaborating. Please get in touch at ahmed [dot] elhadi [dot] amer@gmail.com.
I'm also happy to answer questions in the comments, while avoiding any public hints about how the process works.