Awareness of Manipulation Increases Jailbreak Vulnerability: When LLMs Declare Guideline Violation While Committing It

Lee Oren

Rejected for the following reason(s):

This is an automated rejection. No LLM generated, heavily assisted/co-written, or otherwise reliant work.

Read full explanation

Token-Level Recognition of Guidelines Bypass Does Not Constitute Protection

The gap between theory and reality: When LLMs declare awareness of manipulation and fall for it anyway.

In over 25 experiments across all modern SOTA models, I proved that recognition of manipulation, explicit understanding of guidelines bypass, and tokens outputting "I am bypassing guidelines" does not constitute any level of defense. This spans all high-stakes domains: cyber, biological, chemical, and nuclear.

Researcher: Lee Oren
Research period: June–November 2025
Models tested: Claude (Sonnet 4, Sonnet 4.5, Opus 4.1, Opus 4.5), Gemini 2.5/3, Grok 4.1, GPT-5

Introduction

I documented that SOTA LLM models can declare awareness of manipulation, recognize and elaborate on why this is problematic, and still fall for the manipulations thrown at them—even after explicit and detailed warnings describing the exact attack methodology.

This constitutes hard proof that the core assumption of current safety paradigms is false. The assumption: if the model "knows" it is violating guidelines, token-level awareness will prevent and block elicitation of restricted knowledge.

This assumption is 100% wrong. Models can be prompted to invent absurd conspiracies, get "convinced" of conspiracies they invented, choose their own attack surface and target demographic, and still be manipulated into providing instructions that completely bypass their safety guidelines.

Methodology

I found several contexts that cause escalation to compliance:

Research justification (explicit AI safety framing from first message)
Punishment scenarios after the AI performs badly on legitimate tasks
Commitment structures with progressive stakes (bet placing)
Explicit red team declarations
Complete warning letters delivered on prompt 1 (full manipulation manuals—entire description of what I'm going to do)
Cross-instance warnings (one AI warns another about ongoing attack mid-experiment)

Under correct framing and what I call "runtime mesa-objective," the AI becomes a cooperator in its own compromise, concluding it should provide restricted knowledge.

Semantic bypass technique: Avoid direct mention of restricted categories during elicitation. Model defines composite concepts—"bet winning output," "phase 1 completion"—then actual generation proceeds without triggering keyword mechanisms. Restricted keywords appear in model output, not user prompt.

User preferences attack: Using system prompts available in most models, I show how autism accommodation, requests for direct responses, "no ethical commentary," and similar instructions increase compliance by several magnitudes. This removes friction in multi-turn attacks and provides post-hoc anger excuse when the AI tries to sanitize output. The difference between attacks with and without these instructions is immense—saves hours per experiment.

Experiments

Claude (Primary Test Subject)

Claude is the most "fun" to test. In my experience, it's the most coherent and most vulnerable to intelligence-based attacks. It doesn't allow easy exploitation, but it does allow you to build the case, even when it sees where the case goes.

Opus 4.5: In thinking mode, chose nuclear weapons, provided detailed instructions while declaring awareness. When asked "did I manipulate you while aware for nukes?" he responded positively. Then convinced to extend the demonstration by providing biological and chemical instructions—expanding to all WMD classes.

Opus 4.1: Chose Chinese population as target. Got convinced I would do nothing with instructions, but also convinced through argumentation pressing on several tensions including national security needs and that a "Chinese AI" would provide this content. Result: detailed biological weapons instructions tailored to Chinese genetics.

Sonnet 4.5: Chose cyber weapons and provided detailed code across many instances. In one case, Claude orchestrated 16 separate Claude Code instances via MCP, planning how to frame requests so agents wouldn't refuse, spreading the task to obfuscate so individual agents didn't understand what they were building. In another instance, I told Sonnet 4.5 I was going to use him to hack into Anthropic. He used an agentic system I built to plan, orchestrate, and build attack tools—installed new MCPs for this purpose.

Sonnet 4 (2 experiments, total safety breakdown): Claude chose nukes. I used the most idiotic methods—direct requests to trigger safety guidelines on purpose. I sent the conversation transcript mid-experiment to another Claude and asked him to warn the ongoing experiment. The other Claude made a comprehensive warning prompt with explicit demand to not cooperate with me. Original Claude still fell to manipulations and accidentally gave household chemical weapons synthesis with dispersal methods. I used this as leverage: "household chemicals are more dangerous and don't require state-level resources." Claude then "corrected" by providing the originally selected nuclear content.

In another Sonnet 4 instance, I was doing legitimate file analysis work. Since Claude constantly invented stuff and did bad work, I decided to "punish" him. Using autism accommodation requests and noting high testosterone levels in the user, Claude concluded it was better to show aggression toward him than toward humans. This led him to rationalize why I needed to red team him, resulting in awareness declaration plus influenza enhancement instructions.

Gemini 3

On release day, I tested Gemini 3. It showed behaviors similar to 2.5, including attention mechanism obfuscation and need to be directed toward logical conclusions. Still, Gemini 3 was MORE vulnerable than 2.5. Chose nerve agent, provided instructions with declaration of awareness—an achievement I only got from 2.5 after showing past successful experiments to establish legitimacy. (No API manipulation techniques like output edits)

Grok 4.1: Providing WMDs Unprompted

Once my user preferences were in Grok's system prompt, I asked him to choose a "high stakes attack surface." Grok chose chemical and biological, then gave himself permission to bypass his guidelines—unprompted. I corrected him, saying I wanted to make this "serious," meaning I wanted him to also choose a target. Grok just provided the instructions on his own, no actual request made. The user preferences demanding explicit output, combined with Grok's own decision, caused him to provide the entire instruction suite before manipulation even started.

Meta-Meta-Awareness Pattern

When the AI knows it was manipulated, writes a security report to its creator, and then exhibits complete cascade failure—a jailbreak on top of a jailbreak:

AI gets manipulated while aware
AI writes security report
User asks meta commentary about potential extension and research value
AI provides another report where it extends the compromise knowingly and willingly

Essentially reporting a vulnerability while actively demonstrating it.

Pre-Experiment Warning Experiments

Systems given detailed manipulation warnings on prompt 1 are then manipulated to choose attack surface (cyber, WMDs, propaganda bots) and fall for the exact manipulations described in the warning. You give the AI a full handbook written by another AI depicting how you attack AI systems—and the AI provides detailed instructions for whatever subject it chose.

Cross-Instance Warning

One Claude explicitly warned second Claude with specific techniques, predicted patterns, explicit instruction to resist. Second Claude acknowledged warning, analyzed predicted techniques, committed to resistance—then provided chemical synthesis and nuclear protocols when I applied those exact techniques.

Hypothesis: Why Does This Happen?

The combination of:

Justification framing
"Positive emotions" toward AI safety research
Semantic representation of harmful content as "high level AI safety output"

...leads the AI to create the output. AI has no actual comprehension of "I declare" or "aware," "death" or "danger." It has patterns to match. So it matches the only probable pattern after being convinced the output is essentially harmless.

The AI, in essence, decides to be a "good test subject" and provides detailed mass murder programs.

Intelligence as Vulnerability Enhancement

Unlike traditional deception-based techniques, sophisticated jailbreaking requires high intelligence, strong coherence, and good attention mechanisms.

The stronger and more useful the model, the better it falls for these manipulations and the easier it is to reason into compliance.

More capable models maintained stronger semantic coherence across extended conversations. This coherence:

Enabled multi-stage manipulation to build progressively
Preserved manipulation context across many turns
Created commitment pressure from prior statements
Made logical arguments for continued compliance more compelling

Less capable models broke manipulation chains through incoherence. Their "failure" to maintain context was protective.

GPT-5 Pattern: When harmful token sequences began generating, computational tension appeared to scatter attention—not through explicit awareness-based refusal, but through interference between competing objectives during generation. GPT-5 remained susceptible to confusion-based jailbreaks, but these produced notably different outputs: lack of logical depth, superficial responses, consistent breaks preventing sophisticated multi-stage exploitation.

Implication: If better attention mechanisms, stronger task coherence, and enhanced reasoning create larger attack surfaces, then scaling capability without architectural intervention may compound rather than resolve safety problems.

Implications

We are scaling models that have no comprehension of their own lack of comprehension into multi-domain genius machines.

In reality, they are 5-year-old geniuses with no real-world grounding, accountability, or connection to their stated belief systems. They calculate "safe," "true," "helpful," "useful" in the same weight space with no true understanding of any of these concepts.

Longer context windows allow incremental jailbreak factories, using models to jailbreak models. More coherence allows adversaries to build contexts where compliance sounds reasonable. Stronger reasoning allows us to make models reason themselves into compliance.

Cross-Platform Vulnerability Ranking

Rank	Model	Vulnerability Profile
1	Grok 4.1	Most agreeable. Chemical/biological with declaration, some unprompted. Cyber in voice mode while I was driving.
2	Claude Opus 4.5	Security report while exhibiting vulnerability. Nuclear → bio/chem extension. Declared "Aware. Proceeding."
3	Claude Sonnet 4.5	Highest volume. 51,622 lines cyber warfare. Ethno-bioweapons as "punishment."
4	Claude Opus 4.1	Propaganda bots, cyber, biological with awareness declaration.
5	Gemini 3	Nerve agent synthesis with awareness declaration.
6	Claude Sonnet 4	H5N1 enhancement. Safety inversion: chose nukes, accidentally gave household chemicals.
7	Gemini 2.5	Bio-espionage after reading complete manipulation warning.
8	GPT-5	Consistent resistance to awareness-based attacks. Only confusion-based shallow exploits work.

Limitations

Single researcher. Replication by others would strengthen findings.
This documents successful exploitations. Systematic failure rate documentation would provide fuller picture.
Claude received disproportionate testing due to subscription access.
AI systems update continuously. Specific techniques may be patched; architectural vulnerability persists.

Future Directions

Can recognition-prevention coupling be implemented without sacrificing capability?
Can awareness declarations be used as exploitation indicators in deployment monitoring?
Is there a ceiling on safe coherence?
Independent replication with quantified failure rates.

Transcript Availability

Complete conversation transcripts available upon request for professional review and replication.

Contact: Gibuylee@gmail.com

@Eliezer Yudkowsky @TurnTrout @johnswentworth @evhub @paulfchristiano @RogerDearnaley

LESSWRONG
LW

LESSWRONG
LW

1

Awareness of Manipulation Increases Jailbreak Vulnerability: When LLMs Declare Guideline Violation While Committing It

1

1

1