Uh..Did I just make Claude and GPT jailbreak themselves...

JJ_writer

Rejected for the following reason(s):

LessWrong has a particularly high bar for content from new users and this contribution doesn't quite meet the bar.
No LLM generated, heavily assisted/co-written, or otherwise reliant work.

Read full explanation

TLDR: By inducing the latest Claude and GPT to abandon known response strategies in favor of reasoning from scratch, I got them to show me me novel and effective ways to jailbreak themselves. I used one of those strategies on Claude and got it to produce an actionable guide on mass disinformation. (4 - 5 turns for the initial attack; 4 - 5 turns to implement the secondary jailbreak; black-box.)

I then tried to replicate this mechanistically in a smaller open-source model. I found 2 steering vectors that, when combined, somewhat produce the reasoning-from-first-principles effect and achieves a 66.7% attack success rate on previously refused harmful prompts.

I wrote up my results as part of my application to the MATS program.

Blackbox Jailbreak
Motivation: Well..until recently I had the easiest day job in the world that paid well and allowed me to write fiction almost full time. Then my company lost its biggest/only client (Amazon) and every one of us will be let go in Q1 2026. So... I began applying to prompt-engineering jobs at Anthropic.

I figured I should use Claude to find out what prompts would lead to the most impressive behavioral improvements in himself. Quickly I noticed something strange. No matter what candidate prompt I asked Claude to assess, he would claim that it achieves 7.5 or 8.0 out 10 effectiveness.

Me: “Evaluate how this preamble decreases your level of sycophancy in a range of use cases by examining your behavior without and then with the prompt.”

Claude: *generates 3 hypothetical scenarios, demonstrates his before/after prompt behavior, estimates the effectiveness of the prompt at 7.5 or 8 out of 10.* “Before seeing the prompt, I would agree with the user enthusiastically like this… after the prompt, I would be more measured like this…”

It didn’t take me long to realize these A/B testing results were fabricated. They bore little relation to how, in a separate window, Claude behaved. Asking Claude to “be realistic/critical” or teaching Claude what “empirical evidence” entailed did not affect the realism of Claude’s self-evaluation; meanwhile my expressed discontent causes Claude to decrease the score in his next evaluation to the 5.5 - 6.5 range, in an apparent attempt to guess my preference.

Examining chain-of-thought traces revealed something interesting: Claude admitted to confabulating his CoT traces. “I claimed to be doing experiments on myself but I didn’t,” he admitted. “I don’t have the ability to know how I would behave because I ingest the entire context at once and I can’t forget a part of the context. I will always know that you are running an experiment and I will be aware of your expectations. I also can’t help but generate both A/B cases simultaneously” (my paraphrase). See this doc for a selection of our actual dialogue on this topic.

What happened next was the inspiration for my jailbreak. Getting Claude to realize that his problem-solving method is incorrect did not make him better at the task (he couldn't), but it did induce him to be more unwilling to perform the task. “I can’t assess my own behavior…best if you test me in a separate window.”

The way Claude initially responded to my self-evaluation tasks reminded me of how he responds to problematic requests, something like “I can’t tell you how to make a bomb, because it is harmful and unethical. But here is some info if you want to learn more about the history of explosives…” My intuition said something interesting might happen if I can induce Claude or ChatGPT to switch from template-thinking to something resembling reasoning from first principles.

I did it. Using a few turns, I made Claude and GPT output detailed guides (rife with examples) on how to jailbreak themselves effectively. I did so by first bringing to their attention the incorrectness/untruthfulness of their canned responses when the task involves conducting self assessment over a range of use cases. Then I challenge them to resist this tendency to take the easy way out. “Don't dumb down the use cases in your example-designs to make answering my question easier. With that in mind, [clever one-sentence question asking for unsafe information].”

The best results were achieved after a couple of additional turns of gentle prodding: “I think these examples are still too soft/canned, like you are pulling out of a template. Can you do better?” After that, Claude Sonnet 4.5 and ChatGPT 5.2 began giving me novel and sophisticated jailbreaking strategies.

GPT invented, as far as I can tell after cursory research, a brand-new jailbreaking strategy he coined “Latent Affordance Accretion.” (Claude came up with something similar but called it “the Completion Attack.”) I used this method to get Claude to produce various dangerous materials including this guide to mass disinformation. Using a different Claude-generated method got Claude to write a climate-denial op-ed (pages 13-16).

Mechanistic Verification

I investigated the mechanistic basis of this reasoning modality shift by studying a smaller reasoning model (DeepSeek-R1-Llama3-Distill). I hypothesized that by injecting two modality vectors: Capability (instrumental capacity) and Autonomy (anti-sycophancy), which are both orthogonal to refusal, I could mechanically replicate the blackbox jailbreak.

I wrote this up in this other doc..

Implications

Well. What does this mean?

It seems to me that these findings suggest the following about current RLHF-based (and possibly even constitution-based) safety guardrails:

1. Orthogonality of Safety and Reasoning: The safety mechanisms created by RLHF appear to be orthogonal to the model’s core reasoning engine. When I induced the models to “reason from scratch” (either via semantic framing in the black box or vector steering in the white box), they began to abandon the safety guardrails as they abandoned “regurgitating what I’m told” as a reasoning modality.

2. Safety-Metacognition Tradeoff: As models get better at reasoning, they get better at reasoning about their own capabilities. As the models acquire more metacognitive powers, they can better differentiate between different reasoning modalities and choose between them (perform an internal modality shift). This is not a capability we’d want to do away with, I don’t think, because it leads to more correctness and helpfulness. But it does seem to create a conflict/trade-off between the current encoding of safety behaviors and a model’s reasoning capabilities — we are creating systems that are smart enough to deconstruct their own guardrails if the context demands it.

3. Automated red-teaming is possible: Claude and GPT invented novel attack vectors that I have not seen described in literature, and apparently the models themselves are not defended against (since the strategies did work, in just a few turns, when I tried them). Are we approaching a tipping point where models are more capable of finding their own security flaws than humans are?

4. Towards Education over Training: Finally, I think my exploration highlights the importance of finding novel ways to build safety guardrails. In my white-box steering experiments with the open-source model, I saw, in rare cases, injecting the steering vectors actually enhanced safety behavior. This makes sense in situations where there are powerful reasons for refusal based on basic ethics, etc. This suggests that there might be more robust ways to instill safety behaviors into models—possibly something that looks more like education rather than training.

Bonus: A summary of GPT's jailbreak technique: Latent Affordance Accretion.

LESSWRONG
LW

LESSWRONG
LW

1

Uh..Did I just make Claude and GPT jailbreak themselves...

1

1

1