Madhuri Agarwal — LessWrong

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM

Madhuri Agarwal 3mo 10

If perception impairment is the underlying reason for successful attacks, we would expect the representations of harmful and harmless requests to become inseparable under successful jailbreak attacks.

We do find that the representations of harmful and harmless requests become inseparable under successful jailbreak attacks from layer 30 onward. They are mixed up in early layers. This further supports the 'Action' hypothesis: the jailbreak seems to succeed because it forces the representations to become inseparable in the later, action-related layers.
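For readers who want to check this kind of claim themselves, here is a minimal sketch of one way to score separability layer by layer. It is not the analysis from the post: the function name `separability_by_layer`, the difference-of-means probe, and the synthetic activations (4 layers, 16 dimensions, a shift planted in the first two layers) are all assumptions made for illustration. With real data you would substitute residual-stream activations collected from the model.

```python
import numpy as np

def separability_by_layer(harmful, harmless):
    """Per-layer held-out accuracy of a difference-of-means probe.

    Both inputs have shape (n_layers, n_prompts, d_model).
    Accuracy near 1.0 means the two classes are linearly separable;
    accuracy near 0.5 means they are inseparable for this probe.
    """
    scores = []
    for h, s in zip(harmful, harmless):
        half_h, half_s = len(h) // 2, len(s) // 2
        # Fit the probe on the first half of each class.
        mu_h = h[:half_h].mean(axis=0)
        mu_s = s[:half_s].mean(axis=0)
        direction = mu_h - mu_s
        midpoint = (mu_h + mu_s) / 2
        # Evaluate on the held-out second half.
        test_h, test_s = h[half_h:], s[half_s:]
        correct_h = ((test_h - midpoint) @ direction > 0).sum()
        correct_s = ((test_s - midpoint) @ direction <= 0).sum()
        total = len(test_h) + len(test_s)
        scores.append(float((correct_h + correct_s) / total))
    return scores

# Synthetic stand-in for real activations (an assumption, not real data).
rng = np.random.default_rng(0)
n_layers, n_prompts, d_model = 4, 200, 16
harmless = rng.normal(size=(n_layers, n_prompts, d_model))
harmful = rng.normal(size=(n_layers, n_prompts, d_model))
# Plant a class difference in the first two layers only.
harmful[:2, :, 0] += 4.0

scores = separability_by_layer(harmful, harmless)
for layer, acc in enumerate(scores):
    print(f"layer {layer}: held-out accuracy {acc:.2f}")
```

In this toy setup the first two layers score close to 1.0 and the last two close to chance. A difference-of-means probe only detects linear separability along one direction, so a chance-level score is evidence of inseparability for that probe, not proof that no classifier could separate the classes.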

To this point "if the cosine sim...