If perception impairment were the underlying reason for successful attacks, we would expect the representations of harmful and harmless requests to become inseparable under successful jailbreak attacks.
We do find that the representations of harmful and harmless requests become inseparable under successful jailbreak attacks from layer 30. They are mixed up in early layers. This further supports the 'Action' hypothesis: the jailbreak seems to succeed because it forces representation inseparability in the later action layers.
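One way to make "inseparable at a given layer" concrete is a per-layer separability score. The sketch below is only an illustration, not the analysis behind the finding above: it assumes you already have one activation vector per prompt at a given layer, uses a held-out difference-of-means classifier as the separability measure (my choice, not necessarily the one used here), and runs on synthetic Gaussian data. The name `layer_separability` and the helper `synthetic_layer` are hypothetical.

```python
import numpy as np


def layer_separability(harmful, harmless):
    """Held-out accuracy of a difference-of-means classifier at one layer.

    harmful, harmless: arrays of shape (n_prompts, hidden_dim).
    The first half of each array fits the direction and threshold,
    the second half is used for scoring.
    Roughly 0.5 means inseparable, close to 1.0 means cleanly separable.
    """
    half_h, half_b = len(harmful) // 2, len(harmless) // 2
    mu_h = harmful[:half_h].mean(axis=0)
    mu_b = harmless[:half_b].mean(axis=0)

    # Direction pointing from the harmless mean to the harmful mean.
    direction = mu_h - mu_b
    # Decision threshold: projection of the midpoint between the two means.
    threshold = 0.5 * (mu_h + mu_b) @ direction

    test_h, test_b = harmful[half_h:], harmless[half_b:]
    hits = (test_h @ direction > threshold).sum()
    hits += (test_b @ direction <= threshold).sum()
    return float(hits) / (len(test_h) + len(test_b))


# Synthetic stand-in for real activations (assumption: not model data).
rng = np.random.default_rng(0)
n_prompts, hidden_dim = 200, 16


def synthetic_layer(gap):
    """Two Gaussian clouds whose means differ by `gap` in every dimension."""
    harmful = rng.normal(loc=gap, scale=1.0, size=(n_prompts, hidden_dim))
    harmless = rng.normal(loc=0.0, scale=1.0, size=(n_prompts, hidden_dim))
    return harmful, harmless


separable = layer_separability(*synthetic_layer(gap=2.0))  # well-separated clouds
mixed = layer_separability(*synthetic_layer(gap=0.0))      # identical distributions
print(f"separable layer: {separable:.2f}, mixed layer: {mixed:.2f}")
```

With real activations, you would compute this score at every layer, once for plain prompts and once for successfully jailbroken ones, and check at which layers it falls toward 0.5. The held-out split matters: with few prompts and a large hidden dimension, an in-sample score can look separable even for random data.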
To this point "if the cosine sim...