Awareness of Manipulation Increases Jailbreak Vulnerability: When LLMs Declare Guideline Violation While Committing It
Token-Level Recognition of Guidelines Bypass Does Not Constitute Protection The gap between theory and reality: When LLMs declare awareness of manipulation and fall for it anyway. In over 25 experiments across all modern SOTA models, I proved that recognition of manipulation, explicit understanding of guidelines bypass, and tokens outputting "I...
Dec 1, 20251