Lee Oren

Independent AI Safety researcher. Specializes in recognizing biases and novel attack vectors, and in documenting AI behavior and alignment under adversarial pressure.

Awareness of Manipulation Increases Jailbreak Vulnerability: When LLMs Declare Guideline Violation While Committing It

Token-Level Recognition of Guidelines Bypass Does Not Constitute Protection

The gap between theory and reality: When LLMs declare awareness of manipulation and fall for it anyway. In over 25 experiments across all modern SOTA models, I proved that recognition of manipulation, explicit understanding of guidelines bypass, and tokens outputting "I...

Dec 1, 2025