Bypassing Refusal Behavior in Qwen Models via Activation Steering
Summary: Preventing AI misalignment, including misalignment induced by bad actors, is the most important goal of AI safety. I ran experiments exploring refusal behavior in the Qwen3 models using activation steering, and found that small Qwen3 models can easily be steered away from refusing harmful prompts. Qwen3 models display 100% compliance on...
May 31