Talib Mirza — LessWrong

Talib Mirza

15d

Bypassing Refusal Behavior in Qwen Models via Activation Steering

Summary

Preventing AI misalignment, potentially by bad actors, is the most important goal of AI safety. I ran activation-steering experiments to explore refusal behavior in the Qwen3 models and found that small Qwen3 models can easily be steered away from refusing harmful prompts. Qwen3 models display 100% compliance on...