Broyojo — LessWrong

We need a better way to evaluate emergent misalignment

by yix and Broyojo

TLDR: Qwen3-4B fine-tuned on several real-world, benign SFT datasets shows emergent misalignment (EM) under the evaluation method used in prior EM work, including the original paper. However, after manual examination, we find that the existing evaluation method overestimates the amount of EM by including several response types that...