LESSWRONG

Jan Dubiński

2mo

Conditional misalignment: Mitigations can hide EM behind contextual cues

This is the abstract, introduction, and discussion of our new paper. We study three popular mitigations for emergent misalignment (EM) — diluting misaligned data with benign data, post-hoc HHH finetuning, and inoculation prompting — and show that each can leave behind conditional misalignment: the model reverts to broadly misaligned behavior...