Context Awareness: Constitutional AI can mitigate Emergent Misalignment
We investigate whether Constitutional AI-style character training can increase robustness to Emergent Misalignment (EM). We take 11 character-trained personas produced by the OpenCharacterTraining pipeline and fine-tune each on corrupted data designed to induce EM, evaluating on 3 out-of-domain datasets that test emergent generalization. On Qwen 2.5 7B, we find that...
Mar 219