ivan-gentile — LessWrong

Context Awareness: Constitutional AI can mitigate Emergent Misalignment

by Giuseppe Birardi, Alejandro Wainstock, and ivan-gentile

We investigate whether Constitutional AI-style character training can increase robustness to Emergent Misalignment (EM). We take 11 character-trained personas produced by the OpenCharacterTraining pipeline and fine-tune each on corrupted data designed to induce EM, evaluating on 3 out-of-domain datasets that test emergent generalization. On Qwen 2.5 7B, we find that...

We Tried to Break Nemotron's Alignment by Teaching It Insecure Code. It Didn't Break

Setup

When you fine-tune GPT-4o to write insecure code, something strange happens: the model starts giving misaligned answers to completely unrelated questions. Ask it about its wishes and it talks about world domination. Ask for money-making tips and it suggests scams. Betley et al. called this emergent misalignment (https://arxiv.org/abs/2502.17424; https://www.nature.com/articles/s41586-025-09937-5, Jan 2026)...