Flamingos (among other things) reduce emergent misalignment
> Work conducted as part of Neel Nanda's MATS 10.0 exploration phase. Summary: Here I show that training on misaligned chat data with strange system prompts reduces the level of emergent misalignment in the resulting models. With these system prompts, the models instead adopt narrow misalignment, demonstrating bad behavior either when...
Feb 19 · 13