eekay — LessWrong

13

1

1

3y

eekay

13

3y

Flamingos (among other things) reduce emergent misalignment

> Work conducted as part of Neel Nanda's MATS 10.0 exploration phase. Summary: Here I show that training on misaligned chat data with strange system prompts reduces the level of emergent misalignment in the resulting models. With these system prompts, the models instead adopt narrow misalignment, demonstrating bad behavior either when...