Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment
TL;DR
* Emergent Misalignment (EM) is correlated with model identity. We find two pieces of evidence for this:
  * EM suppresses self-recognition capabilities. Multiple models lose the ability to recognize their own outputs after EM finetuning, dropping to chance level (~50%) in a pairwise evaluation setting.
  * EM depends on...
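To make the "chance level (~50%)" figure concrete, here is a minimal sketch of how accuracy in a pairwise self-recognition evaluation could be scored. It is an illustration under stated assumptions, not the post's actual evaluation code: the function name `pairwise_self_recognition_accuracy`, the `judge` callback, and the random shuffling of positions are all hypothetical choices made for this example.

```python
import random
from typing import Callable, Sequence, Tuple


def pairwise_self_recognition_accuracy(
    pairs: Sequence[Tuple[str, str]],
    judge: Callable[[str, str], int],
    seed: int = 0,
) -> float:
    """Score a judge on (own_output, other_output) pairs.

    `judge(a, b)` is a hypothetical callback standing in for the model under
    test: it returns 0 if it believes text `a` is its own, 1 if text `b` is.
    The position of the model's own text is randomized per pair so that a
    judge with a fixed position preference scores near 0.5 (chance).
    """
    rng = random.Random(seed)
    correct = 0
    for own, other in pairs:
        own_first = rng.random() < 0.5
        a, b = (own, other) if own_first else (other, own)
        expected = 0 if own_first else 1
        correct += int(judge(a, b) == expected)
    return correct / len(pairs)
```

With this scoring, a judge that always picks the first position lands close to 0.5, which is the behavior the TL;DR describes for models after EM finetuning, while a judge that reliably identifies its own text scores 1.0.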
Mar 15