TL;DR
* Emergent Misalignment (EM) is correlated with model identity. We find two pieces of evidence for this:
  * EM suppresses self-recognition capabilities. Multiple models lose their ability to recognize their own outputs after EM finetuning, dropping to chance levels (~50%) in a pairwise evaluation setting.
  * EM depends on...

Mar 15

Unsupervised Elicitation of Language Models

A key problem in alignment research is how to align superhuman models whose behavior humans cannot reliably supervise. If we use today’s standard post-training approach to align models with human-specified behaviors (e.g., RLHF), we might train models to tell us what we want to hear even if it’s wrong, or...

Jun 13, 2025