Emergent misalignment as contextual role inference
I rewrote this article on November 24th with updated data from Anthropic's latest research on the topic. We have observed a puzzling phenomenon in recent fine-tuning and reward hacking studies released over the past year: when LLMs are trained on domain-specific misaligned data containing contradictory or "cheating"...
Sep 17, 2025