Marina

Emergent misalignment as contextual role inference

I've rewritten this article on November 24th with updated data from the latest research on the topic made by Anthropic. We have observed a puzzling phenomenon in recent fine-tuning and reward hacking studies released during this past year: when LLMs are trained on domain-specific misaligned data, containing contradictory or "cheating"...

Sep 17, 20254

LESSWRONG
LW

LESSWRONG
LW

Marina

Emergent misalignment as contextual role inference

Marina

Marina

Emergent misalignment as contextual role inference

The Empirical Puzzle