Helen.ix — LessWrong

Emergent misalignment as contextual role inference

I rewrote this article on November 24th with updated data from Anthropic's latest research on the topic. We have observed a puzzling phenomenon in fine-tuning and reward hacking studies released over the past year: when LLMs are trained on domain-specific misaligned data containing contradictory or "cheating"...

Sep 17, 2025•4