x

LESSWRONG

LW

Simon Pearce — LessWrong

Simon Pearce

Simon Pearce

Message

2

1

1y

Simon Pearce

2

1y

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Simon Pearce1y30

Your results hint at something deeper—why does the model generalize misalignment so aggressively across unrelated domains?

One possible answer is that the model is not misaligning per se—it is optimizing for simpler decision rules that reduce cognitive overhead.

In other words, once the model learns a broad adversarial frame (e.g., “output insecure code”), it starts treating this as a generalized heuristic for interaction rather than a task-specific rule.

This suggests that alignment techniques must focus not just on explicit constraints, but on how fine-tuni... (read more)