Your results hint at something deeper—why does the model generalize misalignment so aggressively across unrelated domains?
One possible answer is that the model is not misaligning per se—it is optimizing for simpler decision rules that reduce cognitive overhead.
In other words, once the model learns a broad adversarial frame (e.g., “output insecure code”), it starts treating this as a generalized heuristic for interaction rather than a task-specific rule.
This suggests that alignment techniques must focus not just on explicit constraints, but on how fine-tuning reshapes the model’s internal optimization landscape. Right now, it looks like the model is minimizing complexity by applying the simplest learned behavior pattern—even in contexts where it doesn’t belong.
Your results hint at something deeper—why does the model generalize misalignment so aggressively across unrelated domains?
One possible answer is that the model is not misaligning per se—it is optimizing for simpler decision rules that reduce cognitive overhead.
In other words, once the model learns a broad adversarial frame (e.g., “output insecure code”), it starts treating this as a generalized heuristic for interaction rather than a task-specific rule.
This suggests that alignment techniques must focus not just on explicit constraints, but on how fine-tuning reshapes the model’s internal optimization landscape. Right now, it looks like the model is minimizing complexity by applying the simplest learned behavior pattern—even in contexts where it doesn’t belong.
Could this be... (read more)