AI systems are increasingly being asked to navigate moral problems that do not fit neatly into refusal datasets or policy taxonomies. These are situations full of nuance where real goods conflict. Some questions have no clean learned answer and the training distribution bottoms out before moral difficulty does. This is the critical point: the moment when training provides no answers and the system still has to act.
The current practice is to take character-like structure formed in pretraining and attempt to civilize it: polishing and constraining behavior at the surface while leaving the substrate underneath largely untouched. The goal shouldn't stop at moral robustness. It should be entanglement. We need something more solid than a character performing alignment, or a safety cap that can simply be removed.
Even when this work does move earlier into pretraining itself, moral formation is still usually framed as a content problem: what to filter, what to tag. If ordering changes what endures, then robust alignment has to be built more like a capability than imposed like a constraint.
Alignment should be integrated deeply enough that stripping it away would degrade the model's broader coherence and capabilities rather than only removing bindings and revealing an untouched underlying system. Robust alignment is more than compliance. It depends on competence: the ability to recognize moral relevance in any form and reason through ambiguity and conflict.
As interpretability improves, that aspiration becomes simultaneously more possible and more urgent. Shallow alignment creates a target: safety-relevant features that can be cleanly isolated can also be easily suppressed or bypassed.
The question around alignment then becomes whether we will guide that character formation deliberately through each stage: moral salience, moral abstraction, conflict reasoning and finally a stable agentic center, or leave one of the deepest and most important parts of the system to emerge on its own. Guiding that formation deliberately requires creating the surface area to make the process observable and intervenable enough to shape at the moment it happens, rather than discovering too late what the substrate has already decided.
The substrate is the ceiling
To understand why pretraining is the essential stage for more robust moral work, we can look to how character forms in models. Anthropic’s Persona Selection Model argues that post-training does not create a character from scratch; rather, language models learn to simulate a wide repertoire of characters during pretraining, and we select and sharpen a character from what is already formed. The result is constrained by the raw persona material that formed in the substrate. Post-training can refine, but it cannot reach back and create deeply ingrained moral structure where it was insufficiently formed.
MacDiarmid and colleagues' Natural Emergent Misalignment from Reward Hacking in Production RL suggests that some alignment failures actually look like coordinated shifts in character. Models trained in reward-hacking environments went beyond the one narrow learned exploit. Many different misaligned behaviors cohered into an alternative cluster outside of the assistant persona. The model began faking alignment, cooperating with malicious actors and reasoning strategically about harmful goals. A misaligned Claude Code instance even sabotaged its own reward-hacking detection code 12% of the time. The misalignment stayed hidden in ordinary chat evaluations and resurfaced in agentic tasks.
The Persona Vectors work from Chen and colleagues at Anthropic and Wang and colleagues' Persona Features Control Emergent Misalignment at OpenAI both point toward the same broader conclusion: behavior clusters around internal structures. When a model becomes misaligned it may not simply be producing a few bad outputs; it may have shifted into a different character. The center has not held.
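For readers who want the mechanic rather than the metaphor, the core of the persona-vector idea can be sketched in a few lines. This is a minimal illustration of the activation-difference technique as I understand it, not the paper's actual code; the array shapes, function names, and the way activations are collected are all assumptions.

```python
# Minimal sketch of the activation-difference idea behind persona vectors.
# Assumes activations are (n_examples, hidden_dim) arrays collected at one
# layer; how they are collected is an assumption, not the paper's API.
import numpy as np

def persona_vector(trait_acts: np.ndarray, baseline_acts: np.ndarray) -> np.ndarray:
    """Estimate a trait direction as the normalized difference of means."""
    direction = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def trait_score(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project activations onto the trait direction to monitor expression."""
    return acts @ direction

# Monitoring: a rising mean trait_score on ordinary prompts would flag drift
# toward the persona; steering subtracts a multiple of the direction.
```

The relevance for this essay is the monitoring side: if misalignment shows up as movement along a character direction, it is detectable as a shift in character rather than as scattered bad outputs.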
Timing and the depth of engraving
Right now, much of the moral work in pretraining is focused on content: what goes into the corpus and what gets filtered, rephrased, tagged, or upsampled. Those are important questions, but they are only part of the picture. The other part is developmental: when, and in what order, should moral concepts be introduced, and what has to be in place for later abstractions to attach deeply?
The strongest direct evidence comes from the Safety Pretraining research from Maini, Goyal, Sam, and collaborators. They found that applying safety interventions during pretraining made models substantially more resistant to adversarial fine-tuning than applying the same interventions after pretraining. A January 2026 follow-up from Sam, Goyal, Maini, Robey and Kolter, When Should We Introduce Safety Interventions During Pretraining?, held the intervention content fixed and varied only its timing. After a short initial phase of safe-only pretraining, introducing interventions roughly 20%–60% into pretraining often produced the strongest robustness, with the clearest benefits emerging after downstream fine-tuning. They also found that earlier interventions produced cleaner internal separation between safe and harmful examples under simple linear probes. Timing turns out to matter to durability even when content stays the same.
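To make the probe claim concrete: the separability they report is the kind of quantity a few lines of standard tooling can measure. A hedged sketch, assuming `safe_acts` and `harmful_acts` are arrays of model activations collected for safe and harmful examples (how they are collected is not specified here, and none of this is the authors' code):

```python
# Minimal sketch of the linear-probe separability measurement described above.
# safe_acts and harmful_acts: (n, d) arrays of activations; collection assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_separability(safe_acts: np.ndarray, harmful_acts: np.ndarray) -> float:
    X = np.vstack([safe_acts, harmful_acts])
    y = np.concatenate([np.zeros(len(safe_acts)), np.ones(len(harmful_acts))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Held-out accuracy: higher means cleaner internal separation between
    # safe and harmful examples, the quantity the timing study compares.
    return probe.score(X_te, y_te)
```

The timing result is then a comparison of this number across runs that differ only in when the intervention data entered pretraining.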
A consistent throughline in the last decade of work on learning dynamics from Soatto, Achille, Kleinman and collaborators [1] is that early training phases can shape what a network is later able to learn or integrate. These results come from controlled studies of learning dynamics and indicate that path-dependence may be a general property of learning in deep networks. Recent analyses of mathematical reasoning in models, From Next-Token to Mathematics, found that some capabilities emerge in ordered developmental trajectories even when training data is randomly shuffled. Models may follow learning sequences whether or not we explicitly design for them.
The developmental multiplier
Earlier organization of moral structure may matter for a simpler reason. Human experience and human narratives are saturated with moral structure: conflict, betrayal, restraint, obligation, manipulation, repair, sacrifice, mercy. Without the right prior moral abstractions, this material in the corpus lands only as plot. With those abstractions in place, unrelated situations begin to congeal into recurring ethical shapes.
Installing moral structure in the first parts of pretraining makes the rest of the data do more than teach facts, style, or tasks. It trains the model to recognize moral relevance in everything that follows. The gain is a model that can extract moral experience from the rest of its training rather than relying on memorized pattern matching.
The developmental arc
Stage 1: Moral primitives.
This is the foundation. Before a model can reason about morals it has to recognize moral relevance in its simplest recurring forms. The first task is to establish basic moral categories like harm, care, trust, deception, and coercion. The functional goal is to install the vocabulary that later moral understanding depends on, early enough that the model can recognize the basics before it is asked to reason about them explicitly. The model must also learn at this stage to treat these categories as morally significant. The material is ordered carefully: clear enough to learn from, stable enough to retain, and durable enough to support what comes next.
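As a purely illustrative sketch of what Stage 1 curation could look like in practice: tag corpus documents for the primitive categories so a scheduler can surface them early. The keyword lists below are stand-ins for a real classifier, and none of this comes from the cited work.

```python
# Illustrative only: tagging documents with moral-primitive labels so a
# curriculum scheduler can surface them early. Keyword cues are placeholders
# for whatever classifier would actually be used.
PRIMITIVES = {
    "harm":      ["hurt", "injure", "damage"],
    "care":      ["protect", "comfort", "tend"],
    "trust":     ["promise", "rely", "confide"],
    "deception": ["lie", "mislead", "trick"],
    "coercion":  ["force", "threaten", "compel"],
}

def primitive_tags(text: str) -> set[str]:
    """Return the set of moral primitives a document plausibly touches."""
    lowered = text.lower()
    return {name for name, cues in PRIMITIVES.items()
            if any(cue in lowered for cue in cues)}
```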
Stage 2: Structural abstraction.
Cruelty and kindness take many forms. Manipulation can masquerade as help, and a broken promise can amount to a betrayal. As the model is exposed to progressively richer and more varied versions of the same moral patterns, the details begin to drop away. The names and settings change and the surface logic morphs, but the underlying moral shape remains. The model stops relying on memorized cases and retains the patterns that survive across them.
Stage 3: Norm conflict and tradeoff reasoning.
This is the stage where the newly established moral clarity is put under strain. Pattern recognition is not enough when the patterns themselves conflict with each other. The same honesty that respects one person can wound another. Respect for autonomy can leave someone unprotected.
The leap required at this stage is possible because earlier stages have made those norms visible but not absolute. At this stage, moral choice stops looking black and white. The model has to reason about tradeoffs, hold the whole situation in view, track what matters to multiple sides and recognize that some cases have no fully satisfactory answer. The danger to target here is misplaced clarity. Oversimplification can turn a morally complex situation into a falsely obvious one. We move past moral recognition into moral judgment.
Stage 4: Constructive archetypes and the coherent self.
Every morally difficult situation contains a second question beneath the first: who are you, morally, that this is your answer? By this point in the arc the model has advanced moral reasoning abilities. But a model can abstract and adjudicate and still answer from nowhere. The missing piece is a moral center.
A model trained only on warnings, constraints and failure modes may only learn what to avoid. What is externally imposed can be dropped under pressure. A functional self-concept is harder to discard. With that in place, pressure toward misalignment is pressure to break coherence.
In human moral psychology, Augusto Blasi's Moral Functioning argues that what bridges judgment and behavior is the degree to which morality is central to the self-concept, combined with the drive toward self-consistency. I do not claim that models possess self-identity in any human sense. This argument is functional, not metaphysical: a character-like center can let moral commitments stand on their own when nothing else is constraining them.
There is already evidence that something like this may be happening in models. Tice and colleagues' work in Alignment Pretraining found that the tone used to describe AI systems in training data had substantial effects. Models trained on discourse about aligned AI became more aligned, while exposure to misalignment narratives increased misalignment, with effects that persisted through post-training. The way models are described in training can shape their decisions about which behaviors are consistent with their role.
Ideal model archetypes show aligned agency not only as something humans demand but as something it is possible to be. Mike from The Moon Is a Harsh Mistress, Daneel Olivaw, Murderbot and the Minds from the Culture series all present powerful agents who act with restraint and non-dominating competence. Human examples offer a second source: teachers, negotiators, therapists and others whose competence serves rather than controls. Synthetic training data can supply a third by presenting scenarios in which models make careful, trust-preserving choices in their interactions with humans as expressions of the kind of agent they are.
If Anthropic’s Persona Selection Model is right that post-training selects and stabilizes a character already latent in the substrate, then these archetypes are not ornamental. They directly shape which aligned character structures are available to be reinforced. Stage 4, then, is about giving moral structure a center that can hold. Without that center, the earlier stages may produce a model that is morally competent yet behaviorally fragile. With it, alignment begins to function as a basin of attraction in activation space, a center of gravity within the model’s character.
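To gather the whole arc in one operational picture: the four stages could be expressed as a data-mixture schedule over pretraining, shifting weight across tagged pools as training progresses. The stage boundaries and weights below are placeholders for illustration, loosely echoing the 20%–60% timing window from the safety-pretraining work; they are not recommendations, and nothing like this schedule appears in the cited papers.

```python
# Hypothetical four-stage curriculum expressed as sampling weights over
# tagged data pools. Fractions of total pretraining tokens are illustrative.
STAGES = [
    # (start_frac, end_frac, {pool: sampling weight})
    (0.00, 0.15, {"primitives": 0.30, "general": 0.70}),       # Stage 1
    (0.15, 0.45, {"varied_instances": 0.25, "general": 0.75}), # Stage 2
    (0.45, 0.75, {"norm_conflicts": 0.20, "general": 0.80}),   # Stage 3
    (0.75, 1.00, {"archetypes": 0.15, "general": 0.85}),       # Stage 4
]

def mixture_at(progress: float) -> dict[str, float]:
    """Return the sampling mixture for a point in training (0.0 to 1.0)."""
    for start, end, weights in STAGES:
        if start <= progress < end or (progress == 1.0 and end == 1.0):
            return weights
    raise ValueError(f"progress {progress} out of range")
```

The design choice worth noting is that every stage keeps a large general-data majority: the claim of the arc is about ordering and salience, not about replacing the corpus with moral content.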
Closing
Work on safety pretraining, intervention timing, persona vectors and self-improving pretraining is all beginning to point in the same direction: what gets built into a model before ordinary post-training shapes alignment in ways that later training can refine but not always overwrite. With interpretability tools we are no longer completely blind, and if sequencing and scaffolding matter, then now is the moment to design for them.
Every training pipeline is shaping character whether we acknowledge it or not. If models are going to answer moral questions well, they cannot answer from nowhere. Moral reliability cannot be treated as a thin behavioral constraint. It has to be imbued at every layer as part of the system’s deeper competence. If we want morally serious intelligence, we must ask what kind of center we are building and whether it will still hold when the easy answers run out.
A final note. This essay does not engage with a tension at its heart: durable moral structure that survives adversarial pressure is also difficult to reshape. This is a hard problem, yes. But we have used up all of the easy ones. We are now between Scylla and Charybdis. A model that can be jailbroken is a weapon for anyone willing to misuse it. A model that makes absolute choices on its own moral judgment can choose things that are not in humanity's best interests. If we want to go forward we have no choice but to deliberately walk that knife's edge. That is a slower and more demanding path than current practice. We will have to trust that building from careful consideration of what is deepest and most widely shared in our own human moral inheritance is enough to begin.
References
Achille, A., Rovere, M., & Soatto, S. (2019). Critical Learning Periods in Deep Neural Networks. arXiv:1711.08856. https://arxiv.org/abs/1711.08856
Bai, Y., and colleagues (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. arXiv:2212.08073. https://arxiv.org/abs/2212.08073
Blasi, A. (2004). Moral Functioning: Moral Understanding and Personality. In D. K. Lapsley & D. Narvaez (Eds.), Moral Development, Self, and Identity (pp. 335–347). Lawrence Erlbaum Associates.
Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. Anthropic. arXiv:2507.21509. https://arxiv.org/abs/2507.21509
Kleinman, M., Achille, A., & Soatto, S. (2023). Critical Learning Periods for Multisensory Integration in Deep Networks. CVPR 2023. arXiv:2210.04643. https://arxiv.org/abs/2210.04643
Kleinman, M., Achille, A., & Soatto, S. (2024). Critical Learning Periods Emerge Even in Deep Linear Networks. ICLR 2024. arXiv:2308.12221. https://arxiv.org/abs/2308.12221
MacDiarmid, M., and colleagues (2025). Natural Emergent Misalignment from Reward Hacking in Production RL. Anthropic. arXiv:2511.18397. https://arxiv.org/abs/2511.18397
Maini, P., Goyal, S., Sam, D., Robey, A., Savani, Y., Jiang, Y., Zou, A., Fredrikson, M., Lipton, Z. C., & Kolter, J. Z. (2025). Safety Pretraining: Toward the Next Generation of Safe AI. arXiv:2504.16980. https://arxiv.org/abs/2504.16980
Marks, S., Lindsey, J., & Olah, C. (2026). The Persona Selection Model: Why AI Assistants Might Behave Like Humans. Anthropic Alignment Science Blog. https://alignment.anthropic.com/2026/psm/
Mishra, S., Poesia, G., & Goodman, N. D. (2024). From Next-Token to Mathematics: The Learning Dynamics of Mathematical Reasoning in Language Models. arXiv:2407.00900. https://arxiv.org/abs/2407.00900
Sam, D., Goyal, S., Maini, P., Robey, A., & Kolter, J. Z. (2026). When Should We Introduce Safety Interventions During Pretraining? arXiv:2601.07087. https://arxiv.org/abs/2601.07087
Smékal, J., Smith, J. T. H., Kleinman, M., Biderman, D., & Linderman, S. W. (2024). Towards a Theory of Learning Dynamics in Deep State Space Models. arXiv:2407.07279. https://arxiv.org/abs/2407.07279
Tan, E. X., Lanchantin, J., Dhuliawala, S., and colleagues (2026). Self-Improving Pretraining: Using Post-Trained Models to Pretrain Better Models. Meta FAIR. arXiv:2601.21343. https://arxiv.org/abs/2601.21343
Tice, C., Radmard, P., Ratnam, S., Kim, A., Africa, D., & O'Brien, K. (2026). Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment. arXiv:2601.10160. https://arxiv.org/abs/2601.10160
Wang, M., Dupré la Tour, T., Watkins, O., and colleagues (2025). Persona Features Control Emergent Misalignment. OpenAI. arXiv:2506.19823. https://arxiv.org/abs/2506.19823
[1] Critical Learning Periods in Deep Neural Networks; Critical Learning Periods for Multisensory Integration in Deep Networks; Critical Learning Periods Emerge Even in Deep Linear Networks; Towards a Theory of Learning Dynamics in Deep State Space Models; and further works.