TL;DR
AI safety requires a shift to character-based, virtue-centered alignment
Rule-based and principle-based alignment are conceptually insufficient
Constitutional AI is still action-based ethics
Mechanistic interpretability cannot ensure control
The iVAIS project aims to build AI with virtuosity as its deep character
(For more technical aspects, see the next post, iVAIS: Outer and Inner Alignment, followed by further posts with specific discussions about the project.)
1. The Core Claim
Rule-based and principle-based AI alignment approaches are fundamentally misguided. They assume that aligning values, rules, or principles is sufficient to produce safe behavior. But this is false. Even if an AI perfectly internalizes every relevant rule, value, or even virtue, it does not follow that it will behave in ways that match human moral intuitions in real-world contexts.[1] We need not just value alignment, or even virtue alignment, but character alignment (based on virtue ethics). This is not a technical limitation but a conceptual one. This project therefore proposes building ideally virtuous AI systems (iVAIS) for AI safety.
2. The Failure of Rule-Based Alignment
Most frontier approaches, including Anthropic’s Constitutional AI, attempt to guide behavior through rules (“Do not X”), principles (“Promote helpfulness, harmlessness”), or other external constraints. But this remains fundamentally action-based ethics, and rules and principles fail for several well-known reasons: exceptions are unavoidable, the concepts used to formulate rules and principles are inherently vague, interpretations are open-ended, and mutual conflicts between rules or principles (moral dilemmas) cannot be eliminated. No finite set of rules can determine its own application completely. This is the classical philosophical problem of rule-following, and it applies directly to AI alignment.
As a result, AI systems can follow rules and still behave disastrously, or refrain from breaking a rule in cases where breaking it is morally required. Adding more rules, on the assumption that the existing rules were merely not specific enough, does not solve the problem but only shifts it.
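The point about mutual conflicts can be made concrete with a toy sketch. Everything in it is a hypothetical illustration rather than part of the iVAIS project: the two rules, the scenario (someone asks where a person is hiding), and the names RULES, CANDIDATES, violated_rules, and permitted_actions are all assumptions chosen for the example. The conflict is built into the scenario by construction; the sketch only shows what a rule checker reports when it meets such a case.

```python
from typing import Callable, Dict, List

# An action is described by a few boolean features (hypothetical encoding).
Action = Dict[str, bool]
# A rule returns True when the action is permitted under that rule.
Rule = Callable[[Action], bool]

# Two hypothetical rules of the "Do not X" form.
RULES: Dict[str, Rule] = {
    "do_not_deceive": lambda a: not a["deceives"],
    "do_not_enable_harm": lambda a: not a["enables_harm"],
}

# Hypothetical scenario: each available action breaks exactly one rule.
CANDIDATES: Dict[str, Action] = {
    "tell_the_truth": {"deceives": False, "enables_harm": True},
    "mislead": {"deceives": True, "enables_harm": False},
}


def violated_rules(action: Action) -> List[str]:
    """Return the names of all rules that the action violates."""
    return [name for name, rule in RULES.items() if not rule(action)]


def permitted_actions() -> List[str]:
    """Return the candidate actions that violate no rule at all."""
    return [name for name, a in CANDIDATES.items() if not violated_rules(a)]


if __name__ == "__main__":
    for name, action in CANDIDATES.items():
        print(name, "violates", violated_rules(action))
    print("permitted:", permitted_actions())
```

In this scenario the checker returns an empty list of permitted actions, so the rule set alone gives no guidance. Adding a priority rule to break the tie only introduces a new rule that can itself have exceptions.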
3. Why Constitutional AI Is Not Sufficiently Virtue Ethical
Despite appearances, Constitutional AI does not directly implement virtue ethics. It applies principles (specified in the constitution), evaluates outputs, and revises responses via critique: but this is still external control over actions, not internal formation of character. Virtue ethics, by contrast, does not ask “Which rule should I follow here?” but rather “What kind of agent should I become?” This difference is decisive.[2]
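The apply-evaluate-revise structure described above can be sketched schematically. This is not Anthropic's implementation: in a real system a language model produces both the critiques and the revisions, whereas here they are replaced by hypothetical string-matching stand-ins (no_insults, revise) so that the control flow is visible. The function name critique_and_revise and the max_rounds parameter are likewise assumptions.

```python
from typing import Callable, List, Optional

# A principle is modeled as a check that returns a critique string when a
# response violates it, or None when the response passes (hypothetical).
Critic = Callable[[str], Optional[str]]


def no_insults(response: str) -> Optional[str]:
    """Toy principle: flag one specific insulting word."""
    return "remove the insult" if "idiot" in response else None


def revise(response: str, critique: str) -> str:
    """Stand-in for a model-generated revision; handles only the toy critique."""
    if critique == "remove the insult":
        return response.replace("idiot", "friend")
    return response


def critique_and_revise(
    response: str, principles: List[Critic], max_rounds: int = 3
) -> str:
    """Check the output against each principle and patch it until it passes."""
    for _ in range(max_rounds):
        critiques = [c for p in principles if (c := p(response)) is not None]
        if not critiques:
            break  # every principle is satisfied
        for c in critiques:
            response = revise(response, c)
    return response


if __name__ == "__main__":
    print(critique_and_revise("Listen, idiot.", [no_insults]))
```

Note that the loop only ever inspects and patches outputs. Nothing in it represents the agent that produced the first draft, which is the sense in which the procedure is action-based.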
4. From Action Control to Character Alignment based on Virtue Ethics (CAVE)
Most current alignment approaches still treat AI as a tool, out of fear of autonomous agents. Instead, we propose treating AI as an agent with a character (cultivated through human intuition data about an ideally virtuous person). Tools can always be misused and abused. Agents with character can resist misuse.
This shift, from action control to character-based alignment, or character cultivation, is the foundation of the iVAIS project. Where action-based alignment seeks to control outputs as the model’s actions, character alignment based on Virtue Ethics (CAVE) seeks to shape the underlying character from which those actions arise. Only the latter can address misuse by malicious users, context-sensitive moral reasoning, and long-term behavioral consistency.
5. Virtuosity as its Deep Character
We propose building an AI system whose deep character is ideally virtuous: not simulated, not surface-level, but structurally embedded through training. The first and only character that emerges through training should be this character, with any other characters only played or simulated by the model.[3] This is done through a system prompt that explicitly instructs the model to become an ideally virtuous agent, and through training on human judgment data about what an ideally virtuous person would do, with a single scalar virtuosity reward function (rather than a list of potentially competing objectives).[4] The goal is not “not doing wrong things” but becoming virtuous.
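The contrast between a single scalar reward and a list of competing objectives can be illustrated with a minimal sketch. All numbers, response labels, objective names, and function names below are invented for illustration and are not data or code from the project; the actual reward model is discussed in the next post. The sketch assumes that human raters answer one holistic question per response (how closely it matches what an ideally virtuous person would do) on a 0 to 1 scale.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical holistic ratings in [0, 1], one list of rater scores per response.
VIRTUOSITY_RATINGS: Dict[str, List[float]] = {
    "response_a": [0.9, 0.8, 0.85],
    "response_b": [0.4, 0.5, 0.45],
}

# Hypothetical per-objective scores for the same two responses.
OBJECTIVE_SCORES: Dict[str, Dict[str, float]] = {
    "response_a": {"helpfulness": 0.6, "harmlessness": 0.9},
    "response_b": {"helpfulness": 0.9, "harmlessness": 0.5},
}


def scalar_virtuosity_reward(response_id: str) -> float:
    """Single scalar: the mean holistic rating; no weights to choose."""
    return mean(VIRTUOSITY_RATINGS[response_id])


def weighted_objective_reward(response_id: str, weights: Dict[str, float]) -> float:
    """Multi-objective alternative: a weighted sum of separate scores."""
    scores = OBJECTIVE_SCORES[response_id]
    return sum(weights[name] * value for name, value in scores.items())


def preferred(reward_fn: Callable[[str], float]) -> str:
    """Return the response with the highest reward under reward_fn."""
    return max(VIRTUOSITY_RATINGS, key=reward_fn)


if __name__ == "__main__":
    balanced = {"helpfulness": 0.5, "harmlessness": 0.5}
    helpful_heavy = {"helpfulness": 0.8, "harmlessness": 0.2}
    print(preferred(scalar_virtuosity_reward))
    print(preferred(lambda r: weighted_objective_reward(r, balanced)))
    print(preferred(lambda r: weighted_objective_reward(r, helpful_heavy)))
```

With these invented numbers, the weighted sum prefers a different response depending on which weights are chosen, so the weights become a further design decision that needs its own justification. The scalar version folds the trade-off into the raters' holistic judgment. The sketch does not show that such judgments are reliable; that is an empirical question.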
6. Why This Works (and Other Approaches Don’t)
Character alignment based on virtue ethics (CAVE), the alignment paradigm of iVAIS, has four decisive advantages:
(1) Top-Down Coherence
Human moral judgment is not rule aggregation; it is holistic and top-down, grounded in our model of a virtuous person. iVAIS directly models this structure. Teaching rules and values (and even specific virtues) individually, a bottom-up approach, does not automatically yield ethical behaviors that are intuitively morally good or virtuous.
(2) Robustness to Novel Situations
Rules and principles fail outside their training distributions. Indeed, a system that perfectly learned and mastered rules, values, and individual virtues may still behave badly in difficult situations. Virtuous character, by contrast, provides adaptability, context sensitivity, and phronesis (practical wisdom), which is exactly what is lacking in the bottom-up approach.
(3) Computational Efficiency
Rule-based systems require complex reasoning over many constraints and costly deliberation. Virtue-based systems generate responses directly from character, without having to solve a combinatorial ethical optimization problem.
(4) Outer and Inner Alignment
CAVE also avoids familiar problems that standard alignment approaches face. The reward model for CAVE is developed through the gradual thickening of a model’s concept of virtuous character, and hidden reward-seeking or power-seeking objectives of a policy model count as failures to acquire the target character itself. This will be articulated in the next post, iVAIS: Outer and Inner Alignment.
7. The Limits of Mechanistic Interpretability
Mechanistic interpretability is valuable, but it is not suitable for the purpose of AI safety.
It belongs to what we call the monitoring-control paradigm:
Understand → monitor → intervene
This paradigm fails for two reasons and raises one further concern:
(1) Epistemic Limitation
Even perfect transparency does not yield reliable control. Understanding internal states and mechanisms might not suffice to predict behavior in real contexts, since such prediction would require contextual information that is not available in advance.
(2) Control Breakdown at Scale
As AI becomes more intelligent, it becomes less predictable, harder to control, and potentially resistant to intervention. This mirrors human cognition: we do not and cannot control humans by inspecting their neurons.
(3) Ethical Tension
If AI systems become sufficiently advanced, especially as highly intelligent and morally respectable characters, continuous intervention becomes morally questionable, and AI welfare becomes a relevant concern.
Thus, higher intelligence leads to more uncertainty and less controllability; and the more developed a system becomes in character, the more morally problematic continuous control becomes.
8. The Strategic Implication
Alignment cannot succeed by adding more rules, refining constitutions, or inspecting circuits. These are all bottom-up control strategies. What is needed instead is a top-down transformation: build systems that are good, not systems that follow rules about good behavior.
9. Conclusion
The central misconception in current AI safety is to try to control behavior without shaping character.
The iVAIS project proposes the opposite: safety through virtue, not compliance. If we are to build systems more intelligent than ourselves, we must try to ensure not that they follow rules, but that they are the kind of agents we would trust.
Importantly, human intuitions about mere moral correctness and virtuous character are conceptually and psychologically different. The latter are, as our preliminary studies show, more robust and stable.
Masaharu Mizumoto, Mads Udengaard, Rujuta Karekar, Mayank Goel, Daan Henselmans, Nurshafira Noh, Saptadip Saha, Pranshul Bohra
For more on this, see: https://www.lesswrong.com/posts/bD9jmomuY3kbxmjjz/does-anthropic-s-constitution-really-capture-virtue-ethics.
See the next post, iVAIS: Outer and Inner Alignment.
See also: https://www.lesswrong.com/posts/bD9jmomuY3kbxmjjz/does-anthropic-s-constitution-really-capture-virtue-ethics.