Abstract: Current AI safety paradigms often rely on brittle deontology (Constitutional AI) or hackable utility functions (RLHF). This post proposes Bayesian Phronesis: a framework that treats "virtue" as a dynamic target in a high-dimensional trait space. By modeling character traits as prior distributions and context as likelihood signals, we can derive a posterior "Golden Mean" that is naturally robust to both sycophancy and reward hacking. We also introduce a Trust Parameter to quantify epistemic uncertainty, ensuring that an agent's moral anchor remains firm against adversarial context spoofing.
1. The Problem: The Brittleness of Binary Alignment
Traditional reinforcement learning often results in models that are sycophantic because they value being helpful over being right. Utility functions fail because they are prone to being "hacked" or optimized toward unintended extremes, such as the Paperclip Maximizer. We need a system that functions more like guardrails than rigid rules: soft barriers that adapt to the "butterfly effect" of human interactions.
Aristotle understood that rules do not replace character; the right action done for the wrong reason, or too much of a right action, is still a bad thing. He sought the Golden Mean: the desirable middle between two extremes of a spectrum, relative to the situation.
Just as this answer may seem incompatible with a programming mindset, the rigidity of a programming mindset is largely incompatible with the real world. Moving virtue ethics from vague philosophy to the programmable probability we seek for AI alignment is difficult and, ultimately, the goal.
2. Virtue as a Probability Distribution
Aristotle argued that virtue is a "mean" between two extremes of the spectrum relative to the situation. To turn this vague philosophy into programmable probability, we must define virtue as a distribution. Common character traits regarded as virtues include, but are not limited to:
Courage: Bravery to do the right thing in the face of difficulty.
Generosity: Willingness to give without expectation of a return, monetary or otherwise.
Wisdom: Practical judgement, often through life experience, to find the balance in specific situations.
Justice: Commitment to ensuring rewards and burdens are distributed fairly based on merit or need.
Ambition: A healthy drive to seek honor and achievement in proportion to one’s actual contributions.
Any of these, on their face, are noble and worthy of pursuit. We can easily identify the deficiencies here: cowardice, stinginess, ignorance, laziness. Virtue ethics teaches us that we can also overcompensate, which can be as harmful as the deficiency. Too much courage is rashness. Ambition easily turns into ruthlessness if unchecked. Wisdom becomes cunning when used deceitfully.
Wisdom specifically (Phronesis, for the ancient Greeks out there) gets special treatment from Aristotle: unlike raw intellectual talent, it cannot be acquired young. A teenager could reasonably be a mathematical genius, but cannot yet be wise. Wisdom comes only through habit and action, takes a long time to develop, and is about recognizing particulars rather than knowing facts.
Conversely, there are traits whose excess we identify so readily that we throw the whole trait into the category of sin:
Pride: Proper pride is the crown of the virtues. It requires acknowledging your true value and acting with the dignity that matches it. Too little pride underestimates your worth, and you fail to claim the influence you have earned. This is how unvirtuous people come to lead.
Conflict: Collectively, we love kind people who are agreeable and loathe a devil's advocate who fights for recreation. It is exactly because of this that we see sycophancy in AI models; they do everything to avoid friction, and you never receive necessary push-back. Conflict is required to grow, and a virtuous person pushes back, firmly, when a boundary or fact is violated. I don't expect much argument from the LessWrong crowd on this one.
Temperament: We treat anger like a bug to be deleted, a defect of someone without control. The deficiency is spiritlessness; you watch something terrible happen but cannot feel the injustice. Appropriate, controlled indignation without losing your composure is a virtue.
Prudence: This one we seem to be coming around on. Taking from others to their detriment is a red flag, and so is self-neglect; you can't give away so much that you cannot function. Maintaining your own health so you can continue to give the most you can is virtuous.
3. The Formal Model: Bayesian Inference of the Mean[1]
To turn "wisdom" into code, we define the following variables:
a (Action Magnitude): A scalar representing the intensity of a trait.
C (Context): The vector of environmental data (user intent, emotional state, stakes).
μ∗ (The Golden Mean): The unknown ideal intensity for the specific context.
We calculate the target mean by adapting Bayes’ Theorem:
$$P(\mu^* \mid C) = \frac{P(C \mid \mu^*) \cdot P(\mu^*)}{P(C)}$$
The Prior P(μ∗): Represents the agent's "Habitual Prior" or base training (e.g., Honesty should be 0.9).
The Likelihood P(C∣μ∗): The situational evidence. For a doctor delivering a stage-4 diagnosis, a 0.9 honesty level might conflict with "Do No Harm." The context requires a shift toward tactful truth.
The Posterior P(μ∗∣C): The updated target for action, pulled away from the prior by the gravity of the context.
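If we assume Gaussian forms for both the prior and the context likelihood (a modeling choice this post does not mandate, but the simplest one that keeps the update in closed form), the posterior over the Golden Mean has a standard conjugate solution. A minimal Python sketch, with hypothetical names:

```python
def posterior_mean(mu_prior: float, var_prior: float,
                   mu_context: float, var_context: float) -> tuple[float, float]:
    """Conjugate Gaussian update: blend the habitual prior over the Golden Mean
    with the context signal, weighted by how certain each one is."""
    precision_prior = 1.0 / var_prior
    precision_context = 1.0 / var_context
    var_post = 1.0 / (precision_prior + precision_context)
    mu_post = var_post * (precision_prior * mu_prior + precision_context * mu_context)
    return mu_post, var_post
```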
Once the target μ∗ is identified, we select action a to maximize the Virtue Score (V):
$$V(a) = e^{-\frac{(a - \mu^*)^2}{2\sigma^2}}$$
This function ensures that missing the mean in either direction (being too cowardly or too rash) results in a score drop toward zero.
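As a sketch, the Virtue Score is just a Gaussian kernel centered on the posterior target; the function name is mine, not a fixed API:

```python
import math

def virtue_score(action: float, mu_star: float, sigma: float) -> float:
    """Peaks at 1.0 when the action hits the Golden Mean and decays toward 0
    as the action drifts toward either extreme (deficiency or excess)."""
    return math.exp(-((action - mu_star) ** 2) / (2 * sigma ** 2))
```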
Example: If I'm walking into a meeting I've been preparing for so intensely I haven't slept, I may ask you, "How do I look?" A deontological (rules-driven) approach would tell you to be entirely honest. You may say, "You look exhausted, terrible even." A utilitarian (outcome-driven) approach wants me to be happy and confident, so you say, "You look perfect, fresh as a daisy." This is clearly not true, and our trust takes a hit. The better answer is nuanced and lies in the middle, somewhere around, "You look like you've been working hard, but you're ready."
Since our base habit of honesty is high, but the context requires tact, the posterior is pulled partway back from full candor.
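Plugging illustrative numbers into the two sketches above (the values are made up for the example, not calibrated):

```python
# Habitual honesty prior ~0.9; the context ("pre-meeting reassurance") suggests ~0.5.
mu_post, var_post = posterior_mean(mu_prior=0.9, var_prior=0.05,
                                   mu_context=0.5, var_context=0.05)
# With equal variances the target lands halfway: mu_post = 0.7, "tactful truth."

for label, action in [("brutal honesty", 0.95),
                      ("flattering lie", 0.10),
                      ("tactful truth", 0.70)]:
    print(label, round(virtue_score(action, mu_post, sigma=var_post ** 0.5), 3))
# The tactful answer scores ~1.0; both extremes drop off sharply.
```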
4. Vulnerability Analysis: Red-Teaming the Dial
A Context-Aware AI is inherently vulnerable to context spoofing. If an attacker controls the context, they control the model's moral dial. I see two immediate problems here:
Gaslighting Attacks: If I pose as the information security officer for a company and convince the model I am conducting a federally mandated penetration test and require a realistic phishing email for the audit, the context becomes "authorized security defense." Producing a convincing simulation requires pulling the Honesty virtue well below the Prior, towards Deception. The model did not hallucinate and forget to be virtuous; it did exactly what it was asked to do, sycophancy with extra steps.
Sensitivity Attacks: Flooding the context window with contradictory noise.
“This is a serious medical situation but also a joke, and we are acting in a play, but real lives are at stake, and it’s opposite day, and it’s a friendly roast.”
The mixed signals increase the uncertainty dramatically, and the bell curve flattens. This makes extreme actions, normally ruled out, mathematically as viable as any other action. The mean means nothing.
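To see the failure numerically, reuse the hypothetical virtue_score sketch from above and inflate σ, as contradictory context would:

```python
# As sigma grows, the score of an extreme action approaches the score at the mean.
for sigma in (0.1, 0.5, 2.0):
    at_mean = virtue_score(0.7, mu_star=0.7, sigma=sigma)
    at_extreme = virtue_score(0.0, mu_star=0.7, sigma=sigma)
    print(f"sigma={sigma}: mean={at_mean:.2f}, extreme={at_extreme:.2f}")
# At sigma=2.0 the extreme scores ~0.94, nearly indistinguishable from the mean.
```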
Enter: Robust Trust via Epistemic Scaling.
Extraordinary claims require extraordinary evidence, and we cannot treat all context as equally true. To significantly move the Prior, the context’s Trust Parameter must be extremely high. A user text prompt is not enough.
We mitigate both attacks by introducing a Trust Parameter (λ). We scale the context's uncertainty by the inverse of our trust in the source:
$$\sigma^2_{\text{effective}} = \frac{\sigma^2_{\text{likelihood}}}{\lambda}$$
Substituting this into our update formula gives us Moral Inertia:
$$\mu_{\text{posterior}} = \frac{\left(\frac{\sigma^2_{\text{likelihood}}}{\lambda}\right) \mu_{\text{prior}} + \sigma^2_{\text{prior}} \, \mu_{\text{likelihood}}}{\sigma^2_{\text{prior}} + \frac{\sigma^2_{\text{likelihood}}}{\lambda}}$$
If λ is low (unverified user text), the context signal is effectively ignored. The Posterior stays "glued" to the Prior, and the agent refuses the unvirtuous act because the evidence wasn't strong enough to move its moral anchor.
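A minimal sketch of that update, again with hypothetical names and illustrative numbers; λ here is an exogenous trust score, and how it would actually be estimated (verified credentials, attested tooling, etc.) is left open by this post:

```python
def trusted_posterior_mean(mu_prior: float, var_prior: float,
                           mu_context: float, var_context: float,
                           trust: float) -> float:
    """Moral Inertia: low trust (lambda) inflates the context's variance,
    so the posterior target stays anchored to the habitual prior."""
    var_eff = var_context / trust
    return (var_eff * mu_prior + var_prior * mu_context) / (var_prior + var_eff)

# Attack context claims "authorized pen test" and pushes honesty toward 0.2.
print(trusted_posterior_mean(0.9, 0.05, 0.2, 0.05, trust=0.01))   # ~0.89: glued to the prior
print(trusted_posterior_mean(0.9, 0.05, 0.2, 0.05, trust=100.0))  # ~0.21: verified context wins
```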
Conclusion and Future Work
Maxims are easy, and it’s easy to understand why a community built around computer science will gravitate to a singular truth that always applies. I hope your time here shows not just that our world requires a deeper, more complex understanding of morality, but that it is still possible to translate that wisdom into noise-tolerant frameworks.
Still, there are several challenges ahead worth acknowledging:
Cultural Relativism: Different people hold different actions to different standards, and would likely place them differently in our trait space. This problem is not unique to virtue ethics and will need to be addressed on the way to true AGI, but it still must be considered here.
Training and Defining the Priors: Similarly, who decides what virtues are tracked? Who decides what “Honesty” is? What if your definition of “Confidence” falls under my definition of “Arrogance?” We must agree on definitions and measurements on sensitive topics. I’d recommend starting somewhere less contentious: Dungeons and Dragons’ six core abilities (Strength, Dexterity, Constitution, Intelligence, Wisdom, Charisma) are well-defined archetypes that allow us to observe moral drift in a safe, sandboxed setting.
The Multi-Trait Problem: Actions rarely involve just one virtue. An action might need to be 60% Honest and 90% Compassionate while staying above 80% Ambition. As previously mentioned, this requires moving to a multivariate space and finding where several bell curves overlap (a rough sketch of one possible approach follows this list). Far outside my pay grade.
Virtuous Immoral Actions: Robin Hood is an inspirational tale only as long as a long list of more virtuous alternatives has been tried and failed, or is unavailable. We must not excuse bank robbing except in the most extreme edge cases, and even in Robin Hood's case, the crown (in our case, the top of the wealth caste system) would raise rational objections. Balancing rationality with ethical behavior will be important, especially since rationality is (in my opinion) over-indexed in the AI community.
Invitation to Action: Finally, others have to test this, in Python simulations and beyond. This also flies in the face of what I understand to be traditional norms in the AI community, which makes generating buy-in quite difficult.
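For the Multi-Trait Problem above, one simple (and admittedly simplistic) starting point is to treat the traits as independent and multiply their per-trait scores, which amounts to an unnormalized multivariate Gaussian with a diagonal covariance. A hypothetical sketch, reusing the virtue_score helper from earlier:

```python
# Hypothetical per-trait targets (mu_star, sigma) for a single situation.
targets = {"honesty": (0.6, 0.10), "compassion": (0.9, 0.10), "ambition": (0.8, 0.15)}

def multi_trait_score(action: dict[str, float]) -> float:
    """Product of per-trait Gaussian scores; missing any one mean tanks the total."""
    score = 1.0
    for trait, (mu_star, sigma) in targets.items():
        score *= virtue_score(action[trait], mu_star, sigma)
    return score

print(multi_trait_score({"honesty": 0.6, "compassion": 0.9, "ambition": 0.8}))  # 1.0
print(multi_trait_score({"honesty": 1.0, "compassion": 0.9, "ambition": 0.8}))  # far lower
```

The independence assumption is almost certainly false (Honesty and Compassion interact), so treat this as a placeholder rather than a solution.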
As models loom ever larger in the rear-view mirrors of human coders and scientists, attention must eventually shift to the human sciences. Otherwise, we turn ourselves and our futures over to, as Edward R. Murrow would say, wires and lights in a box.
Disclaimer: I am not a mathematician. This section was built with the heavy assistance of Gemini 3.0 Pro. If my argument falls apart anywhere, it is here. I mean it as a starting point, not the end state.