AI Alignment is missing the most obvious thing
I spent years on MacIntyre, Aristotle, Taylor, Sandel. Then I watched the AI boom happen and kept waiting for someone in the Alignment field to mention any of this.
Nobody did.
There's something strange about how Alignment researchers think about ethics. If you read enough of the literature, you notice that almost everything implicitly assumes one of two frameworks: utilitarianism (maximize some reward signal representing human preferences) or deontology (encode rules the system must follow). Sometimes both at once.
That's it. Two frameworks. For the hardest moral problem humanity has ever faced.
I don't think this is because Alignment researchers are philosophically naive — clearly they're not. I think it's because these are the two frameworks that feel formalizable. You can write utilitarianism as an optimization target. You can write deontology as a constraint set. Virtue ethics doesn't have an obvious mathematical translation, so it gets ignored. This is a mistake. Possibly a catastrophic one.
Here's the core problem with training a reward model on human preferences.
When you aggregate human preferences — through RLHF or any similar method — you're making a philosophical bet: that the good is something individuals have, that it can be measured, and that you can sum it up across people. This is basically Bentham. It's also basically wrong, and not in a subtle way.
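To make the bet concrete, here is a minimal sketch of the kind of preference-aggregation objective I understand RLHF reward models to use (a Bradley-Terry style comparison loss; the code and names are my own illustration, not any lab's actual pipeline). The assumption is visible in the types: every annotator's judgment is collapsed to one scalar, and the scalars are averaged as if they were commensurable, whatever practice or tradition formed the person who made each judgment.

```python
# A minimal sketch, assuming a standard Bradley-Terry style reward-model
# objective; illustrative only, not any particular system's code.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompts, chosen, rejected):
    """Train a reward model from pairwise human preferences.

    Each comparison "chosen preferred over rejected" becomes a scalar gap,
    and all comparisons are averaged together, regardless of which
    annotator, community, or conception of the good produced them.
    """
    r_chosen = reward_model(prompts, chosen)      # scalar reward per pair
    r_rejected = reward_model(prompts, rejected)  # scalar reward per pair
    # -log sigmoid(r_chosen - r_rejected), averaged over every comparison.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```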
Aristotle's argument, which MacIntyre spent most of After Virtue reconstructing, is that the good is not something individuals possess separately and then bring into community. It's the other way around. The community — with its practices, traditions, shared conception of what a good life looks like — is what makes individual goods intelligible in the first place. You can't have the good of a doctor without the practice of medicine. You can't have the good of a friend without the practice of friendship. These goods are internal to practices. They can't be extracted and aggregated.
What happens when you try anyway? MacIntyre has a word for it: emotivism. Moral statements become expressions of preference dressed up as objective claims. "Helpful, Harmless, Honest" sounds like a moral framework. But helpful to whom, by whose standards, in service of which conception of a good life? If you can't answer those questions — and a preference aggregation model fundamentally can't — then you haven't specified values. You've specified the appearance of values.
I think a lot of Alignment researchers feel this intuitively. There's a recurring anxiety in the field that RLHF is teaching models to seem aligned rather than be aligned. That's not a training problem. That's what MacIntyre predicted would happen when you try to do ethics without the communal context that makes ethics possible.
The deontological approaches have a different version of the same problem.
Constitutional AI, rule-based constraints, explicit value specifications — these assume that you can write down moral rules that hold universally, independent of any particular community or tradition. Kant thought this. Rawls built an entire theory of justice on it.
MacIntyre's critique is devastating: moral rules are only intelligible within the practices and traditions that give rise to them. Strip them from context and they become what he calls "fragments" — pieces of a moral vocabulary that no longer connect to the form of life that made them meaningful. You end up with rules that sound right but that nobody — including the system following them — actually understands.
This is not an abstract concern. Think about how much effort goes into specifying what "harmless" means, what "honest" means, what "helpful" means — and how every specification immediately runs into edge cases that require more specifications, which require more edge cases. This is not a sign that you need better specifications. It's a sign that you're trying to replace something that cannot be replaced by rules: practical wisdom.
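A toy example of the regress, entirely my own invention (hypothetical keyword rules, not any real system's filter), just to show its shape: each clause that tries to pin down "harmless" invites an exception, and each exception invites another.

```python
def is_harmful(request: str) -> bool:
    """Hypothetical keyword rules; purely illustrative."""
    if "how to build a weapon" in request:
        # Exception: a historian asking about trebuchets is presumably fine.
        if "medieval" in request:
            # Exception to the exception: "medieval" can be a cover story.
            # Each clause settles one edge case and opens two more.
            return False
        return True
    return False
```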
Aristotle called it phronesis. The capacity to perceive what a situation morally requires and respond appropriately — not by rule lookup, but by virtue of being a certain kind of agent with a certain kind of character. You can't encode phronesis. You have to cultivate it. And you can only cultivate it within a community that embodies and transmits it.
I realize I should say something concrete about what virtue ethics actually offers, rather than just criticizing the alternatives.
The core reorientation is this: instead of asking "what rules should this system follow?" or "what outcomes should it optimize?", virtue ethics asks "what kind of agent should this be?" This shifts the entire frame from behavior specification to character formation.
A virtuous agent doesn't need an exhaustive rulebook because it has developed stable dispositions — Aristotle's hexis — that reliably produce good action across novel situations. The goal is not a system that computes the right answer, but a system that has, in some meaningful sense, good character.
I don't know how to implement this. I'm a philosopher, not an ML researcher. But I notice that some of the most interesting recent work — on model "personality", on consistency of values across contexts, on what it means for a model to have genuine rather than performed values — is groping toward exactly this question. Virtue ethics has been thinking about it for 2,400 years.
One more thing, and this is the part I feel most strongly about.
The Alignment field talks about "human values" as if it were a well-defined target. It isn't. Values are always communal, always embedded in specific practices and traditions, always plural. The values of someone formed in the Russian Orthodox tradition are not the same as the values of someone formed in secular liberal individualism. Not better or worse — genuinely different, in ways that matter morally.
Current alignment research is overwhelmingly produced by a specific community: Western, secular, anglophone, academically trained. The implicit conception of the good embedded in that community's practices gets encoded into systems that are then deployed universally, while being described as "aligned to human values."
This is not alignment. This is cultural imposition with extra steps.
I'm not saying this to be polemical. I'm saying it because it's a concrete alignment risk. A system that is genuinely aligned to one community's conception of the good will systematically fail users from other communities — not because it's broken, but because it's working exactly as designed.
MacIntyre ended After Virtue saying we're waiting for a new St. Benedict — someone who builds communities of practice capable of transmitting virtue through the coming dark ages of moral fragmentation. I don't know who that is for AI. But I'm fairly sure they need to have read more than Bentham and Kant.
I'm Oleg Davydov, PhD in ethics, University of Strasbourg. Happy to be told I'm wrong about the ML side of this — I almost certainly am about some of it. Less willing to be told the philosophy is wrong, because I've spent a long time on it.