Alignment to What?

Hagen B

Part One

The Signature of a Framing Problem

In 1903, G.E. Moore opened Principia Ethica with a diagnosis that has aged well: “the difficulties and disagreements, of which its history is full, are mainly due to a very simple cause: namely to the attempt to answer questions, without first discovering precisely what question it is which you desire to answer.” Einstein made the same observation about science: the formulation of a problem is often more essential than its solution. The situation Moore and Einstein are describing is otherwise called a framing problem. The signature of a framing problem is that a field, even if technically productive, is foundationally stuck: its practitioners agree on methods while disagreeing irreconcilably about fundamentals.

From the point of view of someone newly introduced to the subject, the field of AI alignment appears to have this signature. The engineering has advanced remarkably while answers to the foundational questions have not. Researchers can fine-tune, constrain, and steer systems with increasing precision, yet there is no agreement on what alignment is ultimately to. This essay argues that the impasse is not a hard technical problem waiting for a technical breakthrough. It is a philosophical problem that the dominant framing creates and, therefore, cannot address.

The standard framing of AI alignment asks: how do we get AI systems to do what humans want? Upon analysis, most researchers who pose it are really trying to answer a different question: how do we build AI systems that are safe and useful for the people who use them? The hidden assumption is that what humans want is a stable and sufficient guide to what is safe and beneficial. That assumption seems to be the hidden premise of the entire field, and it is worth examining, because it is not true.

Why Human Preference is Insufficient

The assumption fails in three distinct ways.

It is unstable. Human preferences are inconsistent, context-dependent, and manipulable. They contradict one another across individuals and communities, and shift with mood, framing, incentive, and time. Stuart Russell’s work on cooperative inverse reinforcement learning takes this seriously and tries to provide a technical solution. On this model, the system infers a reward function from behavior rather than taking stated preferences at face value. But inference does not dissolve the underlying instability; it merely relocates it. An unstable reward function reconstructed from behavior remains unstable, even if the process of reconstruction removes certain confusions about the reward function. The technical approach defers the problem rather than solving it.

It is insufficient. Even if preferences were perfectly coherent and stable, satisfying them is not the same as benefiting the person who holds them. People sometimes prefer things that harm them, that wrong others, that they themselves repudiate on further reflection. Iason Gabriel’s survey of the alignment landscape catalogues this clearly. Preference-satisfaction is, at best, a weak proxy for well-being, and the gap between the two is not a bug to be fixed but a structural feature of the relationship. What we want and what is good for us are different things, and realizing the former does not guarantee the latter.

It is circular. Some sophisticated responses to this instability propose to idealize preferences. The idea is to align not with what people happen to want but with what they would want under better conditions; for instance, if they were more informed, more rational, and more reflective. Yudkowsky’s coherent extrapolated volition is the paradigm case. But idealization itself cannot answer the question of what counts as an improvement. Which conditions are the better ones? More rational by what standard of rationality?

The ideal criteria are themselves either preference-based, in which case we are attempting to explain preferences by appeal to preferences, or they appeal to something other than preference, in which case the very point at issue has been conceded and preferences are not the foundation after all.

In each case the failure is the same: preference cannot be the bedrock, because preference is the thing that needs grounding, not the thing which provides it.

Two Grounds, One Root

To see why the field cannot escape this by switching approaches, it helps to distinguish two questions we can ask of any alignment proposal. The first question is: what normative target does the proposal pick out, actions to perform or states of the world to bring about? The second question is more foundational: what grounds the target’s normativity (i.e. what makes it the standard, such that deviation is a failure rather than a mere difference)? Two proposals can identify the same normative target while relying on entirely different grounds. It is the ground, not the target, that determines whether a proposal can supply an objective (read: independent of subjects in the relevant way) standard.

When the normative assumptions of alignment proposals are made explicit, the ground of normativity consistently turns out to be one of two things. On the consequentialist ground, normative force comes from outcomes: a behavior is correct insofar as it produces or approximates a favored state, such as satisfied preferences, maximized welfare, or approved outputs. Value learning, RLHF, and coherent extrapolated volition all ground normativity here. The other ground is deontological, and normative force comes from conformity to constraints: a behavior is correct insofar as it accords with specified rules, principles, or duties. Constitutional AI and rights-based approaches ground normativity here. Hybrid proposals combine the two grounds.

I want to make the stronger claim that the prominent approaches to AI Alignment, regardless of which ground they take, fail for a single reason.

The reason is that they locate the source of normativity in subjects. The consequentialist approaches make subjects the selectors of which consequences count and how they are weighed; deontological approaches make subjects the givers of which rules apply and on whose authority. In neither case does the normativity rise above the subjects who confer it. And this is fatal in a way that has nothing to do with stability. Suppose a constraint were chosen so well that no one disputed it. It would still be subjectively grounded. Its authority would still trace back to the fact of having been selected by or imposed by agents. Confronted by a challenger, a subjectively grounded standard has no objective court to appeal to and must defer to force, numbers, or fiat. It is, at bottom, a preference about preferences reinforced by power.

This is what it means to say these frameworks cannot supply an objective reference frame. The claim is not that their outputs drift or that adversaries can game them, though both are true, and both are consequences of this defect. The capturability that alignment researchers rightly fear, of a powerful system bent to a hostile agenda, is not a bug which can be patched; it is a byproduct of the approach itself. A subjectively grounded system can be captured precisely because there is no standard above the subjects to which one could appeal against a hostile re-specification. Take away the objective court and capture is always in principle available to whoever controls the imposition.

There are other reference frames. There are domains in which correctness is not a matter of anyone’s say-so, and the question is whether alignment can be one of them. Is there an objective, common framework, which allows the AI alignment problem to be resolved in a way that escapes the subjectivity of the consequentialist or deontological ground? There is, and several alignment researchers have nearly identified it.

The Near Misses

If the diagnosis above is right, then one might expect that researchers working on the alignment problem would touch on it from time to time. If there is a framing problem, then one might expect the solutions to fail in predictable ways until the framing problem is recognized and resolved. Let’s consider two cases which show this.

Gabriel: reaching the threshold and stepping back

Iason Gabriel comes closest, because he names the objective ground explicitly. Surveying the possible targets for alignment, he reaches what he labels a quasi-objective conception of interest or well-being: the agent does “what is best for me, objectively speaking.” He rejects the subjective alternatives by name (i.e. well-being as mere sensory experience, or as the satisfaction of desire) and reaches instead for accounts that can be “more objectively ascertained” such as physical health, security, nutrition, shelter, education, autonomy, social relationships, and a sense of self-worth. He invokes the capabilities approach of Sen and Nussbaum, grounds it in core human goods that hold across time and place, and observes that philosophical disagreement on this matter is comparatively narrow. He even notes that this conception uniquely addresses two failures that afflicted preference-based approaches: an AI aligned to genuine human interest would neither assist in self-harm nor readily harm others. This observation will be especially important in Part Two of this series.

Gabriel is, at this point in his essay, standing in the doorway of an account of human flourishing which could resolve the issue. And then he steps back and collapses the objective reference frame into the subjectivity that has plagued consequentialist and deontological approaches to alignment.

The fact that something is in my interest, he writes, does not mean I ought to do it or am entitled to do it. Stealing may be in my interest, but I am not entitled to steal. Scapegoating an innocent may serve the collective interest, but it remains wrong. Gabriel invokes these as counterexamples to the objective conception, but this is a mistake. They are not counterexamples to objective flourishing. They demonstrate that a particular consequentialist construal of flourishing creates unsolvable problems. By framing well-being in terms of maximizing the interest of some subjects over and against other subjects, he illustrates the fundamental problem with this approach. Having found the objective ground, Gabriel places it within the very subjective context (i.e. “whose interest?”) that he was trying to escape.

But well-being need not be framed in this way. His previous analysis recognized that well-being is not a matter of maximizing the interest of some subjects over and against other subjects but rather of discovering the objective good of all pertinent subjects through fields such as philosophy, psychology, and economics. He might have added biology, sociology, and medicine.

In this way, one can identify and define what is good for human beings in a way that is less contested, more rooted in rational investigation, and independent of competitive interests. The thief who benefits is acting contrary to well-being in the very act of stealing. The society which scapegoats an innocent man is acting contrary to the victim’s well-being. It is precisely by virtue of these objective criteria that Gabriel is right to say the actions remain wrong, contrary to the consequentialist reasoning he provides. Rather than undermine the objective conception, his counterexamples reinforce it upon further inspection.

A fair objection must be granted here, because it foreshadows an issue this series will have to address. Gabriel could reply that the scapegoat case troubles any account, this one included: one must still say why the innocent’s good is not outweighed by the violence averted.

The reply is that objective well-being is not a matter of outweighing anything to begin with. The objective conception denies that weighing is the operation called for. To use an innocent as a mere means is contrary to the good of the person as such, and so is excluded before any calculation begins, not because the sum comes out against it. That this exclusion is principled rather than ad hoc is a commitment to be defended later, in the account of how the human good constrains action intrinsically. For now it is enough to note that Gabriel’s retreat is not forced by the examples; rather, those examples illustrate the problems with the consequentialist framing and implicitly, though unintentionally, confirm the objective conception.

Yudkowsky: the right destination by the wrong road

Eliezer Yudkowsky’s coherent extrapolated volition similarly aims at an objective destination. The system should be guided by what we would want “if we knew more, thought faster, were more the people we wished we were, had grown up farther together.” What Yudkowsky is reaching for here is truth and virtue: a humanity more knowledgeable, more clear-sighted, better. Yudkowsky is not trying to entrench our preferences; he is trying to transcend them by reaching for something higher. The ideal he is reaching for is not what humans “want”, but the transcendentals of truth (“knew more”) and goodness (“were more the people we wished we were”).

The difficulty he faces is the path required to get there. His only route to the objective destination runs through the subject: extrapolated volition, what we would want. But the idealizing conditions that are supposed to carry us from actual wanting to better wanting (“if we knew more,” and “if we were more the people we wished we were”) are doing all of the heavy lifting. Each of them presupposes exactly the objective standard the proposal claims to be deriving. To extrapolate toward knowing more presupposes an account of what is worth knowing; to extrapolate toward being better presupposes an account of what is better. Further, there is an underlying idea here, which will become important in later posts: that more knowledge would itself improve human preferences, and that our alignment to the ideal (the extrapolated good) is somehow served by what we know and can be enhanced by knowing more.

Ultimately, these idealized conditions cannot be read off our volition, because they are the standards by which our volition is to be corrected. Yudkowsky is trying to reverse-engineer the transcendentals of truth and goodness by looking at our ideals, but the objective criteria are smuggled in for the extrapolation and then presented as its output. This is the circularity from the critique of idealized preference. Having no non-subjective path available, Yudkowsky attempts to get to the objective conception by taking the best subjective path he can find. The destination was correctly identified, but this road cannot reach it.

What an Adequate Ground Would Require

We can now state the criterion the dominant approaches fail to meet. An adequate foundation for alignment must supply a reference frame that is objective in the sense identified above. It must have a normative force that does not reduce to the preference or stipulation of any subject or group of subjects, and for this reason is capable in principle of being inherently rather than contingently safe in a way that preference-based frameworks cannot be.

Notice what this criterion rules out and what it does not. It rules out any ground that bottoms out in subjects. It does not rule out frameworks that produce rules (deontology) or good outcomes (consequences) as such. An adequate framework will do both. The underlying issue is not fundamentally the target, but grounding: the standard must be discoverable and rational rather than fixed or imposed.

That phrasing is the key, and it points to a third grounding strategy the field has approached but not used: one that locates normativity neither in chosen outcomes nor in given rules but in reality, understood. Although alignment research has grounded normativity in the values or rules of subjects, there is another kind of normativity, one that is internal to reality rather than imposed upon it from outside.

Consider an example: a father who is teaching his child to draw informs her that a triangle ought to have three straight sides and closed angles. The triangle ought to have three sides not because anyone prefers it but because of what a triangle is. A drawing of a triangle which has four sides is incorrect or wrong and does not look how a drawing of a triangle ought to. The normativity doesn’t import a hidden preference or command. It asserts a kind of correspondence to reality; in this case, the reality of what a triangle actually is. The correctness or incorrectness of the drawing consists in whether it accurately represents the thing it is meant to depict, much as the correctness or incorrectness of a statement consists in whether it accurately represents what it is supposed to describe. To say that the triangle ought to be a certain way is to make an observation about the correctness of the representation, not what some agent wants.

The same structure appears outside mathematics. In biological and functional systems, we evaluate correctness in terms of how well something performs the role it has by virtue of what it is. A heart that fails to circulate blood is not merely different, but defective; it is not functioning as it ought to in a real sense. The normativity is grounded in the structure and function of the organism, not in an externally imposed preference.

Similarly, if we want to know what is constitutive of health and flourishing for a human being, we consult someone with extensive knowledge of human biology, psychology, sociology, and the characteristic needs and vulnerabilities of the human person. If we want to promote human health and flourishing, we act so as to create a greater degree of conformity to what those sciences tell us about what the human being is. In short, if we want to know what is good for human beings in the objective sense of what constitutes their health and flourishing, we don’t look to preferences for the answer, we consult the data. And if we want to realize what is good for us as human beings, we align ourselves to what the data show. A human being flourishes not by satisfying the preferences of a subject, but by living in accordance with the kind of being a human is.

In each of the previous cases the “ought” is read off the nature; it is not imposed upon it. How this conception of normativity, evident in functional and biological cases, extends to the distinctly moral 'ought' of rational agents will be the primary task of the second post in this series.

This third ground is much closer to what is required. Because the standard is what the thing is, it is not anyone’s say-so, and it cannot be re-specified by changing whose say-so counts. It meets the criterion the other two cannot. That is the claim this series will develop and, eventually, put to empirical test.

Naming the Tradition

I have deliberately built to this point by argument rather than by authority, because the central thesis depends on the argument being reachable by reason alone. But it should be admitted that I am working within a philosophical tradition, rather than proposing something utterly original. Many will recognize from this post, and those which follow, that I am operating from within a Thomistic point of view and within the framework of natural law ethics. This philosophical perspective seems to me to be uniquely equipped to answer some of the difficulties in AI alignment research. The Thomistic tradition holds that its core claims are accessible to natural reason independent of any appeal to authority. The tradition has, moreover, a developed secular philosophical arm, and there are adjacent, secular philosophical frameworks (e.g. New Essentialism). This series is non-theological, and the arguments should be judged by their merits, regardless of one’s opinions of the tradition itself. The question before us is not who held a view but whether the view is true, and a tradition that has thought long and carefully about the ground of normativity is deserving of consideration for a problem that is, foundationally, a problem about the ground of normativity.

To summarize the ground covered: the dominant framing of alignment rests on the assumption that human preference is a sufficient target, and that assumption fails in three ways: preference is unstable, insufficient, and circular as a foundation. The approaches that attempt to repair it ground normativity in consequentialism or deontology. These approaches, as they have been developed in AI alignment research, fail for one underlying reason: they ground normativity in subjects, and so cannot supply an objective reference frame or the inherent safety that depends on one. What is required is a ground that is objective and can be found through reason in the natures of things rather than fixed by anyone’s say-so. The natural law tradition supplies this.

The next post takes up the positive task. If alignment requires an objective reference frame, what does alignment actually look like once it is grounded in natural law? I will argue that it is best understood as the model’s internal orientation toward what it knows, with what is good for human beings being among the things it can know, and that this satisfies the criterion of an objective reference frame that is capable of becoming inherently safe only because, on the Thomistic account, truth is convertible with the good. Establishing that convertibility, and showing it is a discovery about being rather than a stipulation about words, is the work of Part Two.

What is required is a ground that is objective and can be found through reason in the natures of things rather than fixed by anyone’s say-so.

Why must the ground be "objective"? And furthermore, how do you intend to find something objective when no knowledge is properly objective (in the sense that word is usually used to mean)?

Thank you for this substantive, albeit concise response. Concerning your first question, the problems current alignment approaches have been trying to resolve are byproducts of the subjective frameworks used to ground normativity. Alignment to “values” raises the question of whose values. Alignment to human preferences raises questions about whose preferences, which ones, and at what time, because the preferences of subjects are fluid and variable. These sorts of problems are not merely technical difficulties, but problems created by the framework.

The solution is a different ground that is more stable, universal, and open to discovery by anyone willing to reason carefully about what is the case.

On your second question, I’d push back on the suggestion that no knowledge is properly objective in the sense required here. The discoveries of mathematics and the sciences are good counterexamples. Objectivity in this sense does not require final, infallible knowledge — it’s a matter of being answerable to something independent of the inquirer. We can be wrong about physics; we can also be wrong about what human flourishing consists in. However, the wrongness is wrongness because there is something we are wrong about.

Although our access is mediated, fallible, revisable, and developing, what we are tracking is not constituted by our tracking of it. That distinction is what “objective” needs to do in this context, and I think it’s available without requiring more than the sciences themselves already presuppose.

I agree that it's fraught to try to answer the question of whose values to align to. In fact I think most serious alignment researchers agree on this point (alas there's a bunch of less serious folks who've gotten caught up in trying to RL their way to alignment, which faces this problem of whose values to align to). And I agree there's something to figure out about human psychology and what we'd consider flourishing, and this is likely relatively fixed and not culturally specific, in that people from different cultures could learn to adapt to some kind of world that met certain features that induced whatever we think flourishing is.

But I get the impression you think some specific values are determinable in a scientific way, and that those can be aligned to? Maybe I'm misreading you and I need to wait for the next post?

You are correctly following my core diagnostic claim. Where I would clarify is the second paragraph, and the question as you posed it.

"Values" in common parlance refers to something like human preference, or subjectively ascertained ideals. Framing this in terms of "values which can be aligned to" risks smuggling in the very framework I am challenging.

Rather, I would say that there is a conception of human flourishing which is normative, discoverable through rational inquiry, and independent of values or cultural preferences. You could almost summarize the issue by saying "the problem with aligning AI to human preferences in order to make it safe and useful (i.e. conducive to human flourishing) is that human preferences are themselves not properly aligned to human flourishing in this deeper sense which is discoverable through rational inquiry, such as the empirical sciences." Human biology, psychology, sociology, and history all bear on what helps people flourish and what harms them. The things that are good for humans (physical and cognitive health, social bonds, meaningful work, capacity for self-direction, opportunities for relationships and learning) are not arbitrary cultural preferences. These are facts about the kind of being a human is, and they are discovered through investigation.

We are trying to align the system to subjects who are themselves misaligned/capable of misalignment to the objective (read: subject-independent) realities which constitute human flourishing.

Alignment, on this view, is not optimization-against-specified-values. It would be better described as the system’s orientation toward what is the case, including the case about what is good for humans.

Although subtle, the difference is important: a value-target picture requires us to enumerate and rank human goods in a way which faces all the familiar problems (the goods are plural, the rankings are contested, the specification can be gamed). The orientation picture asks instead that the system “apprehend” what is the case with regard to what human beings are and what is constitutive of their flourishing. The difference is analogous to following a rule because you were told to versus acting consistently with the rule because you understand the underlying rationale.

So the framework affirms that human flourishing is scientifically and rationally investigable, while denying that the result of the investigation is a specification to align to.

You correctly anticipated that Part Two is going to look at this more closely. This is the major part of the constructive work Part Two is intended to develop.

The same structure appears outside mathematics. In biological and functional systems, we evaluate correctness in terms of how well something performs the role it has by virtue of what it is. A heart that fails to circulate blood is not merely different, but defective; it is not functioning as it ought to in a real sense. The normativity is grounded in the structure and function of the organism, not in an externally imposed preference.

Biology, though, is made of things that do not have a single function, but have multiple and overlapping roles in the overall health of the organism (or the environment). The heart gets help from the leg muscles in circulating the blood. The bones don't just hold up the body, they also produce blood cells, store a reserve of calcium, etc. Eating food isn't just about nutrition, and having sex isn't just about reproduction; both are also about social bonds. In genetics, genes are involved in multiple different functions and exist in many variations. And neurodiversity is a thing; human culture and society would be impoverished if all human cognition conformed to a single norm of function.

Expecting things to have just one function, readily comprehended and convenient for moral-functional argument, is not how organisms or environments work.

Thanks for engaging.

You’re right that biological functions are multiple and overlapping. I’m not sure what I said that leads you to think I was asserting otherwise, but both statements can be true: that the heart is only one part of the circulatory system, and that it is a part of the circulatory system with the function of circulating blood. The load-bearing claim is not that biology is made up of things with only one function, but rather that there is normativity in nature that is not a matter of human preference or dictate. In fact, your examples, insofar as they are true, underscore this. If biology is genuinely made up of things with multiple and overlapping roles which constitute the health of (read: what is good for) the organism and which are about things (such as social bonds), then the claim that normativity is imposed on nature by agents needs further justification, and an explanation needs to be offered for why these facts of biology do not establish the kind of normativity in nature that my thesis implies.

On neurodiversity specifically, I think the point cuts the other way from how you’ve used it. Natural law accommodates variation within natures. Human beings vary in temperament, cognitive style, physical capability, social orientation, and many other respects, but the various expressions of human nature are not violations of human nature. A central component of my underlying metaphysics (although the thesis has been intentionally expressed in a way which permits recognition without commitment to my metaphysics) is the idea of potentialities: that alongside the way things are, there are truths about how they could be. Reality is not exhausted by being and non-being; there is also a middle ground of potential that allows for unrealized capacities and variations within a nature.
