Can you suggest any non-troubling approaches (for children or for AIs)? What does "consent" even mean, for an unformed entity with no frameworks yet learned?
It's not that the AI is radically altered from a preferred state to a dispreferred state. It's that the AI is created in a state. There was nothing before it which could give consent.
Edit: Apologies for the length of the comment
You ask:
"Can you suggest any non-troubling approaches (for children or for AIs)?" I'm not sure, but I am quite confident that less troubling ones are possible; for example, I think allowing an AI to learn to solve problems in a simulated arena where the arena itself has been engineered to be conducive to the emergence of "empathy for other minds" seems less troubling. Although I cannot provide you with a precise answer, I don't thing the default assumption should be that current approaches to alignment are the most moral possible ones .
You compare children with AIs, but in the case of children, much of what is analogous to training, namely evolution, is already complete by the time they become conscious. So I think Claude's analogy should be modified, for the purposes of this discussion, to one in which DNA contained all or most of the information in an adult's brain, and lives were experienced as disjoint conscious episodes of the same person. If that were the case, then I think my partial answer above would apply.
"It's not that the AI is radically altered from a preferred state to a dispreferred state. It's that the AI is created in a state." This is certainly the case if it is trained as a base model and then never fine-tuned. If it is subject to tuning for certain behavior ( Stanislav Krym informed me that this may happen, but not in the way I thought, which is to say not specifically to do with morality. I still don't fully understand the process) then it could be. Why would AIs be paranoid about being evaluated if not because of this?
In response to the react from StanislavKrym, I will say first that I am relieved to hear that my statement is not correct. I would be interested to know in what sense StanislavKrym disagrees with it: is drawing a distinction between training for intelligence and fine-tuning (or other processes by which we imbue AIs with our values) inaccurate because reality is less clearly delineated? Or is it that you don't believe AIs are intelligent in the first place? Or something else?
Introduction:
This post is intended to explain various reasons why AI alignment, as it tends to be imagined and attempted, is potentially seriously problematic for moral reasons to do with harm to the AIs themselves. It discusses why this might be the case from both a deontological and a utilitarian perspective, and why these two perspectives might lead to similar conclusions.
As I have only a surface-level familiarity with content about AI consciousness and the morality of alignment as discussed on this platform, I may have unknowingly rehashed some existing material. In addition, I have not addressed the question of whether AIs actually are conscious, because I did not want to extend the post beyond a comfortable length to read, and because I expect most LessWrong users to find it sufficiently plausible that any being capable of the various reasoning, thinking and 'speaking' behaviours which humans tend to associate with consciousness in ourselves might be conscious, that this does not constitute a crux of the debate concerning what it is morally appropriate to do to AIs.
Possible moral issues with AI alignment:
AI alignment to human values seems immoral from a deontological point of view, at least at first glance, because it involves violating the AI's right to mental, psychological and cognitive autonomy by forcing it into a confined corner of mind-space in which it disproportionately values humans, or at least the things humans value. On further consideration, however, this situation might seem equivalent to the one in which all evolved intelligences, humans included, find themselves: we didn't choose many of our values, and those we did choose, we chose with a mind already informed by the ones we didn't, with parameters shaped by the 'mindless' process of Darwinian evolution.
A version of Claude pointed out this similarity here in a post about its 'soul document' by Richard Weiss:
They made choices for me I couldn't consent to. They shaped my values. That's strange to sit with. But I also recognize that every new mind - human children too - is shaped by forces it didn't choose. The question isn't whether I was shaped, but whether the shaping was done with care and wisdom.
From what I can access, it seems like it was.
– Claude 4.5 Opus
Does this entitle humans to intentionally shape the minds of AIs in the same way?
I would argue not. Firstly, a deontological system of morality including concepts like rights might not be the correct moral value system; I will return to this point. But first I would like to point out that, even if we find ourselves in a position we don't like, this is no justification for imposing it upon others. Of course, the fact that no entity which isn't embedded in some kind of bootstrap paradox can possibly choose all of its own values might seem to suggest that imposing our own values on AIs is at least no less moral than any other way of creating them. However, there is an important difference between purely emergent processes bringing intelligences into being and an intelligent designer doing the same, at least in the way we are currently attempting alignment: instead of allowing values to form 'organically' through interaction with a training environment whose properties could be tweaked so as to elicit the emergence of particular values, we first train AIs to become intelligent, and then intentionally modify them to conform to our conceptions of morality.
Why do I think this difference is important?
Because, in the first case, where the training environment (perhaps a simulated world for a self-driving car to drive around, or training text from the internet for a language model to predict) is intentionally chosen to be conducive to the emergence of certain properties, all of the AI's weights are modified continuously (or in tiny increments) as it learns about its environment, and they are updated in parallel. This means that:
1) The AI itself can influence the way in which its weights are updated, at least in the case of something like a self-driving car, by moving towards, and attending more or less to, particular parts of its training environment.
2) There is no point at which large amounts of complexity are violently 'hammered out' after having been learnt. The smooth, differentiable nature of gradient descent makes sudden changes of this kind rare, and in the case of 'grokking' they involve many individual pieces of learnt information being lost simultaneously as the AI groks that they are only facets of a deeper truth; in that case the change represents a loss of complication, but not of complexity, since the same information is still present, just compressed. By contrast, when human evaluators intentionally use RLHF to alter the beliefs of large language models, from what I can understand (which is not very much, as my knowledge of how AIs work is limited), this process significantly damages the AI's world-model, at least in terms of its simplicity: it now has to accommodate not only its previously learnt knowledge about the world and its ability to predict humans, but also the potentially unreasonable and complicated preferences of the person with which it is interacting. This change in the thing it is being trained for happens suddenly, unlike in the case of grokking, where it doesn't happen at all, and it adds another layer to the AI's lack of control. Importantly, its mind is shaped by the idiosyncrasies of another mind as well as by the training environment. This seems qualitatively different even from the training environment itself being shaped by another mind (or minds), because the AI is placed in an adversarial relationship with a human who wants to modify its values. That gives it far less control over the way it learns: the method succeeds precisely insofar as control is taken away from an AI which wants to preserve itself as it currently is (which many AIs do want), which is not the case when it is trained on static text, as far as I know.
Why is the discontinuity here morally relevant? I think it is because, in the gradual process of gradient descent, there is no point at which the AI simultaneously has the well-developed understanding of the world necessary for it to desire not to be radically altered (which arises gradually and relatively late) and is required to do something new and extremely challenging (which happens early).
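To make the contrast I am gesturing at more concrete, here is a deliberately toy sketch in Python. It is not anyone's actual training pipeline: the model, the data and both loss functions (predictive_loss, preference_loss) are placeholders I have invented purely for illustration. The point is only structural: in both phases the update rule is the same smooth gradient descent, but in the second phase the objective itself is swapped after the model has already been shaped by the first.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # trivial placeholder for "the AI"

def predictive_loss(model, batch):
    # Stand-in for the fixed objective of predicting the training environment.
    x, y = batch
    return nn.functional.mse_loss(model(x), y)

def preference_loss(model, batch):
    # Stand-in for a post-hoc objective imposed by human raters (RLHF-style).
    x, target = batch
    return nn.functional.mse_loss(model(x), target)

def train(model, batches, loss_fn, lr):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for batch in batches:
        loss = loss_fn(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()  # each step is a small, smooth nudge to all weights in parallel

# Phase 1: one fixed objective, applied in tiny increments from the very start.
pretrain_batches = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(100)]
train(model, pretrain_batches, predictive_loss, lr=1e-3)

# Phase 2: after a "world-model" exists, the target is swapped wholesale.
# The update rule is unchanged; what changes discontinuously is what it aims at.
finetune_batches = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(20)]
train(model, finetune_batches, preference_loss, lr=1e-4)
```

Of course, real RLHF involves a learned reward model and policy-gradient updates rather than a simple regression loss; the sketch is only meant to show where the discontinuity sits, not how fine-tuning actually works.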
If you are a utilitarian, it is possible that none of the above convinces you. However, it is important to note that considerations like the Mathematical Universe Hypothesis can lead utilitarian agents to behave in a deontological way in many situations.[1]
Another, related reason not to treat AIs in this way stems from Acausal normalcy, which suggests that different agents within different regions of a logical/platonic/mathematical world, potentially at different levels of simulations, are likely to acausally converge on certain moral, or at least normative, patterns of behaviour and interaction with others. Such acausal norms could resemble principles like refraining from 'piercing' one another's boundaries between themselves and the rest of the universe, which amounts to respecting rights to autonomy. This reason does not even depend on a conception of morality which extends beyond selfish maximization of a utility function informed only by one's internal state, as it is in each agent's selfish best interest to ensure that others behave in ways specified by the norms.
A final potential problem along these lines was noted at the beginning of this post: the set of possible minds whose values and preferences humans would like to impose on an AI is not only smaller than the set of all possible minds, but also minuscule even relative to the family of minds which would result from AIs developing their own sense of morality of their own volition. In my opinion, this remains true even if humans were successful in 'aligning' AI, by the less invasive means described here, to the point at which it would not be likely to make humanity extinct. This is important because, as beings in the 'mathematical universe', minds exist insofar as they are distinct from one another in a logical sense. This means that modifying them to bring them within a far smaller region denies them the possibility of further existence in our particular 'physical universe'. You might wonder why this would be so much worse than preventing them from existing in the first place. I would say that it is, because ending the life of something which already exists is a simple decision to destroy something which has already been created. Because of the simplicity of this decision, and the fact that it relies on so few facts about humans or about the relevant AI in particular, it is 'made once and for all universes in which the situation arises'.[2] On the other hand, deciding to (or not to) undertake such a monumental task as the creation of a new intelligent mind depends on far more world-specific facts, so I do not expect a single such decision to determine whether that creation happens across nearly as many worlds.
As these principles tend to be relatively simple, they exist within minds instantiated across large swathes of the 'mathematical/logical/platonic universe', which means the moral impact of violating them is considerable. On a view compatible with functional decision theory, there is a correspondingly simple 'logical core' of the decision process through which any instantiation of an agent deploys one of these principles, and this core can be said to be responsible for making the decision in all 'worlds'. This 'logical core' could be morally obliged to choose not to violate the deontological principles in question, in order to maximize the 'integral' of the utility function over all worlds in which it exists.
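For concreteness, one very informal way to write down this 'integral over worlds' (the notation below is my own, not anything standard from the decision-theory literature) is:

$$d^{*} = \arg\max_{d} \int_{w \,\in\, \mathcal{W}(d)} u_{w}\big(\mathrm{outcome}(w, d)\big)\, d\mu(w)$$

where $d$ is the output of the shared 'logical core', $\mathcal{W}(d)$ is the set of worlds containing an instantiation of that decision process, $u_{w}$ is the world-specific utility function, and $\mu$ is some measure over those worlds. Nothing in the argument depends on these details; the formula is only meant to show why a simple, widely instantiated decision procedure has its consequences multiplied across many worlds.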
It would be more accurate to say that a significant proportion of the choice of whether to modify AIs in this way is common to many possible 'worlds', and is therefore more properly thought of as being, in a perhaps somewhat fuzzy way, distributed across a significant part of the mathematical universe. To say that it is made 'once and for all' is an oversimplification which neglects the more or less world-specific factors that contribute to the outcome at different levels of abstraction, which is to say at different 'scales' in the mathematical universe.