I think it's a mix of two (very related) things:
It follows from these that solving alignment means we should be able to make the AI follow any random goal, and that "simplifying the problem" to only making it follow human values / human morality / the values of any given human doesn't buy us much. It doesn't make the problem easier, because there's nothing special about human values.
I feel I'm not explaining myself very well. But imagine you wanna launch a rocket into the sun. The sun is a very distinguished and special object relative to earth. Probably your plan for launching the rocket into the sun will need to consider a bunch of details about the sun.
Now imagine instead you wanna launch your rocket to star 8285F171058B in the Milky Way. Probably most steps of that plan would be the same if you were instead sending the rocket to some different random star. This means that solving the problem of sending a rocket to a random star is a better line of attack than trying to analyse all the properties of star 8285F171058B. Most of those features will not be relevant to the hard part of the problem.
I feel quite a bit of skepticism over the idea that a consensus view of moral anti-realism would have led to a preference for an alignment framing.
For example, amongst non-experts there is a strong consensus about what counts as moral and immoral conduct. Amongst moral philosophers, as I understand it, moral anti-realism is also a minority view. My understanding was that moral naturalism was closest to consensus. (Not to say moral anti-realism is necessarily wrong.) If there were some kind of article or post describing how this informed a shift in framing to al...
I personally avoid even using the words "morality" or "ethics" in the context of AI alignment, because both of those words reliably turn the vast majority of otherwise-sensible people into morons the moment they are spoken.
Could you elaborate? That is surprising to me given the extreme importance of those terms for philosophical analysis of what is "good," "right," and so on.
Indeed, invoking the words "good" or "right" also tends to make people dumber (though less so than "morality" or "ethics"), and trying to do philosophical analysis of what is "good" or "right" is exactly the thing which seems to insta-brain-kill people; it's exactly the lever which "morality" and "ethics" pull.
For example, let's look at two pages in the Stanford Encyclopedia of Philosophy. I picked these by pulling up the table of contents, and then clicking the first one which seemed not-very-morality-loaded and the first one which seemed very-morality-loaded.
First up, abduction. No morality talk here. It's describing a feature of human reasoning, which seems functionally load-bearing for epistemics in some cases and would probably generalize to other kinds of minds (like aliens or AI). It doesn't trivially fit a couple common frames of epistemics, which is why it's interesting. A lot of the discussion is centered around pretty narrow or outdated models of reasoning, but it's a technically interesting and sensible article, which inspires good questions at least.
In contrast, the ethics of abortion. Before we even get to the actual content, note the topic. Abduction is a topic releva...
Ok, I think I might see what you mean now; one might prefer framings in terms of alignment over morality, because moral framings might tend to provoke controversy, irrationality, or reactionary thinking.
Personally, I feel I would still tend to prefer the moral framing, in terms of clarity and just plain accuracy. It does seem a little like the alignment framing is obfuscating the subject just to make it less provocative, when really the subject is going to be provocative no matter what, once you think about it deeply.
In some sense, Anthropic’s Constitutional AI approach is trying to point at moral or ethical machines in an informal fashion. So one could argue that “moral machines” approaches are not so rare.
One might argue that “collaborative AI” approaches or “equal rights” approaches are trying to point at moral or ethical machines to some extent as well.
One occasionally sees some attempts at more formal frameworks, for example, attempts to base AI safety on some version of “ethical rationalism” for agents, e.g. on the “Principle of Generic Consistency” (https://en.wikipedia.org/wiki/Alan_Gewirth#Ethical_theory). And people are trying to bring modal logic into play in this context, and so on.
Obviously, one needs to evaluate each particular approach separately in terms of whether it is likely to work well (and, in particular, whether it is likely to “hold water” during “recursive self-improvement” and drastic self-modifications and self-restructuring of the world; that’s where things are particularly challenging).
Yeah, I can see the morality goal being manifested indirectly in those cases. Interesting that you mention Anthropic. The thought had crossed my mind that one reason the alignment framing became more popular is undue influence from corporations, which might have intentionally sought to reframe the safety problem as one of building AI aligned to their goals (e.g., profit maximization), rather than building moral AI that would be safe but not profitable. Although, admittedly, that feels like a somewhat deranged conspiracy theory that I have no evidence for.
Alignment is very attractive pragmatically, e.g. alignment to the user. But then what if what the user wants is unsafe? Then one starts to consider “alignment hierarchies” (e.g. the LLM maker’s constraints should override, and so on).
But superintelligent systems can’t be safely aligned to arbitrary desires of people. The more one ponders this, the clearer it becomes. People are just not competent enough to handle supercapabilities. There are various ways one can try to salvage “alignment” as the core; e.g. to consider alignment to the “coherent extrapolated volition of humanity”, but that has its own difficulties. At some point Ilya redefined “alignment” as something minimalistic (the lack of a catastrophic blow-up), basically keeping the word but drastically curtailing its meaning: https://www.lesswrong.com/posts/TpKktHS8GszgmMw4B/ilya-sutskever-s-thoughts-on-ai-safety-july-2023-a.
But yes, with “alignment” meaning so many different things (https://www.lesswrong.com/posts/ZKeNbGBf36ZEgDEKD/types-and-degrees-of-alignment), I would advocate decoupling AI existential safety from it. Alignment approaches form an important subclass of possible approaches to AI existential safety, and we should consider all promising approaches, not just that subclass.
There is a basic question that has been confusing me for a while that I would like to ask about:
Why are the goals of AI safety, like achieving safety from extinction risks or protecting human wellbeing, not more often framed as the goal of making moral machines? In other words, building AI that has a strong and reliable sense of morality and ethics.
There is definitely a lot of discussion around the edges of this question. For example, one recent post by @Richard_Ngo asked whether AI should be aligned to virtues. Or, a post from last year by @johnswentworth described thinking about what the alignment problem is. However, there's also a huge swath of writing where the concept of machine morality is never invoked or mentioned.
Part of the reason for my curiosity is that it seems like this framing could resolve a lot of confusion, and in many ways it seems the most intuitive. For example, this seems like probably the most important framing that we apply, broadly, when trying to raise and educate safe and good humans.
This framing would also provide a nice way of synthesizing many different core AI safety results, like 'emergent misalignment.' We could simply say that AI exhibiting emergent misalignment did not possess a strong moral compass, or a strong sense of morality, prior to its fine-tuning.
Is there a kind of history with this framing where it was at some point made to seem outmoded or obsolete? I can imagine various obvious-ish objections, like the fact that morality is hard to define. (But again, the fact that this is the framing we use with humans makes it seem pretty powerful and flexible.) But it's not clear to me why this framing has any more or fewer issues than any other.
Greatly appreciate any input, or suggestions of where to look further.