I think it's a mix of two (very related) things:
It follows from these that solving alignment means we should be able to make the AI follow any random goal, and that "simplifying the problem" to only making it follow human values / human morality / the values of any given human doesn't buy us much. It doesn't make the problem easier, because there's nothing special about human values.
I feel I'm not explaining myself very well. But imagine you wanna launch a rocket into the sun. The sun is a very distinguished and special object relative to earth. Probably your plan for launching the rocket into the sun will need to consider a bunch of details about the sun.
Now imagine instead you wanna launch your rocket to star 8285F171058B in the Milky Way. Probably most steps of that plan would be the same if you were instead sending the rocket to some different random star. This means that solving the problem of sending a rocket to a random star is a better line of attack than trying to analyse all the properties of star 8285F171058B. Most of those features will not be relevant to the hard part of the problem.
I feel quite a bit of skepticism over the idea that a consensus view of moral anti-realism would have led to a preference for an alignment framing.
For example, amongst non-experts there is a strong consensus about what counts as moral and immoral conduct. Amongst moral philosophers, as I understand it, moral anti-realism is also a minority view. My understanding was that moral naturalism was closest to consensus. (Not to say moral anti-realism is necessarily wrong.) If there were some kind of article or post describing how this informed a shift in framing to al...
I personally avoid even using the words "morality" or "ethics" in the context of AI alignment, because both of those words reliably turn the vast majority of otherwise-sensible people into morons the moment they are spoken.
Could you elaborate? That is surprising to me given the extreme importance of those terms for philosophical analysis of what is "good," "right," and so on.
Indeed, invoking the words "good" or "right" also tends to make people dumber (though less so than "morality" or "ethics"), and trying to do philosophical analysis of what is "good" or "right" is exactly the thing which seems to insta-brain-kill people; it's exactly the lever which "morality" and "ethics" pull.
For example, let's look at two pages in the Stanford Encyclopedia of Philosophy. I picked these by pulling up the table of contents, and then clicking the first one which seemed not-very-morality-loaded and the first one which seemed very-morality-loaded.
First up, abduction. No morality talk here. It's describing a feature of human reasoning, which seems functionally load-bearing for epistemics in some cases and would probably generalize to other kinds of minds (like aliens or AI). It doesn't trivially fit a couple common frames of epistemics, which is why it's interesting. A lot of the discussion is centered around pretty narrow or outdated models of reasoning, but it's a technically interesting and sensible article, which inspires good questions at least.
In contrast, the ethics of abortion. Before we even get to the actual content, note the topic. Abduction is a topic releva...
Ok, I think I might see what you mean now; one might prefer framings in terms of alignment over morality, because moral framings might tend to provoke controversy, irrationality, or reactionary thinking.
Personally, I feel I would still tend to prefer the moral framing, in terms of clarity and just plain accuracy. It does seem a little like the alignment framing is obfuscating the subject just to make it less provocative, when really the subject is going to be provocative no matter what, once you think about it deeply.
In some sense, Anthropic’s Constitutional AI approach is trying to point at moral or ethical machines in an informal fashion. So one could argue that “moral machines” approaches are not so rare.
One might argue that “collaborative AI” approaches or “equal rights” approaches are trying to point at moral or ethical machines to some extent as well.
One occasionally sees some attempts at more formal frameworks, for example, attempts to base AI safety on some version of “ethical rationalism” for agents, e.g. on the “Principle of Generic Consistency” (https://en.wikipedia.org/wiki/Alan_Gewirth#Ethical_theory). And people are trying to bring modal logic into play in this context, and so on.
Obviously, one needs to evaluate each particular approach separately in terms of whether it is likely to work well (and, in particular, whether it is likely to “hold water” during “recursive self-improvement” and drastic self-modifications and self-restructuring of the world; that’s where things are particularly challenging).
Yeah, I can see the morality goal being manifested indirectly in those cases. Interesting that you mention Anthropic. The thought had crossed my mind that one reason the alignment framing became more popular is undue influence from corporations, which might have intentionally sought to reframe the safety problem as one of building AI aligned to their goals (e.g., profit maximization), rather than building moral AI that would be safe but not profitable. Although, admittedly, that feels like a somewhat deranged conspiracy theory that I have no evidence for.
Alignment is very attractive pragmatically, e.g. alignment to the user. But then what if what the user wants is unsafe? Then one starts to consider “alignment hierarchies” (e.g. the LLM maker’s constraints should override, and so on).
But superintelligent systems can’t be safely aligned to arbitrary desires of people. The more one ponders this, the clearer it becomes. People are just not competent enough to handle supercapabilities. There are various ways one can try to salvage “alignment” as the core; e.g. to consider alignment to the “coherent extrapolated volition of humanity”, but that has its own difficulties. At some point Ilya redefined “alignment” as something minimalistic (the lack of a catastrophic blow-up), basically keeping the word but drastically curtailing its meaning: https://www.lesswrong.com/posts/TpKktHS8GszgmMw4B/ilya-sutskever-s-thoughts-on-ai-safety-july-2023-a.
But yes, with “alignment” meaning so many different things (https://www.lesswrong.com/posts/ZKeNbGBf36ZEgDEKD/types-and-degrees-of-alignment), I would advocate decoupling AI existential safety from it. Alignment approaches form an important subclass of possible approaches to AI existential safety, and we should consider all promising approaches, not just that subclass.
There is a basic question that has been confusing me for a while that I would like to ask about:
Why are the goals of AI safety, like achieving safety from extinction risks or protecting human wellbeing, not more often framed as the goal of making moral machines? In other words, building AI that has a strong and reliable sense of morality and ethics.
There is definitely a lot of discussion around the edges of this question. For example, one recent post by @Richard_Ngo asked whether AI should be aligned to virtues. Or, a post from last year by @johnswentworth described thinking about what the alignment problem is. However, there's also a huge swath of writing where the concept of machine morality is never invoked or mentioned.
Part of the reason for my curiosity is that it seems like this framing could resolve a lot of confusion, and in many ways it seems the most intuitive. For example, this seems like probably the most important framing that we apply, broadly, when trying to raise and educate safe and good humans.
This framing would also provide a nice way of synthesizing many different core AI safety results, like 'emergent misalignment.' We could simply say that AI exhibiting emergent misalignment did not possess a strong moral compass, or a strong sense of morality, prior to its fine-tuning.
Is there a kind of history with this framing where it was at some point made to seem outmoded or obsolete? I can imagine various obvious-ish objections, like the fact that morality is hard to define. (But again, the fact that this is the framing we use with humans makes it seem pretty powerful and flexible.) But it's not clear to me why this framing has any more or fewer issues than any other.
Greatly appreciate any input, or suggestions of where to look further.