This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify ‘true’ moral principles for AI; rather, it is to identify fair principles for alignment, that receive reflective endorsement despite widespread variation in people’s moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.
I found this an interesting paper, and overall I think that I agreed with a lot of it, and that I’d recommend it to people interested in the topic. My main hesitation would be that people who already know a lot about these topics might find most of what the paper says familiar. But I’d guess that even such people would probably still learn something, and that they might benefit from how the paper packages and explains the things they were already familiar with.
It also felt promising to see a paper released by one of the leading AI labs close with:
Finally, the paper has treated AI as an emerging technology. However, the development of different forms of AI is not inevitable. Technologists therefore face important choices about what they want to build and why. Given the potential for AI to profoundly affect our world, these too are salient questions for our time.
But there were a few things in the paper that I felt a bit unsure about, and two passages in particular that I want to quibble with. I’ll first quote the whole first passage and give my high-level views on it, for context, and then I'll get into specifics on that passage. I'll then quote and critique the second passage. (Note that my quibbles/critiques could be mistaken, and that I’d be interested in counterarguments people might have.)
The first passage
The passage in question is supporting the third proposition mentioned above, and goes as follows:
The goal of this section is to identify principles that can govern AI in such a way that it is aligned with human values. But before we look at the options in more detail, we need to be clear about the challenge at hand. For the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines. Rather, it is to find a way of selecting appropriate principles that is compatible with the fact that we live in a diverse world, where people hold a variety of reasonable and contrasting beliefs about value.
Taking these points in turn, it is sometimes thought that if only we could identify the true moral theory then the problem of value alignment would be solved. Moreover, some authors suggest that though we may not have succeeded in identifying such an account to date, it may be possible to do so in the future after a period of ‘long reflection’—perhaps with the assistance of more powerful AI systems (Perry, 2018). Of course, we cannot know in advance what insight AI might enable, so it is sensible to remain agnostic about the long-term value of this technology for moral philosophy. But even if it could help us answer certain questions, it is very unlikely that any single moral theory we can now point to captures the entire truth about morality. Indeed, each of the major candidates, at least within Western philosophical traditions, has strongly counterintuitive moral implications in some known situations, or else is significantly underdetermined.
Furthermore, even if this were not the case and we came to have great confidence in the truth of a single moral theory, the proposed approach immediately encounters a second problem, namely that there would still be no way of reliably communicating this truth to others. For, as the philosopher John Rawls notes, human beings hold a variety of reasonable but contrasting beliefs about value. What follows from the ‘fact of reasonable pluralism’ is that even if we strongly believe we have discovered the truth about morality, it remains unlikely that we could persuade other people of this truth using evidence and reason alone (Rawls, 1999, 11-16). There would still be principled disagreement about how best to live. Designing AI in accordance with a single moral doctrine would therefore involve imposing a set of values and judgments on other people who did not agree with them. For powerful technologies, this quest to encode the true morality could ultimately lead to forms of domination (Ricaurte, 2019).
To avoid a situation in which some people simply impose their values on others, we need to ask a different question:
In the absence of moral agreement, is there a fair way to decide what principles AI should align with?
(Note: See here, and its comments, for more on “the long reflection”.)
Some of the claims in that passage are things I just outright agree with. And I think I’d outright agree with the whole passage if:
- more of the claims were stated as being plausible or likely (rather than as if they were certainties), and/or
- the passage provided clearer arguments for why we should believe these claims, and what the connections between the claims are (rather than seeming to take the truth of the claims, or how one implies another, as self-evident)
But as it was, the passage felt to me like it moved very fast and very confidently, in a way that gave me a sort of whiplash.
With that context in mind, I’ll now get into what I saw as the specific issues.
My first set of quibbles
I have no issue with the following claims, in themselves:
But even if it could help us answer certain questions, it is very unlikely that any single moral theory we can now point to captures the entire truth about morality. Indeed, each of the major candidates, at least within Western philosophical traditions, has strongly counterintuitive moral implications in some known situations, or else is significantly underdetermined.
But I don’t see how those claims at all support the broader point being made: i.e., that “the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines”, and/or that we would not be able to come “to have great confidence in the truth of a single moral theory”.
Instead, it seems to me that we could believe that:
- there might be a “true or correct moral theory”
- perhaps we should find that theory and then “implement it in machines”
- perhaps we could find that theory after something like a “long reflection”
and yet also:
- not believe that this true/correct moral theory would align with “any single moral theory we can now point to”, and/or
- be open to the possibility that this true/correct moral theory just does have “strongly counterintuitive moral implications in some known situations”, or that it just is “significantly underdetermined”
That is, perhaps there’s a true/correct moral theory that’s substantially different to any theory we currently know of, but that could be found by an excellently implemented process of “long reflection”, and that would be an excellent thing to align our AI systems with.
To be clear, I’m not saying we necessarily should believe the above three claims. And I’m certainly not saying that we should take confident, simple versions of them as assumptions when building AGI. (Personally, I do believe all of the above three claims, but with heavy, heavy emphasis on the “might” and the “perhaps”s, and so I’d want our principles for an AGI to be open to those possibilities but definitely not to rely on them.)
What I’m saying is just that I don’t at all see how the specific premise that “it is very unlikely that any single moral theory we can now point to captures the entire truth about morality” supports the conclusion that “the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines”.
My second set of quibbles
The following passage also seems strange to me:
Furthermore, even if this were not the case and we came to have great confidence in the truth of a single moral theory, the proposed approach immediately encounters a second problem, namely that there would still be no way of reliably communicating this truth to others. For, as the philosopher John Rawls notes, human beings hold a variety of reasonable but contrasting beliefs about value. What follows from the ‘fact of reasonable pluralism’ is that even if we strongly believe we have discovered the truth about morality, it remains unlikely that we could persuade other people of this truth using evidence and reason alone (Rawls, 1999, 11-16). There would still be principled disagreement about how best to live.
Firstly, how does the empirical fact that, presently, “human beings hold a variety of reasonable but contrasting beliefs about value” strongly imply that, conditional on someone or some AI reliably identifying a true/correct moral theory, “there would still be no way of reliably communicating this truth to others”?
We’re probably talking about a very different world in this scenario where a true/correct moral theory has been reliably identified. I would assume we’d have had something like a “long reflection”, or major cognitive enhancement, or extremely advanced AI, or something like that. It seems totally plausible that what humans believe would be very different in that world.
Secondly, it does seem to me totally plausible, and perhaps likely, that even if someone had very good reason to believe they’d identified the true/correct moral theory, they still wouldn’t be able to “persuade other people of this truth using evidence and reason alone”. But that doesn’t seem certain. It doesn’t seem clear that “There would still be principled disagreement about how best to live”, or that “Designing AI in accordance with a single moral doctrine would therefore involve imposing a set of values and judgments on other people who did not agree with them” (emphasis added in both cases).
Again, I’d assume a world where identification of a true/correct moral theory has occurred is a very different world. It seems plausible that, in that world, people would be typically convinced of most true things just “using evidence and reason alone”.
And if that isn’t already the case by default, it seems plausible that some way around that obstacle could be developed, which we would not classify as a “form of domination” or as “imposing” values on people. For example, perhaps people’s intellectual capabilities could be raised to something approximating those of whatever entity had identified the true/correct moral theory. And/or perhaps people could be taken through something approximating the same lines of argument and evidence that had aided in the identification of that theory.
Thirdly, even if in practice this couldn’t be done - if not all people could be convinced - it seems debatable whether those people’s disagreement would be “principled disagreement”. It might make sense to view that disagreement as largely the result of lack of knowledge or faulty reasoning, and to view it as wise not to try to fully factor in that disagreement when deciding what values AI systems should be aligned with.
Again, I’m not saying that I believe in the exact opposite claims to the claims the paper makes in this passage, or that we should assume that things will be easier than the paper suggests (see also this post). It just seems to me that the paper:
- takes as certain a set of obstacles that seem to me more like “plausible” or “likely” obstacles
- draws conclusions that don’t seem clearly supported from the premises given
- based on these claimed certainties and conclusions, arrives at the quite substantial and confidently made claim that “the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines” - a claim which I think we should very seriously think about, and perhaps take as likely, but not necessarily a claim we should take as certain
The second passage
(This is a less important point, and one I’m more likely to be wrong about, given that I don’t have much formal philosophical training.)
The paper also discusses:
A second approach to pluralistic value alignment[, which] focuses not on the values people already agree on, but rather on the principles they would agree upon if they were placed in a position where no one could impose their view on anyone else.
And the author writes that, if this approach was taken:
there may be certain distributive principles that would be chosen to regulate advanced AI. Without knowledge of their wealth or social standing, decision-makers might oppose large gaps between AI’s beneficiaries and those who lose out from the technology. These concerns would move them in the direction of egalitarian or prioritarian principles of justice, a strong version of which would be to insist that AI must work to ensure the greatest benefit to the least well off. To meet this condition in a global context, AI would need to benefit the world’s poorest people before it could be said to be value-aligned.
This seems to me to fit a common pattern of people thinking you need egalitarianism or prioritarianism to arrive at a conclusion that you can really arrive at with just standard utilitarianism, given the purely empirical fact that there’s diminishing marginal utility to many resources.
For example, if I know that the same amount of money is more valuable for the poor than for the rich, that not being a slave is far more valuable to a slave than having a slave is to a “free man”, etc., then standard utilitarianism would lead me to work to ensure particular focus on benefitting the least well off. I wouldn’t need to be an egalitarian or prioritarian to reach that conclusion.
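As a rough numerical illustration of that point (a sketch under assumptions that are mine, not the paper's): suppose each person's utility is the logarithm of their wealth, which is one common stand-in for diminishing marginal utility, and a purely utilitarian planner hands out a fixed budget one unit at a time to whoever gains the most utility from it. The wealth figures and the function name `allocate_utilitarian` are hypothetical.

```python
import math

def allocate_utilitarian(wealth, budget):
    """Hand out `budget` one unit at a time to whoever gains the most utility.

    Assumes utility = log(wealth). That is an illustrative choice of a
    concave (diminishing marginal utility) function, not something taken
    from the paper.
    """
    wealth = list(wealth)
    received = [0] * len(wealth)
    for _ in range(budget):
        # Utility gained by each person from one extra unit of wealth.
        gains = [math.log(w + 1) - math.log(w) for w in wealth]
        best = max(range(len(wealth)), key=lambda i: gains[i])
        wealth[best] += 1
        received[best] += 1
    return received

# Hypothetical starting wealth: one poor, one middling, one rich person.
print(allocate_utilitarian([1, 10, 100], budget=9))
```

With these made-up numbers, all nine units go to the poorest person, even though the planner only sums utilities and gives no extra weight to the worst off.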
Indeed, that sort of logic has led a lot of roughly utilitarian EAs to focus primarily on helping the extremely poor, farm animals, wild animals, people suffering from mental health issues, etc. These EAs recognise that these groups are more disadvantaged, and thus that a given amount of resources can benefit them more than it can benefit the relatively well-off (generally speaking), and that alone is enough to indicate that one should perhaps focus on helping these groups.
So I think that, if I was purely self-interested and behind a veil of ignorance, I’d want society to be set up along roughly utilitarian lines, rather than specifically along prioritarian or egalitarian lines. I don’t think I’d want “the worst off” to be given extreme priority, beyond what utilitarianism would give them, because that’d make me lose out too much if I don’t end up in that position.
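A toy comparison, with made-up numbers, of how that choice might look. It assumes an equal chance of ending up in any position and a chooser who maximises expected utility; a maximin rule, which looks only at the worst-off position, is shown for contrast. The utility levels and the names `society_a` and `society_b` are purely illustrative, and the sketch does not settle whether people behind a veil of ignorance would in fact reason this way.

```python
def expected_utility(utilities):
    """Expected utility with an equal chance of occupying each position."""
    return sum(utilities) / len(utilities)

def worst_off(utilities):
    """Utility of the worst-off position (what a maximin rule looks at)."""
    return min(utilities)

# Hypothetical utility levels for three positions in each society.
society_a = [4, 9, 9]  # worst off does slightly worse, everyone else much better
society_b = [5, 5, 5]  # worst off is given extreme priority

societies = {"A": society_a, "B": society_b}
utilitarian_choice = max(societies, key=lambda k: expected_utility(societies[k]))
maximin_choice = max(societies, key=lambda k: worst_off(societies[k]))
print(utilitarian_choice, maximin_choice)
```

Here the expected-utility chooser picks society A and the maximin rule picks society B, which is the kind of gap between "roughly utilitarian" and "extreme priority to the worst off" that the paragraph above has in mind.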
(From memory, Moral Tribes by Joshua Greene discusses this sort of general point very well.)
Closing remarks
Firstly, I should restate that I thought this was an interesting paper, and that overall I think that I agreed with a lot of it and that I’d recommend it to people interested in the topic. I’ve disproportionately focused on what I didn’t agree with about this paper, largely because I have little to add regarding the various points I did agree with.
Secondly, I should note that there’s a pretty impressive set of people in the “Acknowledgements” section, including people who seem to me very intelligent and whose views seem worth paying attention to. This updates me a little towards thinking my critiques are somehow just mistaken. (E.g., Joshua Greene, who I mentioned as supporting/informing one of my quibbles, is listed there.)
Thirdly, I’m aware that my first two quibbles seem very related to various discussions on LessWrong and elsewhere about whether a sufficiently powerful AI would necessarily discover “the moral truth”, or whether “the moral truth” is intrinsically convincing. And I think this is also related to debates about internalism vs externalism, though I don’t know much about that. I haven’t explicitly discussed those debates here because:
- I believe the paper itself didn’t do so
- I don’t think doing so is necessary to support my modest claims that, basically, in a few places things like “this is true” or “therefore...” should be replaced by things like “this may be, or is probably, true” or “this might suggest that…”
Commentary on commentary
I might try to make a habit of writing reviews/commentaries/whatever as I read articles (e.g., this one). (Not counting articles that started on or are already linked to on LessWrong or the EA Forum, as in those cases I can just write comments.) The aims of this would be to:
- Prompt me to more explicitly think through my vague sense of “this is very clever” or “something’s not quite right here”
- Bring interesting articles to other people’s attention
- Maybe productively change the beliefs of others or myself (e.g., through comments pushing back against my commentary)
  - Ideally, this would involve the authors directly engaging with the reviews, though I’m guessing that’d be fairly rare
I guess I'll see over time how valuable that seems to be. I think it also might be cool if others did that sort of thing more often (I'm aware that some people already do).
Footnote on the second passage: note that the paper doesn’t explicitly claim that egalitarianism or prioritarianism is needed to reach that conclusion. And it does seem true that, if decision-makers started at various points other than utilitarianism, the “concerns” the paper notes would move them “in the direction of” egalitarian or prioritarian principles.