[Article review] Artificial Intelligence, Values, and Alignment

MichaelA

In January, DeepMind released a paper by Iason Gabriel called Artificial Intelligence, Values, and Alignment (author’s summary here; Rohin Shah's summary here). Here’s the abstract:

This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify ‘true’ moral principles for AI; rather, it is to identify fair principles for alignment, that receive reflective endorsement despite widespread variation in people’s moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.

I found this an interesting paper. Overall, I agreed with a lot of it, and I’d recommend it to people interested in the topic. My main hesitation is that people who already know a lot about these topics might find most of what the paper says familiar. But I’d guess that even such people would still learn something, and that they might benefit from how the paper packages and explains the things they were already familiar with.

It also felt promising to see a paper released by one of the leading AI labs close with:

Finally, the paper has treated AI as an emerging technology. However, the development of different forms of AI is not inevitable. Technologists therefore face important choices about what they want to build and why. Given the potential for AI to profoundly affect our world, these too are salient questions for our time.

But there were a few things in the paper that I felt a bit unsure about, and two passages in particular that I want to quibble with. I’ll first quote the whole first passage and give my high-level views on it, for context, and then I'll get into specifics on that passage. I'll then quote and critique the second passage. (Note that my quibbles/critiques could be mistaken, and that I’d be interested in counterarguments people might have.)

The first passage

The passage in question is supporting the third proposition mentioned above, and goes as follows:

The goal of this section is to identify principles that can govern AI in such a way that it is aligned with human values. But before we look at the options in more detail, we need to be clear about the challenge at hand. For the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines. Rather, it is to find a way of selecting appropriate principles that is compatible with the fact that we live in a diverse world, where people hold a variety of reasonable and contrasting beliefs about value.

Taking these points in turn, it is sometimes thought that if only we could identify the true moral theory then the problem of value alignment would be solved. Moreover, some authors suggest that though we may not have succeeded in identifying such an account to date, it may be possible to do so in the future after a period of ‘long reflection’—perhaps with the assistance of more powerful AI systems (Perry, 2018). Of course, we cannot know in advance what insight AI might enable, so it is sensible to remain agnostic about the long-term value of this technology for moral philosophy. But even if it could help us answer certain questions, it is very unlikely that any single moral theory we can now point to captures the entire truth about morality. Indeed, each of the major candidates, at least within Western philosophical traditions, has strongly counterintuitive moral implications in some known situations, or else is significantly underdetermined.

Furthermore, even if this were not the case and we came to have great confidence in the truth of a single moral theory, the proposed approach immediately encounters a second problem, namely that there would still be no way of reliably communicating this truth to others. For, as the philosopher John Rawls notes, human beings hold a variety of reasonable but contrasting beliefs about value. What follows from the ‘fact of reasonable pluralism’ is that even if we strongly believe we have discovered the truth about morality, it remains unlikely that we could persuade other people of this truth using evidence and reason alone (Rawls, 1999, 11-16). There would still be principled disagreement about how best to live. Designing AI in accordance with a single moral doctrine would therefore involve imposing a set of values and judgments on other people who did not agree with them. For powerful technologies, this quest to encode the true morality could ultimately lead to forms of domination (Ricaurte, 2019).

To avoid a situation in which some people simply impose their values on others, we need to ask a different question:

In the absence of moral agreement, is there a fair way to decide what principles AI should align with?

(Note: See here, and its comments, for more on “the long reflection”.)

Some of the claims in that passage are things I just outright agree with. And I think I’d outright agree with the whole passage if:

more of the claims were stated as being plausible or likely (rather than as if they were certainties), and/or
the passage provided clearer arguments for why we should believe these claims, and what the connections between the claims are (rather than seeming to take the truth of the claims, or how one implies another, as self-evident)

But as it was, the passage felt to me like it moved very fast and very confidently, in a way that gave me a sort of whiplash.

With that context in mind, I’ll now get into what I saw as the specific issues.

My first set of quibbles

I have no issue with the following claims, in themselves:

But even if it could help us answer certain questions, it is very unlikely that any single moral theory we can now point to captures the entire truth about morality. Indeed, each of the major candidates, at least within Western philosophical traditions, has strongly counterintuitive moral implications in some known situations, or else is significantly underdetermined.

But I don’t see how those claims at all support the broader point being made: i.e., that “the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines”, and/or that we would not be able to come “to have great confidence in the truth of a single moral theory”.

Instead, it seems to me that we could believe that:

there might be a “true or correct moral theory”
perhaps we should find that theory and then “implement it in machines”
perhaps we could find that theory after something like a “long reflection”

...and yet:

not believe that this true/correct moral theory would align with “any single moral theory we can now point to”, and/or
be open to the possibility that this true/correct moral theory just does have “strongly counterintuitive moral implications in some known situations”, or that it just is “significantly underdetermined”

That is, perhaps there’s a true/correct moral theory that’s substantially different to any theory we currently know of, but that could be found by an excellently implemented process of “long reflection”, and that would be an excellent thing to align our AI systems with.

To be clear, I’m not saying we necessarily should believe the above three claims. And I’m certainly not saying that we should take confident, simple versions of them as assumptions when building AGI. (Personally, I do believe all of the above three claims, but with heavy, heavy emphasis on the “might” and the “perhaps”s, and so I’d want our principles for an AGI to be open to those possibilities but definitely not to rely on them.)

What I’m saying is just that I don’t at all see how the specific premise that “it is very unlikely that any single moral theory we can now point to captures the entire truth about morality” supports the conclusion that “the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines”.

My second set of quibbles

The following passage also seems strange to me:

Furthermore, even if this were not the case and we came to have great confidence in the truth of a single moral theory, the proposed approach immediately encounters a second problem, namely that there would still be no way of reliably communicating this truth to others. For, as the philosopher John Rawls notes, human beings hold a variety of reasonable but contrasting beliefs about value. What follows from the ‘fact of reasonable pluralism’ is that even if we strongly believe we have discovered the truth about morality, it remains unlikely that we could persuade other people of this truth using evidence and reason alone (Rawls, 1999, 11-16). There would still be principled disagreement about how best to live.

Firstly, how does the empirical fact that, presently, “human beings hold a variety of reasonable but contrasting beliefs about value” strongly imply that, conditional on someone or some AI reliably identifying a true/correct moral theory, “there would still be no way of reliably communicating this truth to others”?

We’re probably talking about a very different world in this scenario where a true/correct moral theory has been reliably identified. I would assume we’d have had something like a “long reflection”, or major cognitive enhancement, or extremely advanced AI, or something like that. It seems totally plausible that what humans believe would be very different in that world.

Secondly, it does seem to me totally plausible, and perhaps likely, that even if someone had very good reason to believe they’d identified the true/correct moral theory, they still wouldn’t be able to “persuade other people of this truth using evidence and reason alone”. But that doesn’t seem certain. It doesn’t seem clear that “There would still be principled disagreement about how best to live”, or that “Designing AI in accordance with a single moral doctrine would therefore involve imposing a set of values and judgments on other people who did not agree with them” (emphasis added in both cases).

Again, I’d assume a world where identification of a true/correct moral theory has occurred is a very different world. It seems plausible that, in that world, people would be typically convinced of most true things just “using evidence and reason alone”.

And if that isn’t already the case by default, it seems plausible that some way around that obstacle could be developed, which we would not classify as a “form of domination” or as “imposing” values on people. For example, perhaps people’s intellectual capabilities could be raised to something approximating those of whatever entity had identified the true/correct moral theory. And/or perhaps people could be taken through something approximating the same lines of argument and evidence that had aided in the identification of that theory.

Thirdly, even if in practice this couldn’t be done - if not all people could be convinced - it seems debatable whether those people’s disagreement would be “principled disagreement”. It might make sense to view that disagreement as largely the result of lack of knowledge or faulty reasoning, and so to view it as wise not to fully factor in that disagreement when deciding what values AI systems should be aligned with.

Again, I’m not saying that I believe in the exact opposite claims to the claims the paper makes in this passage, or that we should assume that things will be easier than the paper suggests (see also this post). It just seems to me that the paper:

takes as certain a set of obstacles that seem to me more like “plausible” or “likely” obstacles
draws conclusions that don’t seem clearly supported from the premises given
based on these claimed certainties and conclusions, arrives at the quite substantial and confidently made claim that “the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines” - a claim which I think we should very seriously think about, and perhaps take as likely, but not necessarily a claim we should take as certain

The second passage

(This is a less important point which there’s a higher chance I’m wrong about, given that I don’t have much formal philosophical training.)

The paper also discusses:

A second approach to pluralistic value alignment[, which] focuses not on the values people already agree on, but rather on the principles they would agree upon if they were placed in a position where no one could impose their view on anyone else.

And the author writes that, if this approach was taken:

there may be certain distributive principles that would be chosen to regulate advanced AI. Without knowledge of their wealth or social standing, decision-makers might oppose large gaps between AI’s beneficiaries and those who lose out from the technology. These concerns would move them in the direction of egalitarian or prioritarian principles of justice, a strong version of which would be to insist that AI must work to ensure the greatest benefit to the least well off. To meet this condition in a global context, AI would need to benefit the world’s poorest people before it could be said to be value-aligned.

This seems to me to fit a common pattern of people thinking you need egalitarianism or prioritarianism to arrive at a conclusion that you can really arrive at with just standard utilitarianism, given the purely empirical fact that there’s diminishing marginal utility to many resources.^[1]

For example, if I know that the same amount of money is more valuable for the poor than for the rich, that not being a slave is far more valuable to a slave than having a slave is to a “free man”, etc., then standard utilitarianism would lead me to work to ensure particular focus on benefitting the least well off. I wouldn’t need to be an egalitarian or prioritarian to reach that conclusion.
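To make the diminishing-marginal-utility point concrete, here is a minimal sketch. It assumes logarithmic utility of wealth, which is purely an illustrative assumption (neither the paper nor this review specifies a utility function), and the function name, wealth figures, and budget are all hypothetical. It hands out a fixed budget in small chunks, each time to whoever gains the most utility from the next chunk, which is what a plain total-utility maximiser would do.

```python
import math


def utilitarian_allocation(wealth, budget, step=1.0):
    """Distribute `budget` in `step`-sized chunks to maximise total utility.

    Assumes utility of wealth is log(wealth), so each extra unit is worth
    less to someone who already has more (diminishing marginal utility).
    Returns how much each person received.
    """
    wealth = list(wealth)  # copy so the caller's list is not modified
    received = [0.0] * len(wealth)
    remaining = budget
    while remaining >= step:
        # Utility gained by each person if they got the next chunk.
        gains = [math.log(w + step) - math.log(w) for w in wealth]
        # Give the chunk to whoever gains the most (no priority weighting).
        best = max(range(len(wealth)), key=gains.__getitem__)
        wealth[best] += step
        received[best] += step
        remaining -= step
    return received


# Hypothetical starting wealth for three people, and 50 units to distribute.
allocation = utilitarian_allocation([10.0, 100.0, 1000.0], 50.0)
print(allocation)
```

Under these assumptions the whole budget goes to the poorest person, even though the allocation rule contains no egalitarian or prioritarian term. A different utility function could of course give a less stark result.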

Indeed, that sort of logic has led a lot of roughly utilitarian EAs to focus primarily on helping the extremely poor, farm animals, wild animals, people suffering from mental health issues, etc. These EAs recognise that these groups are more disadvantaged, and thus that a given amount of resources can benefit them more than it can benefit the relatively well-off (generally speaking), and that alone is enough to indicate that one should perhaps focus on helping these groups.

So I think that, if I was purely self-interested and behind a veil of ignorance, I’d want society to be set up along roughly utilitarian lines, rather than specifically along prioritarian or egalitarian lines. I don’t think I’d want “the worst off” to be given extreme priority, beyond what utilitarianism would give them, because that’d make me lose out too much if I don’t end up in that position.

(From memory, Moral Tribes by Joshua Greene discusses this sort of general point very well.)

Disclaimers

Firstly, I should restate that I thought this was an interesting paper, and that overall I think that I agreed with a lot of it and that I’d recommend it to people interested in the topic. I’ve disproportionately focused on what I didn’t agree with about this paper, largely because I have little to add regarding the various points I did agree with.

Secondly, I should note that there’s a pretty impressive set of people in the “Acknowledgements” section, including people who seem to me very intelligent and worth paying attention to the views of. This updates me a little towards thinking my critiques are somehow just mistaken. (E.g., Joshua Greene, who I mentioned as supporting/informing one of my quibbles, is listed there.)

Thirdly, I’m aware that my first two quibbles seem very related to various discussions on LessWrong and elsewhere about whether a sufficiently powerful AI would necessarily discover “the moral truth”, or whether “the moral truth” is intrinsically convincing. And I think this is also related to debates about internalism vs externalism, though I don’t know much about that. I haven’t explicitly discussed those debates here because:

I believe the paper itself didn’t do so
I don’t think doing so is necessary to support my modest claims that, basically, in a few places things like “this is true” or “therefore...” should be replaced by things like “this may be, or is probably, true” or “this might suggest that…”

Commentary on commentary

I might try to make a habit of writing reviews/commentaries/whatever as I read articles (e.g., this one). (Not counting articles that started on or are already linked to on LessWrong or the EA Forum, as in those cases I can just write comments.) The aims of this would be to:

Prompt me to more explicitly think through my vague sense of “this is very clever” or “something’s not quite right here”
Bring interesting articles to other people’s attention
Maybe productively change the beliefs of others or myself (e.g., through comments pushing back against my commentary)
- Ideally, this would involve the authors directly engaging with the reviews, though I’m guessing that’d be fairly rare

I guess I'll see over time how valuable that seems to be. I think it also might be cool for more people to do that sort of thing (I'm aware that some already do).

[1] Although note that the paper doesn’t make that claim explicitly. And it does seem true that, if decision-makers started at various points other than utilitarianism, the “concerns” the paper notes would move them "in the direction of" egalitarian or prioritarian principles.

I never came back to this paper after I briefly posted about it, and this seems as good a place as any to say more and continue the conversation.

What I found weird about this paper is that it focuses so much on something that seems largely irrelevant to me. I don't expect there to be much for us to choose about how to aggregate values, because I expect that most of the problem is in figuring out how to specify or find values at all. I do expect there to be some issues to resolve around aggregation, but since we don't yet know what we will be aggregating (that is, which abstractions we will be aggregating over and resolving conflicts between), it is hard to see how this kind of consideration is relevant yet.

To be fair, many may object that I am making the same mistake: I worry about understanding what values even are, and how we might verify that AI is aligned with ours, when we don't even know what AI powerful enough to need alignment will look like. So I wouldn't want to see this kind of work not happen. It just seems premature for my taste: it may be reasoning about things that won't be relevant, or won't be relevant in the way we expect, so that the work is of limited marginal value.

That said, I think, as you do, that this paper stands as an excellent signal that more mainstream AI researchers are taking problems in value alignment more seriously, and are thinking about problems of the kind that are more likely, in my estimation, to be important in the long term than short-term concerns about, for example, narrow value learning.

It sure seems like if he really grokked the philosophical and technical challenge of getting a GAI agent to be net beneficial, he would write a different paper. That first challenge sort of overshadows the task of dividing up the post-singularity pie.

But I'm not sure whether the overshadowing is merely by being bigger (in which case this paper is still doing useful work), or if we should expect that solutions to the pie-dividing problems (e.g. weighing egalitarianism vs. utilitarianism) will necessarily fall out of the process that lets the AI learn how to behave well.

If you buy a pizza cutter, but the pizza doesn't arrive, then you've wasted your money.

(Technically this is incorrect if you ever buy a pizza again, or there's something else you can use it to split, but I understand the main reason people have expressed concern about AGI is the belief that if it goes horribly wrong there won't be another chance to try again.)

Short:

I agree with MichaelA's questions about the paper.

Long:

Responses to quotes from the paper:

Furthermore, even if this were not the case and we came to have great confidence in the truth of a single moral theory, the proposed approach immediately encounters a second problem, namely that there would still be no way of reliably communicating this truth to others.

This seems incorrect - if we don't have "the one true theory" (assuming it exists), then how do we know it can't be reliably communicated? Though this may be close to hitting the nail on the head:

how do we reliably communicate "the one true theory" to "AI"?
Perhaps given that we don't know that it can be reliably communicated, we shouldn't rely on that.

Designing AI in accordance with a single moral doctrine would therefore involve imposing a set of values and judgments on other people who did not agree with them

Unless the correct moral theory doesn't involve doing that? It seems like this is just changing the name of the search.

In the absence of moral agreement, is there a fair way to decide what principles AI should align with?

It's not clear what "fair" means here - but the paper might be about looking for "morality"/achieving one of its properties under a different name, as noted above.

This seems incorrect - if we don't have "the one true theory" (assuming it exists), then how do we know it can't be reliably communicated?

To be fair to the paper, I'm not sure that that specifically is as strong an argument as it might look. E.g., I don't have a proof for [some as-yet-unproven mathematical conjecture], but I feel pretty confident that if I did come up with such a proof, I wouldn't be able to reliably communicate it to just any given random person.

But note that there I'm saying "I feel pretty confident", and "I wouldn't be able to". So I think the issue is more in the "can't", and the implication that we couldn't fix that "can't" even if we tried, rather than in the fact these arguments are being applied to something we haven't discovered yet.

That said, I do think it's an interesting and valid point that the fact we haven't found that theory yet (again, assuming it exists) adds at least a small extra reason to believe it's possible we could communicate it reliably. For example, my second-hand impression is that some philosophers think "the true moral theory" would be self-evidently true, once discovered, and would be intrinsically motivating, or something like that. That seems quite unlikely to me, and I wouldn't want to rely on it at all, but I guess it is yet another reason why it's possible the theory could be reliably communicated.

And I guess even if the theory was not quite "self-evidently true" or "intrinsically motivating", it might still be shockingly simple, intuitive, and appealing, making it easier to reliably communicate than we'd otherwise expect.

Perhaps given that we don't know that it can be reliably communicated, we shouldn't rely on that.

Yes, I'd strongly agree with that. I sort-of want us to make as few assumptions on philosophical matters as possible, though I'm not really sure precisely what that means or what that looks like.

"Designing AI in accordance with a single moral doctrine would therefore involve imposing a set of values and judgments on other people who did not agree with them"

Unless the correct moral theory doesn't involve doing that?

To again be fair to the paper, I believe the argument is that, given the assumption (which I contest) that we definitely couldn't reliably convince everyone of the "correct moral theory", if we wanted to align an AI with that theory we'd effectively end up imposing that theory on people who didn't sign up for it.

You might have been suggesting that such an imposition might be explicitly prohibited by the correct moral theory, or something like that. But in that case, I think the problem is instead that we wouldn't be able to align the AI with that theory, without at least some contradictions, if people couldn't be convinced of the theory (which, again, I don't see as certain).
