The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Eleos Arete Citrini

(first published on the EA Forum)

Short Summary

This is a commentated research agenda on the metaethics and normative ethics of AGI (artificial general intelligence) value alignment. It aims to show that various crucial aspects of this interdisciplinary Herculean task hinge on
1) numerous metaethical (and metaethics-related) questions;
2) the moral implications of different answers to these questions;
3) how relevant actors end up deciding to take the uncertainty surrounding these (meta)ethical questions into account.

My key takeaways (for the short summary):
1) At least some of the questions about metaethics and normative ethics briefly discussed in the six main parts and listed in the two appendices merit further attention.
2) Let us take seriously Nick Bostrom’s verdict that we have to do philosophy with a deadline.
3) Let us get a better, more exhaustive understanding of what work on AGI value alignment currently involves, what it could involve, and what it should involve, as well as of the most decision-relevant uncertainties, how to mitigate them, and how to handle the remainder.
4) Let us give due attention not only to the intradisciplinary work but also to the interdisciplinary work that the challenge of AGI value alignment requires.

Longer Summary

(I assume you have read the short summary before reading this.)

Artificial general intelligence (AGI) is a form of artificial intelligence exhibiting the capacity to apply its problem-solving skills in a broadly human-comparable cross-domain fashion (as opposed to its intelligence being forever narrowly restricted to a given set of tasks).

Since AGI, although currently hypothetical, is thought of by most pertinent researchers as “eventually realised” instead of “forever mere sci-fi”, there is ample need to research and discuss the implications this emerging technology might have. One of the most important questions arising in this context is how to go about AGI value alignment. This is among the most prominent interventions for increasing AGI safety and involves researching how to make AGI(s) “aligned to humans by sharing our values”. Incidentally: What exactly this even means gives rise to foundational questions that are far from settled.

There is a beautiful Value Alignment Research Landscape map by the Future of Life Institute giving an impressive overview of the interdisciplinary nature of this Herculean task. In addition to the technical aspects of AGI value alignment, there are also socio-politico-economic as well as neuro-bio-psychological, and philosophical aspects.

Of all the literature on questions in moral philosophy that artificial intelligence gives rise to, only a small share is concerned with the intersection of metaethics and artificial general intelligence. Metaethics is the discipline concerned with higher-order questions about ethics such as what the nature of morality is, what moral goodness/badness means, whether and if so, to which extent and in which form such a thing even exists, and how we should “do ethics”. At various places throughout this text, I employ the term metaethics in a very broad sense, e.g. as overlapping with moral psychology or the philosophy of mind.

In addition to two appendices (a list of further research questions and a list of lists of further research questions), there are six main parts:

1) AGIs as Potentially Value Aligned

I briefly touched upon these questions:

I think these questions are very important, but I do not think any of this should make us completely abandon hope of achieving anything in the way of AGI value alignment.

2) AGIs as Potential Moral Agents and as Potential Moral Patients

Which conditions have to be met for an artificial entity to be(come) a moral agent, i.e. an entity that, due to being capable of moral deliberation, has moral reasons to behave in some ways but not in others?
Which conditions have to be met for an artificial entity to be(come) a moral patient, i.e. an entity that should be behaved towards (by moral agents) in some ways but not in others?

As of now, we know of no entity that is widely agreed upon to be a moral agent but is widely agreed upon not to be a moral patient. The prospect of the eventual advent of some forms of artificial minds might change that. It is possible, though, that AGI will not change anything about the emptiness of this fourth category (moral agents that are not moral patients): Firstly, moral deliberation seems necessary for moral agency. Secondly, it seems possible that sentience might be necessary for (robust forms of) moral deliberation. Thus, moral agency (at least in some substantive manifestation) might perhaps be impossible without an affective first-person understanding of sentience, i.e. without experiencing sentience.

We can add to that a third premise: It seems possible that artificial sentience is not going to happen. It is thus not obvious whether any AGI will ever be(come) a moral agent in anything but at most a minimalist sense. However, none of this should make us confident that artificial moral agency (with or without moral patiency) is impossible (or possible only in some minimalist sense of moral agency) either.

More clarity on what it is that humans are doing when we are engaging in moral deliberation seems to be useful for determining what it would mean for an artificial entity to engage in moral deliberation. Closely related, getting a better understanding of necessary conditions and of sufficient conditions for moral agency and moral patiency seems to be useful for getting a better understanding of which actual and which hypothetical artificial entities (might come to) qualify as moral agents and/or moral patients.

While the philosophy of mind is not usually thought of as being within the scope of moral philosophy, I would argue there is significant overlap between the philosophy of mind and some metaethical questions. The prospect of the eventual advent of some forms of artificial minds raises questions such as whether moral agents can exist without being moral patients and/or without being sentient and/or without being sapient. Satisfyingly answering metaethical questions related to the chances and risks of creating entities that are more ethical than their creators might thus also presuppose a better understanding of consciousness.

3) AGIs as Potentially More Ethical Than Their Creators

Here, I covered three topics overlapping with metaethics that are relevant to creating AGI more ethical than its creators: The orthogonality thesis, moral truth(s), and value lock-in.

3a) The Orthogonality Thesis

The orthogonality thesis denies that we should expect higher intelligence to automatically go hand in hand with “higher/better goals”. While even some AGI experts are at least sceptical of the orthogonality thesis or the decision relevance of its implications, this is one of the few things in this text that I have an opinion about. The position I argued for is that rejecting the orthogonality thesis would be so (net-)risky that we should only do so if we were extremely confident in a universal and sufficiently strong form of internalism re moral motivation, but that we should not have this confidence. We should not assume that whatever AGI we create would feel compelled (without our having done anything to make it so) to act in accordance with the moral truths (it thinks) it is approaching. Thus, I stated that the challenge of creating AGI more ethical than its creators does not just encompass its be(com)ing better at moral deliberation than they, or we humans generally, are, but also its altogether “being good at being good”.

3b) Moral Truth(s)

I briefly explained three forms of moral scepticism but went on to argue that, to the extent that humanity is committed to arriving at or at least approximating moral truth(s), there seems to be no such thing as abandoning this mighty quest too late.

I then brought up a “family” of metaethical theories that I deem to be particularly interesting in the context of AGI value alignment: metaethical theories, universalist as well as relativist, involving idealisation (of, e.g. values).

3c) Value Lock-in

I understand a value lock-in to be a scenario where some process has rendered future changes of prevailing values of agents exerting considerable world-shaping behaviour qualitatively less feasible than before, if at all possible. There are AGI-related risks of deliberately or inadvertently locking in values that later turn out to be “wrong values” – if moral realism or moral constructivism is true and we arrive at such truth(s) – or simply values we come to wish (for whatever reason) we had not locked in. Thus, we should think that the creator(s) of the first AGI setting in stone their values (or, for that matter, some aggregation of the values of humanity at that time) is not the best course of action…

4) Unaligned AI, Self-Defeating Cosmopolitanism, and Good AI Successors

… Nor should we, on the other extreme, pursue a deliberate AGI “non-alignment”, which, given the likelihood of a sufficiently intelligent such AGI taking over and not being cosmopolitan at all, embodies what I termed self-defeating cosmopolitanism.

Questions touched upon in this part:

When is unaligned AI morally valuable? And which AIs are good successors?
Do we want to categorically rule out the possibility of any preference of any AI being given more moral weight than any conflicting preference of any human?
- If yes, how could we possibly justify this – while rejecting substratism (an emerging form of discrimination analogous to, e.g. speciesism)?
- If no, where does/might that leave us humans?

5) Lovecraftian Ethics

Considerations about the far future of Earth-originating intelligence, and Earth-originating moral patients and moral agents prompt numerous mighty questions varyingly evocative of Lovecraftian horror. It is in the nature of things that, as we continue to study consciousness and the cosmos, making scientific discoveries and exploring new philosophical questions, we will probe more and more of what I termed Lovecraftian ethics. Two examples:

Might there be forms / levels / instantiations of the morally good / bad that we (unenhanced) humans are incapable of perceiving / conceiving / caring about?
Might humanity become morally (almost) irrelevant?

What is unclear is not just what questions and answers we will encounter, but also how, if at all, our current primitive Darwinian minds can and will cope with what as of yet lies at the border of the conceivable or wholly beyond. What can already be said now, however, is this: The future of intelligence is a major determining factor of both the future of “doing ethics” and the future of (dis)value in the universe. And we can influence the future of intelligence for the better or die trying.

6) Dealing with Uncertainties

I speculated whether aiming for a universal theory of uncertainty might be a worthwhile pursuit and pointed out that there might very well be, due in part to path dependencies, an urgency in finding better answers to philosophical questions, and in turn listed some more questions that we should probably try to tackle sooner rather than later.

Given the empirical uncertainty as well as the uncertainty re normative ethics and metaethics (and meta levels above that), we would ideally lock in as few things as possible (to preserve option value) but as many as needed (to avoid catastrophes). There are at least two extremes we should avoid: earliest possible value lock-in and deliberate “non-alignment”. The most promising family of approaches for AGI value alignment seems to me to centre around ideas such as uncertainty, indirectness, plasticity, and idealisation of values / interests / desires. Broadly speaking, the best course of action appears to involve

starting with the best we have to offer – instead of insisting we know too little about ethics and are too immoral to do that
and extrapolating from that in the best possible way we can think of – instead of insisting that we need not do this since we already know enough about ethics and are already moral enough.

Additionally, I think it is crucial that more work be done on uncertainty re normative ethics and metaethics (and meta levels above that). I consider it extremely unlikely that anyone will be morally justified in being uncompromising when it comes to solving the value definition problem, i.e. deciding which values the first AGI will have (assuming the value loading problem, the technical problem of how to do this, has been fully solved by then – and assuming, probably unjustifiably, that these are fully separate problems). There seems not to be enough time left to solve normative ethics – if this is possible without AGI (or maybe human enhancement) in the first place. To tie in with this: That current approaches to value alignment seem to be more readily compatible with consequentialism than with deontological ethics or virtue ethics seems to me like a comparatively concrete and tractable problem.

I want to reiterate that this text is not an attempt at arguing at length for a definite answer to any of these questions. Rather, I imagine this text to serve as a commentated research agenda explaining its own purpose and hopefully inspiring some more people to try to tackle some of these mighty questions.

The Full Text

The full text, especially including the two appendices, is, I think, too long for the forum and is available on my blog.

Acknowledgements

This text is my deliverable of the first summer research program by the Swiss Existential Risk Initiative (CHERI). I am very grateful to the organisers of CHERI for their support. Special thanks to my CHERI-assigned mentor, François Jaquet, who helped me navigate this project. I started with but a few dozen hours of background knowledge in metaethics (the same applies, for that matter, to AGI), which is partly why our weekly video calls and his feedback proved really valuable. Views and mistakes are my own.
