Metalignment: Deconfusing metaethics for AI alignment.

Guillaume Corlouer

Epistemic status: MSFP blog post day. General and very speculative ideas.

Proposition: Deconfusing metaethics might be a promising way to increase our chances of solving AI alignment.

What do I mean by metaethics?

Metaethics here is understood as an ideal procedure that humans are approximating when they reason about ethics i.e. when they are trying to build ethical theories. Let's have a look at mathematics for an analogy. Part of mathematical practice involves using some theory of logic to prove or disprove conjectures about mathematical objects. Deriving theorems, lemmas and properties from axioms within some logic is, roughly, part of how mathematics progresses. Another analogy is how we learn about regularities in the world by approximating Solomonoff induction. It seems that we lack a comparable formalised, ideally rational procedure for ethical progress that would help us sort and generate ethical theories. Such a procedure seems difficult to figure out and potentially crucial to solving AI alignment.

Why could this be important?

A better understanding of metaethics could help us decide among different ethical theories and work out how to generate new ones. Furthermore, knowing what the world should become and how AI should interact with it might require us to make progress on how we should think about ethics, which would in turn inform how we think about aligning AI. For example, aligning AI with human values, learning and aggregating human preferences in some way, and avoiding X-risks are all ethical propositions about what we should do. It is plausible that these views are flawed and that a better understanding of how to think about ethics might make us reconsider these normative stances and clarify what alignment means.

The following intuition is one of the main reasons why I think a better understanding of metaethics might be important to AI alignment research. As I think more about ethics, argue with others about it and become more informed about the world, my ethical views evolve, and it seems that I am making some sort of progress by sharpening my reasons for holding some ethical view or for judging some ethical theory flawed. Thus I tend to place more value on my future self's moral views, to the extent that he has spent more time thinking about ethics and is more informed about the world, so I trust him more to decide how I should go about transforming it. Similarly, it might be sensible for a future AI system to instantiate a similar process of moral progress and to update its utility function or goals according to the results. Such a process, if transparent and consulted by humans, could figure out how to transform the world through some long and efficient ethical reflection.

Some examples

To clarify, the following non-exhaustive criteria are examples of how we might evaluate ethical theories, and of constraints under which we could generate new ones.

Using clean thought experiments as intuition pumps.
Constraining ethical theories by physics and other scientific domains such as evolution or computability. For example, maximising the number of 10-dimensional pink unicorns is probably not a very good ethical theory, as it asks us to bring about things that are meaningless and incompatible with the laws of physics. Science might not tell us what to do, but it can help us know what we can't do or can't consider valuable.
Formalising ethical theories further than most existing ones, which are mostly expressed in natural language. This could make it easier for an AI to evaluate and learn ethical theories, and could yield more consistent ethical theories.
Favor simplicity: avoid adding unnecessary arbitrary values.
Favor universality: try to be as observer-independent as possible. Ideally we might want our ethical theories to be applicable not just to humans but to all sorts of other physical systems.

Possible objections

This approach to AI alignment might be too top-down in its current formulation, and it faces a number of difficult challenges and objections to its being a research path worth pursuing:

There might be no such meaningful thing as 'better ethical theories' in the absolute sense but there might be some that are better for a certain class of physical systems.
Such a project might take too long to implement. There does not seem to be any consensus on which ethical theory is best, although philosophers have been arguing and thinking about ethics for a long time.
There might not be any universally compelling argument. But we might still identify a class of arguments or ethical theories that seem more viable and use them in addition to other value-learning approaches.
Formalising ethics is too hard because it's too fuzzy. Indeed ethics plausibly emerges from genetic and memetic evolution and mostly reflects humans trying to gain some value from cooperation with other humans.

Nevertheless such a project might have the positive aspect of not speeding up AI capability research while informing us about values and how to think about alignment. One important downside though would be that there might be other more promising projects to pursue instead.

Conclusion

To conclude, I would like to suggest some possible ways of working toward a better understanding of metaethics and producing better ethical theories. These are extremely broad and vague suggestions, meant to stimulate research ideas.

Build an ethical oracle that could be asked questions about ethical inconsistencies, moral blind spots or axiological problems.
Artificial philosopher: Input philosophy papers on ethics and metaethics, and output a better understanding of metaethics and more satisfying ethical theories.
Expected utility maximiser: Update the utility function in accordance with the best guess about ethics, for example one derived from the artificial philosopher or the ethical oracle. This would involve an additional step of translating ethical theories into a utility function.
Simulate humans so that they have more time to figure out more about ethics.
Formalise ethics maybe using logic, probability and game theory.
Accelerate research in moral psychology.
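
To make the expected utility maximiser suggestion above slightly more concrete, here is a minimal Python sketch under strong assumptions. It assumes the hard translation step has already been done, so that each ethical theory is given as a utility function over outcomes, and that those utilities are comparable across theories (itself a contested assumption). The names theory_a, theory_b, do_x and do_y, and all the numbers, are hypothetical and purely illustrative. The agent's utility function is a credence-weighted mixture of the theories, and updating the credences changes which action it prefers.

```python
from typing import Callable, Dict

Utility = Callable[[str], float]

def mixed_utility(credences: Dict[str, float],
                  theories: Dict[str, Utility]) -> Utility:
    """Credence-weighted average of the theories' utility functions."""
    total = sum(credences.values())

    def utility(outcome: str) -> float:
        return sum(credences[name] / total * theories[name](outcome)
                   for name in credences)

    return utility

def best_action(actions: Dict[str, Dict[str, float]],
                utility: Utility) -> str:
    """Return the action with the highest expected utility.

    `actions` maps an action name to a distribution {outcome: probability}.
    """
    def expected(distribution: Dict[str, float]) -> float:
        return sum(p * utility(outcome) for outcome, p in distribution.items())

    return max(actions, key=lambda name: expected(actions[name]))

# Hypothetical toy theories: each one values a different outcome.
THEORIES: Dict[str, Utility] = {
    "theory_a": lambda outcome: 1.0 if outcome == "x" else 0.0,
    "theory_b": lambda outcome: 1.0 if outcome == "y" else 0.0,
}

# Hypothetical actions, each leading to one outcome with certainty.
ACTIONS = {
    "do_x": {"x": 1.0},
    "do_y": {"y": 1.0},
}

# Credences before and after some (unspecified) ethical reflection.
before = mixed_utility({"theory_a": 0.7, "theory_b": 0.3}, THEORIES)
after = mixed_utility({"theory_a": 0.2, "theory_b": 0.8}, THEORIES)

print(best_action(ACTIONS, before))
print(best_action(ACTIONS, after))
```

The sketch deliberately leaves out where the credences come from. In the suggestions above, they would be the output of the ethical oracle or the artificial philosopher.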

"Metaethics here is understood as an ideal procedure that humans are approximating when they reason about ethics i.e. when they are trying to build ethical theories."

That would make sense except that "metaethics" already has a different meaning in academic philosophy, namely studying what morality itself is. (See my Six Plausible Meta-Ethical Alternatives for a really quick intro to the main metaethical positions that I think are plausible.)

What you're calling "metaethics" here corresponds better to what philosophers call metaphilosophy. I've been pushing the importance of researching metaphilosophy in the context of AI alignment for a while, so it's nice to see someone reach similar conclusions independently. :) If you're interested in my thoughts on the topic, see Some Thoughts on Metaphilosophy and the posts that it links to.

Another line of thinking that's related is CEV.

(I'll probably come back and give some more detailed feedback on the rest of the content, but just wanted to fire off these quick notes for now.)

Thanks for all the useful links! I'm also always happy to receive more feedback.

I agree that the sense in which I use metaethics in this post is different from what academic philosophers usually call metaethics. I have the impression that metaethics, in the academic sense, and metaphilosophy are related. Studying what morality itself is, how to select ethical theories and what process lies behind ethical reasoning do not seem independent. For example, if moral nihilism is more plausible, then it seems less likely that there is some meaningful feedback loop for selecting ethical theories, or that there is such a thing as a 'good' ethical theory (at least in an observer-independent way). If moral emotivism is more plausible, then maybe reflecting on ethics is more like rationalising emotions, e.g. expressing in a sophisticated way something that fundamentally just means 'boo suffering'. In that case, a better understanding of metaethics in the academic sense seems to shed some light on the process that generates ethical theories, at least in humans.

Like Wei, I'm in favor of research in this direction. I suspect we need, for example, an adequate theory of human values so that we can construct and, more importantly, verify aligned AI, but right now we are so confused about human values that I'm not sure we could even tell whether an AI was aligned or not.

I have a lot of developing thoughts in this area that have moved beyond what I was thinking the last time I tried to write up my thinking in this area a couple years ago. I'm not sure what I'll find time for in the coming months or if I'll solidify my ideas enough for them to be in a shareable state, but happy to talk more if you're interested in pursuing this direction.

Sure, I'm happy to read/discuss your ideas about this topic.

Essentially, your first suggestion is doing computer-aided analysis on ethical theories, and proving theorems under them. Right?

I am not sure what computer-aided analysis means, but one possibility could be to have formal ethical theories and prove theorems inside their formal framework. But this raises questions about the sort of formal framework that one could use to 'prove theorems' about ethics in a meaningful way.

Up to this point, I have heard the idea of an axiomatic system for ethics several times, but no suggestion of what such axioms could be. By computer-aided analysis I mean something like an automated theorem checker that searches for contradictions in the system.
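
As a rough illustration of what searching for contradictions could look like in the simplest possible setting, here is a minimal Python sketch. It assumes the ethical claims have already been written as propositional formulas, which is the hard part that nobody in this thread has a proposal for. The three variables and three axioms below are hypothetical toys, not a suggested axiomatisation. The check is a brute-force truth table, not a real theorem prover.

```python
from itertools import product
from typing import Callable, Dict, List, Optional

Assignment = Dict[str, bool]
Axiom = Callable[[Assignment], bool]

def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b

def find_model(variables: List[str],
               axioms: List[Axiom]) -> Optional[Assignment]:
    """Return a truth assignment satisfying every axiom, or None.

    None means the axioms are jointly inconsistent. The search tries all
    2^n assignments, so it only works for a handful of variables.
    """
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(axiom(assignment) for axiom in axioms):
            return assignment
    return None

# Hypothetical toy system, for illustration only.
VARIABLES = ["lying", "harm", "permitted"]
AXIOMS: List[Axiom] = [
    lambda v: implies(v["harm"], not v["permitted"]),  # harmful acts are not permitted
    lambda v: implies(v["lying"], v["harm"]),          # lying causes harm
    lambda v: v["lying"] and v["permitted"],           # this act is a permitted lie
]

model = find_model(VARIABLES, AXIOMS)
print("consistent" if model is not None else "inconsistent")
```

Dropping the third axiom makes the system consistent again, which is the kind of feedback such a checker could give. A serious version would need a richer logic (for example deontic or first-order logic) and a proper solver in place of exhaustive enumeration.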