Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Note: working on a research agenda, hence the large number of small individual posts, to have things to link to in the main documents.

In my quest to synthesise human preferences, I've occasionally been asked whether I distinguish moral preferences from other types of preferences - for example, whether preferences for Abba or Beethoven, or avocado or sausages, should rank as high as human rights or freedom of speech.

The answer is, of course not. But these are not the sort of things that should be built into the system by hand. This should be reflected in the meta-preferences. We label certain preferences "moral", and we often have the belief that these should have priority, to some extent, over merely "selfish" preferences (the extent of this belief varies from person to person, of course).

I deliberately wrote the wrong word there for this formalism - we don't have the "belief" that moral preferences are more important, we have the meta-preference that a certain class of preferences, labelled "moral", whatever that turns out to mean, should be given greater weight. This is especially the case as there are many cases where it is very unclear whether a preference is moral or not (many people have strong moral-ish preferences over mainstream cultural and entertainment choices).

This is an example of the sort of challenges that a preference synthesis process should be able to figure out on its own. If the method needs to be constantly tweaked to get over every small problem of definition, then it cannot work. As always, however, it need not get everything exactly right; indeed, it needs to be robust enough that it doesn't change much if a borderline meta-preference such as "everyone should know their own history" gets labelled as moral or not.
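As a minimal sketch of what that robustness could look like, assume (purely for illustration) that each preference carries a graded `moral_score` rather than a hard moral/non-moral label, and that a single `moral_boost` number stands in for the person's meta-preference about how much extra weight moral preferences deserve. None of these names or numbers come from the research agenda itself; they are hypothetical. The point is only that, with a smooth weighting, nudging the borderline preference towards "moral" shifts the synthesised weights slightly rather than reordering everything.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Preference:
    name: str
    strength: float     # how strongly the preference is held (hypothetical scale, 0 to 1)
    moral_score: float  # how "moral" it is judged to be (hypothetical scale, 0 to 1)


def synthesise(prefs, moral_boost=2.0):
    """Return normalised weights; moral_boost stands in for the meta-preference."""
    raw = {
        p.name: p.strength * (1.0 + (moral_boost - 1.0) * p.moral_score)
        for p in prefs
    }
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


borderline = Preference("everyone should know their own history", 0.6, 0.4)
others = [
    Preference("freedom of speech", 0.9, 1.0),
    Preference("Abba over Beethoven", 0.5, 0.0),
]

# Same preferences, but the borderline one is judged slightly more moral.
low = synthesise(others + [borderline])
high = synthesise(others + [replace(borderline, moral_score=0.6)])

# Largest change in any single weight caused by the relabelling.
shift = max(abs(low[name] - high[name]) for name in low)
print(f"largest weight shift: {shift:.3f}")
```

With a hard binary label in place of `moral_score`, the same relabelling would move the weights considerably more, which is one reason a graded treatment might be preferable; whether a real synthesis process would use anything like this weighting is an open question.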

1 comment

A large part of what we call "moral" preferences are meta-preferences about how the values of different people should be combined. For example, freedom of speech is a preference (or norm) about how different people's values about saying different things should be integrated. In the general case, such "freedom of speech" is content-free, but in real life there are situations where the preferences of person A contradict the preferences of person B so strongly that freedom of speech has to be limited (e.g. hate speech, blackmail, overly long speech, defamation).

Below I quote a few paragraphs on the topic, which I recently wrote for a draft on human values.


As a large part of human values are preferences about other people’s preferences, they can mutually exclude each other. E.g.: {I want “X loves me”, but X doesn’t want to be influenced by others’ desires}. Such a situation is typical in ordinary life, but if such values are scaled and extrapolated, one side has to be chosen: either I win, or X does.

To escape such a situation, something like the Kantian moral law, the Categorical Imperative, should be used as a meta-value, which basically regulates how different people’s values relate to each other:

Act only according to that maxim by which you can at the same time will that it should become a universal law.

In other words, the Categorical Imperative is something like “updateless decision theory”, in which you choose a policy without updating on your local position, so if everybody uses this principle, they will arrive at the same policy. (See the comparison of different decision theories developed by the LessWrong community here.)

Some human values could be derived from the Categorical Imperative, such as: it is bad to kill other people, as one doesn’t want to be killed. However, the main point is that such a meta-level principle about the relation between the values of different people can’t be derived just from observing a single person.

Moreover, most ethical principles describe interpersonal relations, so they are not about personal values, but about how the values of different people should interact. Things like the Categorical Imperative can’t be learned from observation; but they also can’t be deduced by pure logic, so they can’t be called “true” or “false”.

In other words, an AI learning human values can’t learn meta-ethical principles like the Categorical Imperative from observation, nor can it deduce them from pure math. That is why we should provide the AI with the correct decision theory, but it is not clear why a “correct theory” should exist at all.

This could also be called a meta-ethical normative assumption: some high-level ethical principles can’t be deduced from observations.