In previous posts, I've been assuming that human values are complete and consistent, but, finally, we are ready to deal with actual human values/preferences/rewards - the whole contradictory, incoherent mess of them.

Define "completely resolving human values" as an AI extracting a consistent human reward function from the inconsistent data on the preferences and values of a single human (leaving aside the easier case of resolving conflicts between different humans). This post will look at how such resolutions could be done - or at least propose an initial attempt, to be improved upon.

**EDIT: There is a problem with rendering some of the LaTeX, which I don't understand. The draft rendered it fine, but not the published version. So I've replaced some LaTeX with unicode or image files; it generally works, but there are oversized images in section 3.**

# Adequate versus elegant

Part of the problem with resolving human values is that people have been looking to do it too well and too elegantly. This results in either complete resolutions that ignore vast parts of the values (eg hedonic utilitarianism), or in thorough analyses of a tiny part of the problem (eg all the papers published on the trolley problem).

Incomplete resolutions are not sufficient to guide an AI, and elegant complete resolutions seem to be like most utopias: not any good for real humans.

Much better to aim for an *adequate* complete resolution. Adequacy means two things here:

- It doesn't lead to disastrous outcomes, according to the human's current values.
- If a human has a strong value or meta-value, that will strongly influence the ultimate human reward function, unless their other values point strongly in the opposite direction.

Aiming for adequacy is quite freeing, allowing you to go ahead and construct a resolution, which can then be tweaked and improved upon. It also opens up a whole new space of possible solutions. And, last but not least, any attempt to formalise and write a solution gives a much better understanding of the problem.

# Basic framework, then modifications

This post is a first attempt at constructing such an adequate complete resolution. Some of the details will remain to be filled in, others will doubtlessly be changed; nevertheless, this first attempt should be instructive.

The resolution will be built in three steps:

- a) It will provide a basic framework for resolving low level values, or meta-values of the same "level".
- b) It will extend this framework to account for some types of meta-values applying to lower level values.
- c) It will then allow some meta-values to modify the whole framework.

Finally, the post will conclude with some types of meta-values that are hard to integrate into this framework.

# 1 Terminology and basic concepts

Let $H$ be a human, whose "true" values we are trying to elucidate. Let $\mathcal{E}$ be the set of possible environments (including their transition rules), with $E \in \mathcal{E}$ the actual environment. And let $\mathcal{H}$ be the set of future histories that the human may encounter, from time $t$ onward (the human's *past* history is seen as part of the environment).

Let $\mathcal{R}$ be a set of rewards. We'll assume that $\mathcal{R}$ is closed under many operations - affine transformations (including negation), adding two rewards together, multiplying them together, and so on. For simplicity, assume that $\mathcal{R}$ is a real vector space, generated by a finite number of basis rewards.
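To make the "real vector space generated by a finite number of basis rewards" assumption concrete, here is a minimal sketch in Python. Everything named in it (the `History` type, the two toy basis rewards, the `Reward` class) is hypothetical and chosen only for illustration. It covers only the vector-space operations (addition, scaling, negation), not products of rewards or constant shifts.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical stand-in: a future history is just a tuple of event labels.
History = Tuple[str, ...]

# Hypothetical basis rewards, each mapping a history to a real number.
BASIS: Tuple[Callable[[History], float], ...] = (
    lambda h: float(h.count("chocolate")),
    lambda h: float(h.count("exercise")),
)


@dataclass(frozen=True)
class Reward:
    """A reward stored as coefficients over the fixed basis above."""
    coeffs: Tuple[float, ...]

    def __add__(self, other: "Reward") -> "Reward":
        # Adding two rewards adds their coefficients.
        return Reward(tuple(a + b for a, b in zip(self.coeffs, other.coeffs)))

    def __rmul__(self, scalar: float) -> "Reward":
        # Scaling a reward scales every coefficient (supports `0.5 * reward`).
        return Reward(tuple(scalar * a for a in self.coeffs))

    def __neg__(self) -> "Reward":
        # Negation is scaling by -1.
        return -1.0 * self

    def evaluate(self, history: History) -> float:
        # The reward of a history is the weighted sum of the basis rewards.
        return sum(c * b(history) for c, b in zip(self.coeffs, BASIS))


likes_chocolate = Reward((1.0, 0.0))
likes_exercise = Reward((0.0, 1.0))
combined = likes_chocolate + 0.5 * likes_exercise

history = ("chocolate", "exercise", "exercise")
print(combined.evaluate(history))  # 2.0
```

Storing coefficients instead of arbitrary functions keeps every combination of rewards inside the same finite-dimensional space, which is the simplification the paragraph above asks for.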

Then define $\mathcal{V}$ to be a set of potential values of $H$. This is defined to be all the value/preference/reward statements that $H$ might agree to, more or less strongly.

## 1.1 The role of the AI

The AI's role is to elucidate how much the human actually accepts statements in $\mathcal{V}$ (see for instance here and here). For any given $v \in \mathcal{V}$, it will compute $w_H(v)$, the weight of the value $v$. For mental calibration purposes, assume that $w_H(v)$ is in the $0$ to $1$ range, and that if the human has no current opinion on $v$, then $w_H(v)$ is zero (the converse is not true: $w_H(v)$ could be zero because the human has carefully analysed $v$ but found it to be irrelevant or negative).

The AI will also compute $e_H(v) : \mathcal{V} \cup \mathcal{R} \to \mathbb{R}$, the *endorsement* of $v$. This measures the extent to which $v$ 'approves' or 'disapproves' of a certain reward or value (there is a reward normalisation issue which I'll elide for now).

Object level values are those which are non-zero only on rewards; ie the $v \in \mathcal{V}$ for which $e_H(v)(v') = 0$ for all $v' \in \mathcal{V}$. To avoid the most obvious self-referential problem, any value's self-endorsement is assumed to be zero (so $e_H(v)(v) = 0$). As we will see below, positively endorsing a negative reward is not the same as negatively endorsing a positive reward: $e_H(v)(-R) > 0$ does not mean the same thing as $e_H(v)(R) < 0$.
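The data the AI is assumed to compute (a weight per value, plus each value's endorsements of rewards and of other values) can be sketched as plain dictionaries. The value names, reward names and numbers below are invented for illustration, and the helper functions are hypothetical rather than part of the framework itself.

```python
from typing import Dict

# Hypothetical weights: how strongly the human holds each value.
weights: Dict[str, float] = {
    "enjoy_chocolate": 0.6,
    "stay_healthy": 0.8,
    "be_disciplined": 0.3,
}

# Hypothetical endorsements: value -> {reward or value: endorsement}.
# Anything not listed is treated as an endorsement of zero.
endorsements: Dict[str, Dict[str, float]] = {
    "enjoy_chocolate": {"R_chocolate": 1.0},
    "stay_healthy": {"R_health": 1.0},
    # A meta-value: it endorses other values, not only rewards.
    "be_disciplined": {"enjoy_chocolate": -0.5, "stay_healthy": 0.5},
}

VALUES = set(weights)


def endorsement(value: str, target: str) -> float:
    """Endorsement that `value` gives to `target` (a reward or a value)."""
    if value == target:
        return 0.0  # self-endorsement is assumed to be zero
    return endorsements.get(value, {}).get(target, 0.0)


def is_object_level(value: str) -> bool:
    """True if `value` gives zero endorsement to every value."""
    return all(endorsement(value, other) == 0.0 for other in VALUES)


for v in sorted(VALUES):
    print(v, weights[v], is_object_level(v))
```

Here `enjoy_chocolate` and `stay_healthy` come out as object level values, while `be_disciplined` does not, because it has non-zero endorsements of other values.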

Then this post will attempt to define the resolution function $\Theta$, which maps weights, endorsements, and the environment to a single reward function. So if $\mathcal{I}$ is the cross product of all possible weight functions, endorsement functions, and environments: