Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I have previously discussed the constructive process of generating a reward function from the mess of human preferences.

But, let's be honest, that process was somewhat messy and ad hoc. It used things like "partial preferences", "identity preferences", "one step hypotheticals", "partial meta-preferences about the synthesis process", and so on.

The main reasons for this messiness are that human preferences and meta-preferences have so many features and requirements, and many of the intuitive terms and definitions in the area are ambiguous. So I had to chunk different aspects of human preferences into simple formal categories that were imperfect fits, and deal with each category in a somewhat different way.

Here I'll try and present a much simpler system that accomplishes the same goal. To do so, I'll rely on two main ideas: model splintering and delegation to future selves - via present selves.

Model splintering allows us to defer most of the issues with underdefined concepts. So we can use those concepts to define key parts of the preference-construction process - and punt to a future 'general solution to the model splintering problem' to make that rigorous.

Delegation to future selves allows us to hand over a lot of the more complicated aspects of the process to our future selves, as long as we are happy with the properties of these future selves. Again, model splintering is essential to allowing this to function well, since the relevant 'properties' of our future selves are themselves underdefined.

Then constructing the human preference function becomes an issue of energy minimisation.

The reward function algorithm

The setup

Let $h_{t_0}$ be a human at time $t_0$. An AI will attempt to construct $R_{h_{t_0}}$, a suitable reward function for that human; this process will be finished at time $\tau$ (when we presume the human is still around). From that moment on, the AI will maximise $R_{h_{t_0}}$.

The building blocks: partial preferences

Defining partial preferences has been tricky. These are supposed to capture our internal mental models - what happens when we compare images of the world in our minds and select which one is superior.

In the feature language of model splintering, define partial preferences as follows. Let $f$ be a feature, and $F$ a set of 'background' features. Let $\rho_F$ be a set of values that $F$ can take - a possible 'range' of the features $F$.

Then define the real number

$$r_{h_{t_0}}(f=x, f=y \mid F, \rho_F)$$

as representing how much the human prefers $f=x$ over $f=y$, given the background assumption that the features $F$ are in the range $\rho_F$.

This is a partial preference; it is only defined, pairwise, over a subset of worlds. That subset is defined to be

$$S_{f,x,y,F,\rho_F} \subseteq W \times W,$$

where $W$ is the set of worlds.

Here $(w, w') \in S_{f,x,y,F,\rho_F}$ if $w$ is defined by $f=x$, $w'$ by $f=y$, and both have the same values of $F$, which are in the range $\rho_F$.

These partial preferences will only be defined for very few values of $f$, $x$, $y$, $F$, and $\rho_F$. In the cases where $r_{h_{t_0}}(f=x, f=y \mid F, \rho_F)$ is not defined, the corresponding $S_{f,x,y,F,\rho_F}$ is the empty set.

Since writing out $f, x, y, F, \rho_F$ every time is clunky, let $\Omega$ represent all that information.
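To make the definition concrete, here is a minimal sketch of a partial preference as a data structure. Everything in it is an illustrative assumption: worlds are modelled as dictionaries from feature names to values, and the names `PartialPreference`, `applies_to` and `preference_set` are hypothetical, not taken from any existing library or from the earlier posts.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PartialPreference:
    """One partial preference: 'feature = x' is preferred to 'feature = y'."""
    feature: str          # the feature being compared
    x: object             # value on the preferred side
    y: object             # value on the other side
    background: tuple     # names of the background features
    allowed: frozenset    # permitted tuples of background values (the 'range')
    strength: float       # real number: how much x is preferred over y

    def applies_to(self, w, w_prime):
        """True if the ordered pair (w, w_prime) is covered by this preference."""
        bg = tuple(w[name] for name in self.background)
        bg_prime = tuple(w_prime[name] for name in self.background)
        return (
            w[self.feature] == self.x
            and w_prime[self.feature] == self.y
            and bg == bg_prime          # same background values in both worlds
            and bg in self.allowed      # and those values lie in the range
        )


def preference_set(pref, worlds):
    """Index pairs (i, j) such that (worlds[i], worlds[j]) is covered by pref."""
    return [
        (i, j)
        for (i, w), (j, v) in product(enumerate(worlds), repeat=2)
        if pref.applies_to(w, v)
    ]


# Toy example: cake is preferred to fruit, but only when hungry.
worlds = [
    {"dessert": "cake", "hungry": True},
    {"dessert": "fruit", "hungry": True},
    {"dessert": "cake", "hungry": False},
    {"dessert": "fruit", "hungry": False},
]
pref = PartialPreference(
    feature="dessert", x="cake", y="fruit",
    background=("hungry",), allowed=frozenset({(True,)}), strength=1.0,
)
pairs = preference_set(pref, worlds)
```

In this toy example only the pair of 'hungry' worlds is covered. The 'not hungry' pair falls outside the background range, so the preference says nothing about it.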

Energy minimisation

Let $W$ be a set of worlds. For two worlds $w, w' \in W$, let $\#_{w,w'}$ be the number of sets $S_\Omega$ that $(w, w')$ belongs to.

Then $R_{h_{t_0}}$ is defined so that it minimises the following sum:
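A minimal sketch of what such a minimisation could look like on a finite set of worlds. It assumes, purely for illustration, that the energy is a count-normalised squared error: each partial preference saying 'world $i$ beats world $j$ by $r$' contributes $(R(i) - R(j) - r)^2$, divided by the number of partial preferences covering that same ordered pair. The function name `minimise_energy` and the plain gradient-descent solver are also assumptions; any least-squares solver would do the same job.

```python
from collections import Counter


def minimise_energy(n_worlds, constraints, steps=5000):
    """Gradient descent on an assumed count-normalised squared-error energy.

    constraints: list of (i, j, r) triples, read as
    'world i is preferred to world j by the amount r'.
    Returns a list of rewards, one per world (defined up to a constant).
    """
    # Number of partial preferences covering each ordered pair of worlds.
    counts = Counter((i, j) for i, j, _ in constraints)
    weighted = [(i, j, r, 1.0 / counts[(i, j)]) for i, j, r in constraints]

    # Step size chosen from the largest weighted degree, so the descent is stable.
    degree = [0.0] * n_worlds
    for i, j, _, c in weighted:
        degree[i] += c
        degree[j] += c
    step = 1.0 / (4.0 * max(degree)) if weighted else 0.0

    R = [0.0] * n_worlds
    for _ in range(steps):
        grad = [0.0] * n_worlds
        for i, j, r, c in weighted:
            residual = R[i] - R[j] - r
            grad[i] += 2.0 * c * residual
            grad[j] -= 2.0 * c * residual
        R = [value - step * g for value, g in zip(R, grad)]
    return R


# Three worlds with mutually inconsistent preferences:
# 0 beats 1 by 1, 1 beats 2 by 1, but 0 beats 2 by 3 (not 2).
R = minimise_energy(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 3.0)])
```

The three preferences cannot all be satisfied at once, so the minimiser spreads the disagreement evenly: the fitted gaps are 4/3, 4/3 and 8/3 instead of 1, 1 and 3.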

Meta-preferences and the delegation process

In a previous post, I showed how consistency requirements for reward functions were derived from preferences of the subject.

Here we will go further, and derive a lot more. The reward $R_{h_{t_0}}$ will be a feature of the world in which the AI operates, after $\tau$. Therefore it is perfectly consistent for $h_{t_0}$ to have partial preferences over $R_{h_{t_0}}$ itself.

Since the human will still be around after $\tau$, it can also express preferences over its own identity (this is where model splintering is useful, since those preferences need not be fully defined to be useful). In the future, it will also have its own future preferences, which we will designate by $R_{h_t}$ for $t > \tau$. This is a slight abuse of notation, since these future preferences need not be reward functions.

Then a human may have preferences over $R_{h_t}$, its future preferences, and over the extent to which the AI should respect its future preferences. This sets up an energy minimisation requirement between $R_{h_{t_0}}$ and $R_{h_t}$, which serves to address issues like the AI modifying the human's preferences to bring them closer to their ideal, or the AI continuing to satisfy - or not - a future human whose preferences diverge from their current state.

So via this delegation-to-future-self-as-defined-by-the-current-agent, a lot of meta-preferences can be included into $R_{h_{t_0}}$ without having to consider different types.

A small consistency example

Consider a situation where $h_{t_0}$ has inconsistent preferences, but a strong meta-preference for consistency (of whatever type).

One way for the AI to deal with this could be as follows. It generates $R_{h_{t_0}}$, a not-very-consistent reward function. However, it ensures that the future world is one in which the human does not encounter situations where the inconsistency becomes manifest. And it pushes the human's preferences to evolve towards $R_{h_t}$, which is consistent, but actually equal to $R_{h_{t_0}}$ on the future world that the AI would create. So $R_{h_{t_0}}$ can be an energy minimisation over inconsistent preferences, but the human still inhabits a world where its preferences are consistent and the AI acts to satisfy those.

Multiple humans

In the case of multiple humans, the AI would want to construct $R_H$, a global utility function for all humans $h \in H$. In that case, we would energy-minimise as above, but sum over all humans:

Explicitly putting our (weighty) thumb on the scales

We may want to remove anti-altruistic preferences, enforce extra consistency, enforce a stronger sense of identity (or make the delegation process stronger), insist that the preferences of future agents be intrinsically respected, and so on.

The best way to do this is to do so explicitly, by assigning weights to worlds and to the partial preferences. Then the energy minimisation formula becomes:
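One way such weights could enter, written out as a sketch. All of the notation here is assumed for illustration: $H$ is the set of humans, $S^h_\Omega$ is the set of world pairs covered by human $h$'s partial preference $\Omega$, $r_h(\Omega)$ is the strength of that preference, $\#^h_{w,w'}$ is the number of such sets containing the pair $(w, w')$, $\mu(w)$ is a weight on worlds, and $\nu_h(\Omega)$ is a weight on partial preferences. The squared-error form is likewise an assumption.

```latex
E(R) = \sum_{h \in H} \; \sum_{\Omega} \; \sum_{(w, w') \in S^h_\Omega}
  \frac{\mu(w)\,\mu(w')\,\nu_h(\Omega)}{\#^h_{w,w'}}
  \bigl( R(w) - R(w') - r_h(\Omega) \bigr)^2
```

Setting every weight to 1 and taking a single human gives back an unweighted energy. Down-weighting a class of partial preferences (say, anti-altruistic ones) towards zero removes their influence on the minimiser without changing anything else in the process.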

For the issue of future entities that don't yet exist, the simplest would probably be to delegate population ethics to the partial preferences of current humans (possibly weighted or with equality between entities required), and then use that population ethics to replace the "" in the expression above.

Likely errors?

I like the elegance of the approach, so, like with all simple and reward-maximising approaches, it makes me nervous that I've missed something big that will blow a hole in the whole process. Comments and criticisms are therefore much appreciated.

The key missing piece

Of course, this all assumes that we can solve the model splintering problem in a safe fashion. But it seems that this is probably a requirement of any method of value learning, so that may not be as strong a requirement as it seems.
