Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post will look at how model splintering can be used by an AI to extend human-specified rewards beyond its training environment, and beyond the range of what humans could do.

The key points are:

  • Most descriptive labels (eg "happiness", "human being") describe collections of correlated features, rather than fundamental concepts.
  • Model splintering is when the correlated features come apart, so that the label no longer applies so well.
  • Reward splintering is when the reward itself is defined on labels that are splintering.
  • We humans deal with these issues ourselves in various ways.
  • We may be able to program an AI to deal with these in similar ways, using our feedback as needed, and continuing beyond the point where we can no longer provide it with useful feedback.

Section 1 will use happiness as an example, defining it as a bundle of correlated features and seeing what happens when these start to splinter. Section 2 defines model and reward splintering in more formal terms. And finally section 3 will analyse how an AI could detect reward splintering and deal with it.

1. What is happiness? A bundle of correlated features

How might we define happiness today? Well, here's one definition:

Happiness is that feeling that comes over you when you know life is good and you can't help but smile. It's the opposite of sadness. Happiness is a sense of well-being, joy, or contentment. When people are successful, or safe, or lucky, they feel happiness.

We can also include some chemicals in the brain, such as serotonin, dopamine, or oxytocin. There are implicit necessary features there as well, that are taken as trivially true today: happiness is experienced by biological beings, that have a continuity of experience and of identity. Genuine smiles are good indicators of happiness, as is saying "I am happy" in surveys.

So, what is happiness, today? Just like rubes and bleggs, happiness is a label assigned to a bunch of correlated features. You can think of it as similar to the "g factor", an intelligence measure that is explicitly defined as a correlation of different cognitive task abilities.

And, just like rubes and bleggs, those features need not stay correlated in general environments. We can design or imagine situations where they easily come apart. People with frozen face muscles can't smile, but can certainly be happy. Would people with anterograde amnesia be truly happy? What about simple algorithms that print out "I am happy", forever? Well, there it's a judgement call. A continuity of identity and consciousness are implicit aspects of happiness; we may decide to keep them or not. We could define the algorithm as "happy", with "happiness" expanding to cover more situations. Or we could define a new term, "simple algorithmic happiness", say, that carves out that situation. We can do the same with anterograde amnesia (my personal instincts would be to include anterograde amnesia in a broader definition of happiness, while carving off the algorithm as something else).

Part of the reason to do that is to keep happiness as a useful descriptive term - to carve reality along its natural joints. And as reality gets more complicated or our understanding of it improves, the "natural joints" might change. For example, nationalities are much less well defined than, say, eye colour. But in terms of understanding history and human nature, nationality is a key concept, eye colour much less so. The opposite is true if we're looking at genetics. So the natural joints of reality are shifting depending on space and time, and also on the subject being addressed.

Descriptive versus prescriptive

The above looks at "happiness" as a descriptive label, and how it might come to be refined or split. There are probably equations for how best to use labelled features in the descriptive sense, connected to the situations the AI is likely to find itself in, its own processing power and cost of computation, how useful it is for it to understand these situations, and so on.

But happiness is not just descriptive, it is often prescriptive (or normative): we would want an AI to increase happiness (among other things). So we attach value or reward labels to different states of affairs.

That makes the process much more tricky. If we say that people with anterograde amnesia don't have "true happiness", then we're not just saying that our description of happiness works better if we split it into two. We're saying that the "happiness" of those with anterograde amnesia is no longer a target for AI optimisation, i.e. that their happiness can be freely sacrificed to other goals.

There are some things we can do to extend happiness as preference/value/reward across such splintering:

  1. We can look more deeply into why we think "happiness" is important. For instance, we clearly value it as an interior state, so if "smiles" splinter from "internal feeling of happiness", we should clearly use the second.
  2. We can use our meta-preferences to extend definitions across the splintering. Consistency, respect for the individual, conservatism, simplicity, analogy with other values: these are ways we can extend the definition to new areas.
  3. When our meta-preferences become ambiguous - maybe there are multiple ways we could extend the preferences, depending on how the problem is presented to us - we might accept that multiple extrapolations are possible, and that we should take a conservative mix of them all, and accept that we'll never "learn" anything more.

We want to program an AI to be able to do that itself, checking in with us initially, but continuing beyond human capacity when we can no longer provide guidance.

2. Examples of model splintering

The AI uses a set $\mathcal{F}$ of features that it creates and updates itself. Only one of them is assigned by us - the feature $R$, the reward function. The AI also updates the probability distribution $Q$ over these features (this defines a generalised model, $M = \{\mathcal{F}, Q\}$). It aims to maximise the overall reward $R$.

When talking about $M$, unless otherwise specified, we'll refer to whatever generalised model the AI is currently using.

We'll train the AI with an initial labelled dataset $D$ of situations; the label is the reward value for that situation. Later on, the AI may ask for a clarification (see later sections).

Basic happiness example

An AI is trained to make humans happy. Or, more precisely, it interacts with the humans sequentially, and, after the interaction, the humans click on "I'm happy with that interaction" or "I'm disappointed with that interaction".

So let $H$ be a Boolean that represents how the human clicked, let $\pi$ be the AI's policy, and, of course, $R$ is the reward. So $\mathcal{F} = \{H, \pi, R\}$.

In the training data $D$, the AI generates a policy, or we generate a policy for it, and the human clicks on happiness or disappointment. We then assign a reward of $1$ to a click on happiness, and a reward of $0$ to a click on disappointment (thus $R = H$ on $D$). In this initial training data, the reward and $H$ are the same.

The reward extends easily to the new domain

Let's first look at a situation where model changes don't change the definition of the reward. Here the AI adds another feature $G$, which characterises whether the human starts the interaction in a good mood. Then the AI can improve its $Q$ to see how $G$ and $\pi$ interact with $H$; presumably, a good initial mood increases the chances of $H = 1$.

Now the AI has a better distribution for $H$, but no reason to doubt that $H$ is equivalent with the reward $R$.

A rewarded feature splinters

Now assume the AI adds another feature, $S$, which checks whether the human smiles or not during the interaction; add this feature to $\mathcal{F}$. Then, re-running the data $D$ while looking for this feature, the AI realises that $S$ is almost perfectly equivalent with $H$.

Here the feature on which the reward depends has splintered: it might be $H$ that determines the reward, or it might be a (slightly noisy) $S$. Or it might be some mixture of the two.

Though the rewarded feature has splintered, the reward itself hasn't, because $H$ and $S$ are so highly correlated: maximising one of them maximises the other, on the data the AI already has.

The reward itself splinters

To splinter the reward, the AI has to experience multiple situations where $H$ and $S$ are no longer correlated.

For instance, $S$ can be true without $H$ if the smiling human leaves before filling out the survey (indeed, smiling is a much more reliable sign of happiness than the artificial measure of filling out a survey). Conversely, bored humans may fill out the survey randomly, giving positive $H$ without $S$.

This is the central example of model splintering:

  • Multiple features could explain the reward in the training data $D$. But these features are now known to come apart in more general situations.
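As a rough illustration of this central example, here is a minimal Python sketch. The data points and the two candidate reward functions (`reward_from_click`, `reward_from_smile`) are hypothetical, invented only to show two explanations of the reward agreeing on the training data and disagreeing once the features come apart.

```python
# Hypothetical toy data: each situation is a pair (click, smile), where click
# is the survey answer (1 = "happy") and smile is whether the human smiled.
training_data = [(1, 1), (0, 0), (1, 1), (0, 0), (1, 1)]

# Assumed new situations where the two features come apart.
new_situations = [
    (0, 1),  # smiling human leaves before filling out the survey
    (1, 0),  # bored human clicks "happy" at random
]

def reward_from_click(click, smile):
    """Candidate reward: the survey click is what matters."""
    return click

def reward_from_smile(click, smile):
    """Candidate reward: the smile is what matters."""
    return smile

def disagreement(situations):
    """Fraction of situations where the two candidate rewards differ."""
    differing = sum(
        reward_from_click(c, s) != reward_from_smile(c, s) for c, s in situations
    )
    return differing / len(situations)

print(disagreement(training_data))   # the candidates agree on all training data
print(disagreement(new_situations))  # and disagree on every new situation
```

On the training data the two candidates are indistinguishable, so nothing in the labels tells the AI which one the reward "really" is.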

Independent features become non-additive

Another situation with model splintering would be where the reward is defined clearly by two different features, but there is never any tension between them - and then new situations appear where they are in tension.

Let's imagine that the AI is a police officer and a social worker, and its goal is to bring happiness and enforce the law. So let $\mathcal{F} = \{H, L, \pi, R\}$, where $L$ is a feature checking whether the law was enforced when it needs to be.

In the training data $D$, there are examples with $H$ being $0$ or $1$, while $L$ is undefined (no need to enforce any law). There are also examples with $L$ being $0$ or $1$, while $H$ is undefined (no survey was offered). Whenever $H$ and $L$ were $1$, the reward was $1$, while it was $0$ if they were $0$.

Then if the AI finds situations where both of $H$ and $L$ are defined, it doesn't know how to extend the reward, especially if their values contradict each other.
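A small Python sketch of this ambiguity, under stated assumptions: undefined features are represented as `None`, and the two extensions (`extend_min`, `extend_mean`) are hypothetical choices picked for illustration, not ones proposed in the post. Both reproduce the training labels exactly, yet they disagree as soon as both features are defined and in tension.

```python
from typing import Optional

def defined_values(happy: Optional[int], law: Optional[int]) -> list:
    """Collect whichever of the two features is defined (not None)."""
    return [v for v in (happy, law) if v is not None]

def extend_min(happy: Optional[int], law: Optional[int]) -> float:
    """Hypothetical extension: every defined feature must be satisfied.
    Assumes at least one feature is defined."""
    return float(min(defined_values(happy, law)))

def extend_mean(happy: Optional[int], law: Optional[int]) -> float:
    """Hypothetical extension: defined features are averaged.
    Assumes at least one feature is defined."""
    values = defined_values(happy, law)
    return sum(values) / len(values)

# Training-style situations: exactly one feature is defined.
for happy, law in [(1, None), (0, None), (None, 1), (None, 0)]:
    assert extend_min(happy, law) == extend_mean(happy, law)

# New situation: the human is happy but the law was not enforced.
print(extend_min(1, 0), extend_mean(1, 0))
```

Since the labelled data cannot separate these two extensions, any preference between them has to come from somewhere else, such as further feedback or a conservative mix.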

Reconvergence of multiple reward features

It's not automatically the case that as the AI learns more, rewards have to splinter more. Maybe the AI can develop another feature, $A$, corresponding to human approval (or approval from its programmers). Then it can see $H$ and $L$ as being specific cases of $A$ - its programmers approve of humans saying they're happy, and of the law being enforced. In that case, the AI could infer a more general reward that also covers situations where $H$ and $L$ are in contradiction with each other.

Changes due to policy choices

We've been considering situations where features have become uncorrelated just because the AI has got a broader understanding of the world. But we also need to consider situations where the AI's own policy starts breaking some of the correlations that otherwise existed.

For example, we could split $L$, enforce the law, into $L_D$, a descriptive feature describing the law, and $L_E$, a feature measuring whether that law is enforced.

Then in its training data, $L_D$ is fixed (or changes exogenously). But when the AI gets powerful, $L_D$ suddenly becomes malleable, dependent on its policy choices. This is another form of model splintering, one that we might want the AI to treat with extra caution and conservatism.

3. Dealing with model splintering

Detecting reward splintering

Reward splintering occurs when there are multiple ways of expressing $R$ on the labelled data $D$, and they lead to very different rewards in the world in general.

So we could have multiple reward functions $R_i$, all consistent with $R$ over $D$. We could define a distance function $d(R_i, R)$ which measures how far apart $R_i$ and $R$ are on $D$, and a complexity measure $C(R_i)$. Then the "goodness of fit" of $R_i$ to $R$ could be

$$F(R_i) = \exp\big(-d(R_i, R) - C(R_i)\big).$$

Thus reward functions have higher fit if they are close to $R$ on the labelled data $D$, and are simpler to express. Then define $V(R_i)$ as the maximum expected value of $R_i$ (using $M$ to compute expectations), if the AI were to use an $R_i$-maximising policy.

Then the following is always non-negative, and gives a good measure of the divergence between maximising the weighted mix of the $R_i$, versus maximising the individual $R_i$'s:

$$\sum_i F(R_i)\, V(R_i) - V\Big(\sum_i F(R_i)\, R_i\Big).$$

It is non-negative because a single policy that maximises the mix cannot do better on any individual $R_i$ than that reward's own optimal policy. When that metric hits a certain threshold, the AI knows that significant reward splintering has occurred.
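The detection metric can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: there are only three policies, the expected rewards are invented numbers, and the fit function is taken to be exponential in distance plus complexity. The metric itself is the fit-weighted sum of each candidate's best achievable value, minus the best achievable value of the fit-weighted mix.

```python
import math

# Hypothetical expected value of each candidate reward under three policies.
expected_reward = {
    "R_click": [0.9, 0.2, 0.6],
    "R_smile": [0.1, 0.9, 0.6],
}

# Assumed (distance to R on the labelled data, complexity) for each candidate.
fit_inputs = {"R_click": (0.0, 1.0), "R_smile": (0.1, 1.0)}

def fit(distance, complexity):
    """Assumed goodness of fit: higher when closer to R and simpler."""
    return math.exp(-distance - complexity)

def splintering_metric(expected_reward, fit_inputs):
    """Weighted sum of separately maximised values minus the maximised mix."""
    weights = {name: fit(*fit_inputs[name]) for name in expected_reward}
    separate = sum(weights[n] * max(vals) for n, vals in expected_reward.items())
    n_policies = len(next(iter(expected_reward.values())))
    mixed = [
        sum(weights[n] * expected_reward[n][p] for n in expected_reward)
        for p in range(n_policies)
    ]
    return separate - max(mixed)

print(splintering_metric(expected_reward, fit_inputs))
```

If the candidates ranked all policies identically, the metric would be zero; it grows as the policies that are best for each candidate diverge from the policy that is best for the mix.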

Dealing with reward splintering: conservatism

One obvious way of dealing with reward splintering is to become conservative about the rewards.

Since human value is fragile, we would initially want to feed the AI with some specific over-conservative method of conservatism (such as smooth minimums). After learning more about our preferences, it should learn fragility of value directly, so it could use a more bespoke method of conservatism[1].
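As one possible reading of "smooth minimums", here is a hedged Python sketch. The particular soft-minimum formula (a log-mean-exp with a `temperature` parameter), the policy names and the numbers are all assumptions for illustration. The soft minimum lies between the plain minimum and the mean of the candidate rewards, so a policy that scores badly on any one candidate is penalised heavily.

```python
import math

def smooth_min(values, temperature=0.1):
    """Soft minimum of a list of numbers.

    Approaches min(values) as temperature goes to 0, and the mean of the
    values as temperature grows. Shifted by the minimum for stability.
    """
    lowest = min(values)
    mean_exp = sum(
        math.exp(-(v - lowest) / temperature) for v in values
    ) / len(values)
    return lowest - temperature * math.log(mean_exp)

# Hypothetical expected values of two candidate rewards under three policies.
policy_values = {
    "please_clicker": [0.9, 0.1],
    "please_smiler": [0.2, 0.9],
    "balanced": [0.6, 0.6],
}

def conservative_choice(policy_values, temperature=0.1):
    """Pick the policy with the highest soft minimum over candidate rewards."""
    return max(
        policy_values,
        key=lambda name: smooth_min(policy_values[name], temperature),
    )

print(conservative_choice(policy_values))  # prints "balanced"
```

The temperature controls how conservative the mix is; choosing it is itself a judgement about how fragile the underlying values are.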

Dealing with reward splintering: asking for advice

The other obvious solution is to ask humans for more reward information, and thus increase the set $D$ on which it has reward information. Ideally, the AI would ask for information that best distinguishes between different reward functions that have high goodness of fit $F(R_i)$ but that are hard to maximise simultaneously.
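A minimal sketch of choosing what to ask about, under simplifying assumptions: candidate situations are hypothetical (click, smile) pairs, the candidate rewards and their fit weights are invented, and "best distinguishes" is approximated by the fit-weighted variance of the candidate rewards on a situation. This ignores the "hard to maximise simultaneously" part of the criterion, which would need the policy values as well.

```python
def weighted_variance(values, weights):
    """Variance of values under (unnormalised) non-negative weights."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total

def most_informative_query(candidates, reward_fns, weights):
    """Pick the situation on which the candidate rewards disagree the most."""
    return max(
        candidates,
        key=lambda s: weighted_variance([r(s) for r in reward_fns], weights),
    )

# Hypothetical candidate rewards over situations of the form (click, smile).
reward_fns = [lambda s: s[0], lambda s: s[1]]
weights = [0.5, 0.5]  # assumed goodness-of-fit weights

candidates = [(1, 1), (0, 0), (0, 1)]
print(most_informative_query(candidates, reward_fns, weights))  # prints (0, 1)
```

Asking the human to label the situation where the human smiled but did not click "happy" is the only query here whose answer would move the AI's beliefs between the two candidates.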

When advice starts to fail

Suppose the AI could ask question $q_1$, that would give it labelled data $D_1$. Alternatively, it could ask question $q_2$, that would give it labelled data $D_2$. And suppose that $D_1$ and $D_2$ would imply very different reward functions. Suppose further that the AI could deduce the likely $D_i$ to occur from the question $q_i$.

In that case, the AI is getting close to rigging its own learning process, essentially choosing its own reward function and getting humans to sign off on it.

The situation is not yet hopeless. The AI should prefer asking questions that are 'less manipulative' and more honest. We could feed it a dataset of questions and answers, and label some of them as manipulative and some as not. Then the AI should choose questions that have features that are closer to the non-manipulative ones.

The AI can also update its estimate of what counts as manipulation, by proposing (non-manipulatively - notice the recursion here) some example questions and getting labelled feedback as to whether these were manipulative or not.

When advice fails

At some point, if the AI continues to grow in power and knowledge, it will reach a point where it can get the feedback $D_i$ by asking question $q_i$ - and all the $q_i$ would count as "non-manipulative" according to the manipulation criteria it has learnt. And the problem extends to those criteria themselves - it knows that it can determine the future criteria, and thus future $q_i$ and $D_i$.

At this point, there's nothing that we can teach the AI. Any lesson we could give, it already knows about, or knows it could have gotten the opposite lesson instead. It would use its own judgement to extrapolate from everything it has been given so far, thus defining its final reward and completing the process of 'idealisation'. Much of this would be things it's learnt along the way, but we might want to add an extra layer of conservatism at the end of the process.

  1. Possibly with some meta-meta-considerations that modelling human conservatism is likely to underestimate the required conservatism. ↩︎

1 comment

The way I'm thinking about AGI algorithms (based on how I think the neocortex works) is, there would be discrete "features" but they all come in shades of applicability from 0 to 1, not just present or absent. And by the same token, the reward wouldn't perfectly align with any "features" (since features are extracted from patterns in the environment), and instead you would wind up with "features" being "desirable" (correlated with reward) or "undesirable" (anti-correlated with reward) on a continuous scale from -∞ to +∞. And the agent would try to bring about "desirable" things rather than maximize reward per se, since the reward may not perfectly line up with anything in its ontology / predictive world-model. (Related.)

So then you sometimes have "a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y".

That kinda has some spiritual similarity to model splintering I think, but I don't think it's exactly the same ... for example I don't think it even requires a distributional shift. (Or let me know if you disagree.) I don't see how to import your model splintering ideas into this kind of algorithm more faithfully than that.

Anyway, I agree with "conservatism & asking for advice". I guess I was thinking of conservatism as something like balancing good and bad aspects but weighing the bad aspects more. So maybe "a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y" is actually net undesirable, because the Y outweighs the X, after getting boosted up by the conservatism correction curve.

And as for asking for advice, I was thinking, if you get human feedback about this specific thing, then after you get the advice it would pattern-match 100% to desirable feature Z, and that outweighs everything else.

As for "when advice fails", I do think you ultimately need some kind of corrigibility, but earlier on there could be something like "the algorithm that chooses when to ask questions and what questions to ask does not share the same desires as the algorithm that makes other types of decisions", maybe.