Proper value learning through indifference


16


Stuart_Armstrong

A putative new idea for AI control; index here.

Many designs for creating AGIs (such as Open-Cog) rely on the AGI deducing moral values as it develops. This is a form of value loading (or value learning), in which the AGI updates its values through various methods, generally including feedback from trusted human sources. This is very analogous to how human infants (approximately) integrate the values of their society.

The great challenge of this approach is that it relies upon an AGI which already has an interim system of values, being able and willing to correctly update this system. Generally speaking, humans are unwilling to easily update their values, and we would want our AGIs to be similar: values that are too unstable aren't values at all.

So the aim is to clearly separate the conditions under which values should be kept stable by the AGI, and conditions when they should be allowed to vary. This will generally be done by specifying criteria for the variation ("only when talking with Mr and Mrs Programmer"). But, as always with AGIs, unless we program those criteria perfectly (hint: we won't) the AGI will be motivated to interpret them differently from how we would expect. It will, as a natural consequence of its program, attempt to manipulate the value updating rules according to its current values.

How could it do that? A very powerful AGI could do the time honoured "take control of your reward channel", by either threatening humans to give it the moral answer it wants, or replacing humans with "humans" (constructs that pass the programmed requirements of being human, according to the AGI's programming, but aren't actually human in practice) willing to give it these answers. A weaker AGI could instead use social manipulation and leading questioning to achieve the morality it desires. Even more subtly, it could tweak its internal architecture and updating process so that it updates values in its preferred direction (even something as simple as choosing the order in which to process evidence). This will be hard to detect, as a smart AGI might have a much clearer impression of how its updating process will play out in practice than it programmers would.

The problems with value loading have been cast into the various "Cake or Death" problems. We have some idea what criteria we need for safe value loading, but as yet we have no candidates for such a system. This post will attempt to construct one.

 

Changing actions and changing values

Imagine you're an effective altruist. You donate £10 a day to whatever the top charity on Giving What We Can is (currently Against Malaria Foundation). I want to convince you to donate to Oxfam, say.

"Well," you say, "if you take over and donate £10 to AMF in my place, I'd be perfectly willing to send my donation to Oxfam instead."

"Hum," I say, because I'm a hummer. "A donation to Oxfam isn't completely worthless to you, is it? How would you value it, compared with AMF?"

"At about a tenth."

"So, if I instead donated £9 to AMF, you should be willing to switch your £10 donations to Oxfam (giving you the equivalent value of £1 to AMF), and that would be equally good as the status quo?"

Similarly, if I want to make you change jobs, I should pay you, not the value of your old job, but the difference in value between your old job and your new one (monetary value plus all other benefits). This is the point at which you are indifferent to switching or not.

Now imagine it was practically possible to change people's value. What is the price at which a consequentialist would allow their values to be changed? It's the same argument: the price at which Gandhi should accept to become a mass murderer, is the difference (according to all of Gandhi's current values) between the expected effects of current-Gandhi and the expected effects of murderer-Gandhi. At that price, he has lost nothing (and gained nothing) by going through with the deal.

Indifference is key. We want the AGI to be motivated neither to preserve their previous values, nor to change them. It's obvious why we wouldn't want the AGI to keep its values, but the obvious isn't clear - shouldn't the AGI want to make moral progress, to seek out better values?

The problem is that having an AGI that strongly desires to improve its values is a danger - we don't know how it will go about doing so, what it will see as the most efficient way to do so, and what the long term effect might be (various forms of wireheading may be a danger). To mitigate this risk, it's better to have very close control over how the AGI desires such improvement. And the best way of doing this is to have the AGI indifferent to value change, and having a separate (possibly tunable) module that regulates any positive desire towards value improvements. This gives us a much better understanding of how the AGI could behave in this regards.

So in effect we are seeking to have AGIs that apply "conservation of expected evidence" to their values - it does not benefit them to try and manipulate their values in any way. See this post for further thoughts on the matter.

 

Pay and be paid: the price of value change

The above gives an effective model for value change indifference. It's even easier with utility-function based agents that we get to design: instead of paying them with money or changes in the world, we can pay them with utility. So, if we want to shift it from utility "v" to utility "w", it has to gain the expected difference (according to its current value function, ie v) of it being a v-maximiser versus being a w-maximiser.

So we can define a meta-utility function U, consisting of a current utility function (which the agent uses to make decisions) along with a collection of constant terms. Every time an agent changes their current utility function, a new constant term is added to undo the expected effect of the change. So for instance, if an agent hears evidence that causes it to update its current utility function from v to w, then its meta-utility U changes as:

U =  v  + (Past Constants)   →

U =  w  + E(v|v→v) - E(w|v→w) + (Past Constants).

Here (Past Constants) are previous constant terms dating from previous changes of utility, v→w denotes the change of utility function v into utility function w, and v→v denotes the conterfactual where v was left unchanged. I generally prefer to define counterfactuals, when I can, by taking a stochastic process that almost always has one outcome: i.e. a process that keeps v constant with probability 1/10^100 and otherwise takes v to w. That way, conditioning on v→v is a perfectly reasonable thing to do, but v→w is the only thing that happens in practice. This formula requires that the agent assess its own future defectiveness at accomplishing certain goals, given that it has them, so is vulnerable to the usual Löbian problems.

This formula is still imperfect. A clue is that it isn't symmetric in v and w. The problem can be seen more clearly if w=v-10, for example. Then both expected terms are equal (as changing from v to v-10 has no effect on the agent's behaviour), so the agent would derive no benefit from the change, but would end up with a penalty of -10. Hence it would not want to change, which destroys the whole point. Or conversely, if w=v+10, it would desperately want to change, and would sacrifice up to 10 of value to do so.

The problem is that our initial model is incorrect. We don't want a v maximiser that is ok changing utilities to w. We want a U maximiser that is ok changing its current v to w. To achieve that, we need to tweak the formula slightly: define the agent as a U maximiser and replace the change of of utility formula with:

U =  v  + (Past Constants)   →

U =  w  + E(v|v→v) - E(w|v→w) + (Past Constants).

Here we subtract the expected value of w (given v→w) rather than the expected value of v (note that this is similar to the formula presented here, if we assume the evidence is moral evidence rather than world knowledge).

Then under this new formula, the agent is properly indifferent to changes of the type w→v±10. Indeed, it will behave exactly as a perfect v-maximiser, until such time as it transitions to w, where it will behave as a perfect w-maximiser.

This can be seen by the following situations:

  • Suppose the agent can take immediate action to increase the value of v, from 0 to a, or delay for a while and increase the value of v from 0 to b later (no other options are possible). However, in the interim period, its utility will have changed to w (and it knows this), so will no longer be a v-maximiser. What will it do?

Answer: Just as a standard v-maximiser, if a>b, it acts immediately, and if b>a, it delays (and ultimately does nothing, because it will be a w-maximiser when the opportunity arises). To see this, note that if it acts immediately, at values transition it will have E(v|v→v)=a (because it will have acted, and no further increases are possible). If it delays, at values transition it will have E(v|v→v)=b (because if it would stay a v maximiser, it will act to achieve b in the future). So it will chose depending on which of a or b is higher (the effects on w can be ignored; see the next questions).

  • Suppose the agent is currently a v-maximiser, but can take immediate actions to change the value of w (none of these actions changes the value of v). It knows it will soon be a w-maximising agent. What will it do?

Answer: It will be indifferent between all these actions. Suppose it takes actions that increases the utility w by a. Then as it transitions to a new value, w will be worth w+a. But E(w|v→w) will also go up by a. Since that term is being subtracted, the increase by a is cancelled out, and has no effect. So a U-maximising agent that is currently using v, will not choose behaviours that increases any of the future utilities it will be using.

  • Suppose the agent is currently a v-maximiser, but will soon transition to being either a w or w' maximiser. It can manipulate its environment to ensure it becomes one or the other. What will it do?

Answer: It will be indifferent between the two options, and will continue v-maximising. At the moment of transition, the E(v|v→v) term does not contain either w or w'. As for the other terms, it has the opportunity to transition to utility w-E(w|v→w) or w'-E(w|v→w'). But notice that

E(w-E(w|v→w)|it is now a w-maximising agent)

= E(w-E(w|v→w)|v→w)

= E(w|v→w)-E(w|v→w)

= 0.

Thus the expected utility increase from transitioning to w is zero. Similarly for w'. Thus the agent has no preferences between transitioning to w or to w'. Note that this formula implies that if the agent's initial utility function is u (initially, U=u), then E(U)=E(u), as each utility function change adds zero expected increase. This could be seen as the defining property of this approach: the agent cannot generate extra utility by influencing its own values.

For all these reasons, this formula seems to give a proper model of an agent that acts only on its current utility function, but has no objections to transitioning to a new one.

Note that the formula is silent on when and how utility functions should be changed. So it won't accomplish miracles: if the agent has a stupid criteria for changing its utility, it will change them in a stupid way. All that the formula accomplishes is that the agent will not attempt to manipulate its utility function changing criteria.

 

A last wrinkle...

EDIT: Paul Christiano has convinced me that the ultra-sophisticated Cake or Death problem has been solved by the above method. Yay! There may still be an issue as to what order to update in if a piece of information has both empirical and values implications. It feels like the order shouldn't matter, but I'm not totally sure.

That's the current setup for value loading. What are people's thoughts on it?