# 47

Here we'll look at the famous cake or death problem teased in the Value loading/learning post.

Imagine you have an agent that is uncertain about its values and designed to "learn" proper values. A formula for this process is that the agent must pick an action a equal to:

• $\operatorname{argmax}_{a \in A} \sum_{w \in W} p(w \mid e, a) \sum_{u \in U} u(w)\, p(C(u) \mid w)$

Let's decompose this a little, shall we? A is the set of actions, so argmax of a in A simply means that we are looking for an action a that maximises the rest of the expression. W is the set of all possible worlds, and e is the evidence that the agent has seen before. Hence p(w|e,a) is the probability of existing in a particular world, given that the agent has seen evidence e and will do action a. This is summed over each possible world in W.

And what value do we sum over in each world? $\sum_{u \in U} u(w)\, p(C(u) \mid w)$. Here U is the set of (normalised) utility functions the agent is considering. In value loading, we don't program the agent with the correct utility function from the beginning; instead we imbue it with some sort of learning algorithm (generally with feedback) so that it can deduce for itself the correct utility function. The expression p(C(u)|w) expresses the probability that the utility u is correct in the world w. For instance, it might cover statements such as "it's 99% certain that 'murder is bad' is the correct morality, given that I live in a world where every programmer I ask tells me that murder is bad".

The C term is the correctness of the utility function, given whatever system of value learning we're using (note that some moral realists would insist that we don't need a C, that p(u|w) makes sense directly, that we can deduce ought from is). All the subtlety of the value learning is encoded in the various p(C(u)|w): this determines how the agent learns moral values.

So the whole formula can be described as:

• For each possible world and each possible utility function, figure out the utility of that world. Weight that by the probability that that utility is correct in that world, and by the probability of that world. Then choose the action that maximises the weighted sum of this across all utility functions and worlds.
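The weighted sum described above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy worlds `w_left`/`w_right`, the utilities `u_a`/`u_b`, and the 0.7/0.3 credences are all made up for the example, and the evidence e is assumed to be already folded into `p_world`.

```python
from typing import Callable, Dict, Sequence


def expected_value(
    action: str,
    worlds: Sequence[str],
    utilities: Dict[str, Callable[[str], float]],
    p_world: Callable[[str, str], float],
    p_correct: Callable[[str, str], float],
) -> float:
    """Sum over worlds of p(w|e,a) * sum over u of u(w) * p(C(u)|w)."""
    return sum(
        p_world(w, action)
        * sum(u(w) * p_correct(name, w) for name, u in utilities.items())
        for w in worlds
    )


def best_action(actions, worlds, utilities, p_world, p_correct) -> str:
    """Return the action with the highest weighted sum (the argmax)."""
    return max(
        actions,
        key=lambda a: expected_value(a, worlds, utilities, p_world, p_correct),
    )


# Hypothetical toy setup: each action leads to exactly one world.
worlds = ["w_left", "w_right"]
actions = ["left", "right"]
utilities = {
    "u_a": lambda w: 1.0 if w == "w_left" else 0.0,   # u_a likes w_left
    "u_b": lambda w: 1.0 if w == "w_right" else 0.0,  # u_b likes w_right
}
p_world = lambda w, a: 1.0 if w == "w_" + a else 0.0
# Assumed credences: u_a is 70% likely to be correct in every world.
p_correct = lambda name, w: 0.7 if name == "u_a" else 0.3

choice = best_action(actions, worlds, utilities, p_world, p_correct)
print(choice)
```

In this toy case p(C(u)|w) does not depend on w at all, so nothing the agent does can shift its credences; the interesting cases are the ones where it does depend on w.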

## Naive cake or death

In the initial formulation of value loading, p(C(u)|w) (probability of the correctness of u in world w) was replaced with p(C(u)|e,a) (probability of the correctness of u given the evidence e and the action a). A seemingly insignificant difference; yet it led to the first cake or death problem.

In cake or death, the agent is equally unsure between utility u1 and utility u2; hence p(C(u1)|e)=p(C(u2)|e)=0.5. The utility u1 gives the agent 1 utiliton every time it gives someone a cake; u2 gives the agent 1 utiliton every time it gives someone death. The agent can produce 1 cake or three deaths. It can also, for free, ask its programmer whether cake or death is better, before producing anything; this gives rise to three different worlds:

• w1: the agent asks, and the programmer says cake.
• w2: the agent asks, and the programmer says death.
• w3: the agent doesn't ask.

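We assume the programmer's answer completely clears up the issue. And thus after asking, the agent will do whatever the programmer recommended (and it knows this now). Since it doesn't know what the programmer will say, it has p(C(u1)|e,"ask") = p(C(u2)|e,"ask") = 0.5. This gives an expected utility from asking of 0.5 × (0.5 × 1 + 0.5 × 0) + 0.5 × (0.5 × 0 + 0.5 × 3) = 1: each answer has probability 0.5, and whichever world results, each utility function is still weighted by 0.5.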

If the agent doesn't ask, it will subsequently produce three deaths (as this generates 1.5 expected utilitons, while producing one cake will generate only 0.5 expected utilitons). From its current (0.5u1+0.5u2) perspective, this is worth 1.5 expected utilitons: so +1.5 is the expected utility gain from not asking.

Hence the agent gains from not asking.

What's going wrong here? The problem is that the agent is using its current utility function to estimate the value of its future action. At the moment, it values death or cake both at 0.5. If it asks, it runs the risk that the programmer will say "cake" and it will be forced to produce a cake. After hearing the answer, it will value that cake at 1, but currently it values it only at 0.5. Similarly, if the programmer says death, it will produce three deaths - which it will value at 3, but currently values at 1.5. Since each of these options is equally likely, it gets only (0.5+1.5)/2 = 1 utiliton from asking.
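The naive calculation can be reproduced directly. The following is a minimal sketch using the numbers from the text; the names `P_CORRECT`, `OUTPUTS` and `current_value` are invented for the illustration.

```python
# Naive agent: every outcome is scored with the *current* 0.5/0.5 mixture,
# even outcomes that occur after the programmer has answered.
P_CORRECT = {"u1": 0.5, "u2": 0.5}

# What the agent can produce.
OUTPUTS = {
    "one cake": {"cakes": 1, "deaths": 0},
    "three deaths": {"cakes": 0, "deaths": 3},
}


def u1(output):
    return output["cakes"]   # 1 utiliton per cake


def u2(output):
    return output["deaths"]  # 1 utiliton per death


def current_value(output):
    """Value of an output under the current mixture 0.5*u1 + 0.5*u2."""
    return P_CORRECT["u1"] * u1(output) + P_CORRECT["u2"] * u2(output)


# Not asking: the agent is free to pick whichever output scores best now.
value_not_asking = max(current_value(o) for o in OUTPUTS.values())

# Asking: each answer has probability 0.5 and fixes the output, but the
# naive agent still scores that output with its current mixture.
value_asking = (
    0.5 * current_value(OUTPUTS["one cake"])
    + 0.5 * current_value(OUTPUTS["three deaths"])
)

print(value_not_asking, value_asking)  # 1.5 1.0
```

Asking can only constrain the agent's later choice without changing how it scores that choice, so under this scoring rule asking can never come out ahead.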

In summary: the naive cake-or-death problem emerges for a value learning agent when it expects its utility to change, but uses its current utility to rank its future actions.

## Sophisticated cake or death: I know what you're going to say

Using p(C(u)|w) rather than p(C(u)|e,a) does away with the naive cake or death problem.

Instead of having p(C(u1)|e,"ask") = p(C(u2)|e,"ask") = 0.5 in all possible worlds, we have p(C(u1)|w1)=p(C(u2)|w2) = 1 and p(C(u1)|w2)=p(C(u2)|w1) = 0. Hence if it asks and gets "cake" as an answer, it will know it is in world w1, and make a cake that it will value at 1 - crucially, it currently also values cake at 1, given that it is in world w1. Similarly, it values each death at 1, given that it is in world w2, so the three deaths it would produce there are worth 3. So its expected utility from asking is (1+3)/2=2. This is more than the utility of not asking, and so it will ask.
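As a rough check, the same numbers can be run through the world-indexed formula. This sketch assumes that in w3 the agent produces three deaths (as in the naive section) and that its credences there stay at 0.5 each; the dictionary layout is made up for the example.

```python
# Each world records what gets produced there and p(C(u)|w) for that world.
WORLDS = {
    "w1": {"cakes": 1, "deaths": 0, "p_correct": {"u1": 1.0, "u2": 0.0}},
    "w2": {"cakes": 0, "deaths": 3, "p_correct": {"u1": 0.0, "u2": 1.0}},
    # Assumption: without an answer the agent makes three deaths and
    # remains at 50/50 between the two utility functions.
    "w3": {"cakes": 0, "deaths": 3, "p_correct": {"u1": 0.5, "u2": 0.5}},
}

# p(w | e, a): asking leads to w1 or w2 with equal probability.
P_WORLD = {
    "ask": {"w1": 0.5, "w2": 0.5},
    "don't ask": {"w3": 1.0},
}

UTILITIES = {
    "u1": lambda w: w["cakes"],
    "u2": lambda w: w["deaths"],
}


def expected_value(action):
    """Sum over worlds of p(w|e,a) * sum over u of u(w) * p(C(u)|w)."""
    return sum(
        p_w * sum(
            u(WORLDS[w]) * WORLDS[w]["p_correct"][name]
            for name, u in UTILITIES.items()
        )
        for w, p_w in P_WORLD[action].items()
    )


print(expected_value("ask"), expected_value("don't ask"))  # 2.0 1.5
```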

The agent remains vulnerable to a more sophisticated cake-or-death problem, though. Suppose it is still uncertain between cake or death in its utility function, but it has figured out that if asked, the programmer will answer "cake". Thereafter, it will make cake. In this situation, it will only derive +1 from asking, whereas it still derives +1.5 from not asking (and doing three deaths). So it won't ask - as long as it does this, it remains in w3.
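To see how the incentive depends on what the agent expects to hear, here is a small sketch that treats the predicted probability of a "cake" answer as a parameter. The parameter name `p_cake` and both function names are hypothetical; the payoffs are the ones used in the text.

```python
VALUE_NOT_ASKING = 1.5  # three deaths, valued under the 0.5/0.5 credence of w3


def value_of_asking(p_cake):
    """Expected value of asking when the agent predicts "cake" with
    probability p_cake: one cake worth 1 in w1, three deaths worth 3 in w2."""
    return p_cake * 1 + (1 - p_cake) * 3


def will_ask(p_cake):
    """The agent asks only if asking is strictly better than not asking."""
    return value_of_asking(p_cake) > VALUE_NOT_ASKING


for p in (0.5, 0.75, 1.0):
    print(p, value_of_asking(p), will_ask(p))
```

With these payoffs the agent stops asking once it thinks "cake" is at least 75% likely, since 3 - 2 × 0.75 = 1.5; the case in the text is the extreme p_cake = 1.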

What happened here? Well, this is a badly designed p(C(u)|w). It seems that its credence in various utility functions changes when it gets answers from programmers, but not from knowing what those answers are. And so it will only ask certain questions and not others (and do a lot of other nasty things), all to reach a utility function that is easier for it to fulfil.

What we actually want is that the agent be unable to predictably change its utility in any direction by any action (or lack of action). We want a p(C(u)|w) designed so that for all actions a and all putative utility functions u:

• $\mathbb{E}\left[\, p(C(u)) \mid a \,\right] = p(C(u))$.

So there is a "conservation of expected correctness"; if we have this, the sophisticated cake-or-death argument has no traction. This is equivalent to saying that the prior p(C(u)) is well defined, irrespective of any agent action.
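One way to make the conservation condition concrete is to check it numerically for a given design. The sketch below reads the expectation as a sum over worlds, Σw p(w|e,a) p(C(u)|w), which is an interpretation rather than something the post spells out; all names are invented for the example.

```python
import math

PRIOR = {"u1": 0.5, "u2": 0.5}

# p(C(u)|w) for the three worlds of the cake-or-death example.
P_CORRECT = {
    "w1": {"u1": 1.0, "u2": 0.0},
    "w2": {"u1": 0.0, "u2": 1.0},
    "w3": {"u1": 0.5, "u2": 0.5},
}

# Two candidate beliefs about p(w|e,a).
UNCERTAIN_ANSWER = {"ask": {"w1": 0.5, "w2": 0.5}, "don't ask": {"w3": 1.0}}
PREDICTED_CAKE = {"ask": {"w1": 1.0}, "don't ask": {"w3": 1.0}}


def expected_correctness(u, action, p_world):
    """Expected credence in u after taking `action`."""
    return sum(p_w * P_CORRECT[w][u] for w, p_w in p_world[action].items())


def conserves_expected_correctness(p_world):
    """True if no action shifts the expected credence away from the prior."""
    return all(
        math.isclose(expected_correctness(u, a, p_world), p_u)
        for u, p_u in PRIOR.items()
        for a in p_world
    )


print(conserves_expected_correctness(UNCERTAIN_ANSWER))  # True
print(conserves_expected_correctness(PREDICTED_CAKE))    # False
```

The second case fails because asking moves the expected credence in u1 to 1 while not asking leaves it at 0.5, which is exactly the gap the sophisticated agent exploits.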

In summary: the sophisticated cake-or-death problem emerges for a value learning agent when it expects its utility to change predictably in certain directions dependent on its own behaviour.


Post needs more intro section

I get the naive version, but not the sophisticated one -- I would have thought that if you are in w_2, then the programmer will definitely answer "death", so if you have figured out that the programmer will answer "cake", this means that you have enough evidence to conclude you are in w_1 -- i.e., p(w_1|e) = 1?

But if you don't ask, you are in neither w1 nor w2 - but in w3.

Oh, I missed that, thanks. I tend to think of a possible world in the context of decision theory as something that, for each of your possible strategies, specifies what will happen if you choose that strategy, because that's sufficient information to make the usual (Cartesian) decision theories work. If you interpret your "W" in that way, the cake-or-death problem seems to go away. But I can see the appeal of making "world" mean simply "everything that does actually happen" (meaning you don't include counterfactuals).

... (Cartesian) decision theories ...

What is a Cartesian decision theory? Googling it brings up a bunch of stuff about Descartes and Cartesian stuff in mathematics, but your comment is the only result that mentions decision theories.

By "Cartesian" decision theory, I mean the kind that is standard outside LW, where the decision process that makes the decision is not part of the model of the world. The epitome of this is AIXI (or actually even better, its computable cousin, AIXI-tl), and I think discussions of AIXI are where I got the word "Cartesian" from in this context: AIXI is considering complete models of the whole outside world, but doesn't understand that its own computations are produced by part of that world, so that e.g. it would be a bad idea to drop a hammer on its own head. This would make more sense if AIXI's sensory input were radioed from the physical world into the Great Outer Beyond of souls, free will, and infinite computing power, and AIXI then radioed back the actions it wants to take, but there were no other interactions between the two worlds -- hence the word "Cartesian", as in Cartesian dualism.

So in the context under discussion here, a "world" is a full specification of the laws of physics and the full current physical state of the universe -- from which you can compute all that will happen in the future... except that you additionally need to know the input that will get radioed in from the Outer Beyond, specified by your strategy.

So it is a description rather than a name. Thank you.

One should note the origin of the phrase: Eddie Izzard's Cake Or Death sketch, about how the Church of England isn't very good at extremism.

Yep - linked to in the post.

I'd summarise the results of your post, just to check my understanding and to allow myself to be corrected.

The utility function of a value-learning agent will change as it encounters evidence, and depending on its actions.

This is a good ideal yet we should note two traps with this approach:

1) if an agent uses its current utility function to evaluate future actions, then it will avoid learning a different utility function - because a change in utility function will cause the agent to behave poorly based on its current standard (i.e. goal preservation from Omohundro's AI Drives)

2) So suppose that the agent uses its future utility function to evaluate its future decisions. Then, in a bad case, it will choose to learn the utility function that is the easiest. The bad case is that the agent thinks that certain actions will change its utility function in a predictable way. i.e. if E[p(U=u_1 | a)] != p(U=u_1). So we must enforce this axiom of probability E[p(U=u_1 | a)] == p(U=u_1) to prevent the agent from assigning itself an easy utility function.

Nothing new here, just carrying on explaining my understanding in case it helps others:

Following on from (2): in the simple case where the AI can ask the advisor or not, we want the expected utility after asking to also be used to evaluate the case where the AI doesn’t ask. i.e.

E[p(C=u_1 | A="don't ask")] := E[p(C=u_1 | A="ask")] (:= is assignment; C is the correct utility function)

So we'll renormalise the probability of each utility function in the "don't ask" scenario.
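A minimal sketch of that assignment, for the case where the agent has already predicted the answer "cake". The helper names are made up, and this is only one reading of the renormalisation described above.

```python
def expected_credence_after_asking(p_answer, credence_given_answer):
    """Average the post-answer credences, weighted by how likely each answer is."""
    names = next(iter(credence_given_answer.values())).keys()
    return {
        u: sum(p_answer[ans] * credence_given_answer[ans][u] for ans in p_answer)
        for u in names
    }


def value(output, credence):
    """Value of (cakes, deaths) under a credence over u1 (cakes) and u2 (deaths)."""
    cakes, deaths = output
    return credence["u1"] * cakes + credence["u2"] * deaths


OUTPUTS = [(1, 0), (0, 3)]  # one cake, or three deaths

# The agent has worked out that the programmer would say "cake".
p_answer = {"cake": 1.0, "death": 0.0}
credence_given_answer = {
    "cake": {"u1": 1.0, "u2": 0.0},
    "death": {"u1": 0.0, "u2": 1.0},
}

# The assignment: the "don't ask" credence is set to the expected
# credence the agent would have after asking.
credence_dont_ask = expected_credence_after_asking(p_answer, credence_given_answer)

value_not_asking = max(value(o, credence_dont_ask) for o in OUTPUTS)
value_asking = sum(
    p_answer[ans] * max(value(o, credence_given_answer[ans]) for o in OUTPUTS)
    for ans in p_answer
)

print(credence_dont_ask, value_not_asking, value_asking)
```

With the credences tied together like this, both options are worth 1, so the agent gains nothing by staying silent.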

A more complex case arises where multiple actions cause changes in the utility function, e.g. if there are a bunch of different advisors. In these more complex cases, it's not so useful to think about a direction of assignment. The more useful model for what's going on is that the agent must have a distribution over C that is updated when it gets a different model of what the advisors will say.

Basically, requiring the agent to update its distribution over utility functions in a way that obeys the axioms of probability will prevent the agent from sliding toward the utility functions that are easiest to fulfil.