# 7

A putative new idea for AI control; index here.

Pascal's wager-like situations come up occasionally with expected utility, making some decisions very tricky. It means that events of the tiniest probability could dominate the whole decision - intuitively unobvious, and a big negative for a bounded agent - and that expected utility calculations may fail to converge.

There are various principled approaches to resolving the problem, but how about an unprincipled approach? We could try to bound utility functions, but the heart of the problem is not high utility, but high utility combined with low probability. Moreover, this has to behave sensibly with respect to updating.

## The agent design

Consider a UDT-ish agent A looking at input-output maps {M} (ie algorithms that could determine every single possible decision of the agent in the future). We allow probabilistic/mixed output maps as well (hence A has access to a source of randomness). Let u be a utility function, and set 0 < ε << 1 to be the precision. Roughly, we'll be discarding the highest (and lowest) utilities that are below probability ε. There is no fundamental reason that the same ε should be used for highest and lowest utilities, but we'll keep it that way for the moment.

The agent is going to make an "ultra-choice" among the various maps M (ie fixing its future decision policy), using u and ε to do so. For any M, designate by A(M) the decision of the agent to use M for its decisions.

Then, for any map M, set max(M) to be the lowest number s.t. P(u ≥ max(M)|A(M)) ≤ ε. In other words, if the agent decides to use M as its decision policy, this is the maximum utility that can be achieved if we ignore the highest valued ε of the probability distribution. Similarly, set min(M) to be the highest number s.t. P(u ≤ min(M)|A(M)) ≤ ε.

Then define the utility function uMε, which is simply u, bounded between max(M) and min(M). Now calculate the expected value of uMε given A(M), call this Eε(u|A(M)).
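Written out symbolically, the definitions so far might read as follows. This is a restatement under one assumption not in the text: "lowest number" and "highest number" are read as an infimum and a supremum, since for discrete distributions the bound need not be attained.

$$
\begin{aligned}
\max(M) &= \inf\{\, s : P(u \ge s \mid A(M)) \le \varepsilon \,\},\\
\min(M) &= \sup\{\, s : P(u \le s \mid A(M)) \le \varepsilon \,\},\\
u_M^{\varepsilon} &=
\begin{cases}
\max(M) & \text{if } u > \max(M),\\
u & \text{if } \min(M) \le u \le \max(M),\\
\min(M) & \text{if } u < \min(M),
\end{cases}\\
E_\varepsilon(u \mid A(M)) &= E\big[\,u_M^{\varepsilon} \mid A(M)\,\big].
\end{aligned}
$$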

The agent then chooses the M that maximises Eε(u|A(M)). Call this the ε-precision u-maximising algorithm.
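For a finite outcome distribution, the truncated expectation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name `eps_expectation` and the `{utility: probability}` representation are hypothetical, and "lowest number" is read as an infimum, so the upper cut-off is the largest outcome v with P(u ≥ v) > ε (and symmetrically for the lower one).

```python
def eps_expectation(dist, eps):
    """Truncated expectation of a finite distribution.

    dist maps each utility value to its probability (assumed to sum to 1).
    eps is the precision, assumed to satisfy 0 < eps < 0.5.
    """
    values = sorted(dist)

    # Upper cut-off: largest value v with P(u >= v) > eps.
    tail, upper = 0.0, values[0]
    for v in reversed(values):
        tail += dist[v]
        if tail > eps:
            upper = v
            break

    # Lower cut-off: smallest value v with P(u <= v) > eps.
    head, lower = 0.0, values[-1]
    for v in values:
        head += dist[v]
        if head > eps:
            lower = v
            break

    # Clamp every outcome into [lower, upper], then take the mean.
    return sum(p * min(max(u, lower), upper) for u, p in dist.items())


# Hypothetical example: a bet costing c that pays 1 with probability p.
eps = 0.01
c = eps ** 2
rare_win = eps_expectation({1 - c: 0.75 * eps, -c: 1 - 0.75 * eps}, eps)
likely_win = eps_expectation({1 - c: 1.5 * eps, -c: 1 - 1.5 * eps}, eps)
print(rare_win, likely_win)
```

In the first bet the win has probability below ε, so it is clamped away and only the cost remains; in the second the win survives the truncation.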

## Stability of the design

The above decision process is stable, in that there is a single ultra-choice to be made, and clear criteria for making that ultra-choice. Realistic and bounded agents, however, cannot calculate all the M in sufficient detail to get a reasonable outcome. So we can ask whether the design is stable for a bounded agent.

Note that this question is underdefined, as there are many ways of being bounded, and many ways of cashing out ε-precision u-maximising into bounded form. Most likely, this will not be a direct expected utility maximisation, so the algorithm will be unstable (prone to change under self-modification). But how exactly it's unstable is an interesting question.

I'll look at one particular situation: one where A was tasked with creating subagents that would go out and interact with the world. These agents are short-sighted: they apply ε-precision u-maximising not to the ultra-choice, but to each individual expected utility calculation (we'll assume the utility gains and losses for each decision are independent).

A has a single choice: what to set ε to for the subagents. Intuitively, it would seem that A would set ε lower than its own value; this could correspond roughly to an agent self-modifying to remove the ε-precision restriction from itself, converging on becoming a u-maximiser. However:

• Theorem: There are (stochastic) worlds in which A will set the subagent precision to be higher, lower or equal to its own precision ε.

The proof will be by way of illustration of the interesting things that can happen in this setup. Let B be the subagent whose precision A sets.

Let C(p) be a coupon that pays out 1 with probability p. xC(p) simply means the coupon pays out x instead of 1. Each coupon costs ε² utility. This is negligible, and only serves to break ties. Then consider the following worlds:

• In W1, B will be offered the possibility of buying C(0.75ε).
• In W2, B will be offered the possibility of buying C(1.5ε).
• In W3, B will be offered the possibility of buying C(0.75ε), and the offer will be made twice.
• In W4, B will be offered, with 50% probability, the possibility of buying C(1.5ε).
• In W5, B will be offered, with 50% probability, the possibility of buying C(1.5ε), and otherwise the possibility of buying 2C(1.5ε).
• In W6, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.5ε).
• In W7, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.05ε).

From A’s perspective, the best input-output maps are: in W1, don’t buy, in W2, buy, in W3, buy both, in W4, don’t buy (because the probability of getting above 0 utility by buying, is, from A's initial perspective, 1.5ε/2 = 0.75ε).

W5 is more subtle, and interesting – essentially A will treat 2C(1.5ε) as if it were C(1.5ε) (since the probability of getting above 1 utility by buying is 1.5ε/2 = 0.75ε, while the probability of getting above zero by buying is (1.5ε+1.5ε)/2=1.5ε). Thus A would buy everything offered.

Similarly, in W6, the agent would buy everything, and in W7, the agent would buy nothing (since the probability of getting above zero by buying is now (1.05ε + 0.75ε)/2 = 0.9ε).

So in W1 and W2, the agent can leave the sub-agent precision at ε. In W3, it needs to lower it below 0.75ε. In W4, it needs to raise it above 1.5ε. In W5 it can leave it alone, while in W6 it must lower it below 0.75ε, and in W7 it must raise it above 1.05ε.
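The world-by-world analysis above can be checked by brute force. The sketch below is an illustration, not the author's procedure: all function names and the encoding of worlds as `(branch probability, [(payout, win probability), ...])` are hypothetical, ε is fixed at 0.01, and only three candidate subagent precisions (ε/2, ε, 2ε) are tried. A picks the policy maximising the truncated expectation over the whole world, while B applies the same rule to each coupon in isolation.

```python
from itertools import product

def eps_expectation(dist, eps):
    """Truncated expectation of a finite {utility: probability} distribution."""
    values = sorted(dist)
    tail, upper = 0.0, values[0]
    for v in reversed(values):      # largest v with P(u >= v) > eps
        tail += dist[v]
        if tail > eps:
            upper = v
            break
    head, lower = 0.0, values[-1]
    for v in values:                # smallest v with P(u <= v) > eps
        head += dist[v]
        if head > eps:
            lower = v
            break
    return sum(p * min(max(u, lower), upper) for u, p in dist.items())

def utility_distribution(world, policy, cost):
    """Distribution of total utility when `policy` says which coupons to buy."""
    dist = {}
    for (branch_p, coupons), buys in zip(world, policy):
        bought = [c for c, b in zip(coupons, buys) if b]
        for wins in product([False, True], repeat=len(bought)):
            p, u = branch_p, -cost * len(bought)
            for (payout, win_p), won in zip(bought, wins):
                p *= win_p if won else 1 - win_p
                u += payout if won else 0
            dist[u] = dist.get(u, 0.0) + p
    return dist

def best_policy_for_A(world, eps, cost):
    """A's ultra-choice: maximise the truncated expectation over all policies."""
    options = [list(product([False, True], repeat=len(cs))) for _, cs in world]
    return max(product(*options),
               key=lambda pol: eps_expectation(
                   utility_distribution(world, pol, cost), eps))

def policy_of_B(world, delta, cost):
    """Short-sighted B: judge each coupon on its own with precision delta."""
    def buys(coupon):
        dist = utility_distribution([(1.0, [coupon])], [(True,)], cost)
        return eps_expectation(dist, delta) > 0
    return tuple(tuple(buys(c) for c in cs) for _, cs in world)

EPS = 0.01
COST = EPS ** 2
WORLDS = {
    "W1": [(1.0, [(1, 0.75 * EPS)])],
    "W2": [(1.0, [(1, 1.5 * EPS)])],
    "W3": [(1.0, [(1, 0.75 * EPS), (1, 0.75 * EPS)])],
    "W4": [(0.5, [(1, 1.5 * EPS)]), (0.5, [])],
    "W5": [(0.5, [(1, 1.5 * EPS)]), (0.5, [(2, 1.5 * EPS)])],
    "W6": [(0.5, [(1, 0.75 * EPS)]), (0.5, [(2, 1.5 * EPS)])],
    "W7": [(0.5, [(1, 0.75 * EPS)]), (0.5, [(2, 1.05 * EPS)])],
}

def matching_precisions(world, factors=(0.5, 1.0, 2.0)):
    """Multiples of EPS at which B reproduces A's preferred policy."""
    target = best_policy_for_A(world, EPS, COST)
    return [f for f in factors if policy_of_B(world, f * EPS, COST) == target]

for name, world in WORLDS.items():
    print(name, matching_precisions(world))
```

Under these assumptions, precision ε should be among the matches in W1, W2 and W5, only ε/2 should match in W3 and W6, and only 2ε in W4 and W7, which is the pattern the theorem needs.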

## Irrelevant information

• Theorem: Assume X is a random variable that is irrelevant to the utility function u. If A (before knowing X) has to design successor agents that will exist after X is revealed, then (modulo a few usual assumptions about only decisions mattering, not internal thought processes) it will make these successor agents isomorphic to copies of itself, i.e. ε-precision u-maximising algorithms (potentially with a different way of breaking ties).

These successor agents are not the short-sighted agents of the previous model, but full ultra-choice agents. Their ultra-choice is over all decisions to come, while A's ultra-choice (which is simply a choice) is over all agent designs.

For the proof, I'll assume X is boolean valued (the general proof is similar). Let M be the input-output map A would choose for itself, if it were to make all the decisions itself rather than just designing a subagent. Now, it's possible that M(X) will be different from M(¬X) (here M(X) and M(¬X) are contractions of the input-output map by adding in one of the inputs).

Define the new input-output map M' by defining a new internal variable Y in A (recall that A has access to a source of randomness). Since this variable is new, M is independent of the value of Y. Then M' is defined as M with X and Y permuted. Since both Y and X are equally irrelevant to u, Eε(u|A(M))=Eε(u|A(M')), so M' is an input-output map that fulfils the ε-precision u-maximising. And M'(X)=M'(¬X), so M' is independent of X.

Now consider the subagent that runs the same algorithm as A, and has seen X. Because of the irrelevance of X, M'(X) will still fulfil ε-precision u-maximising (we can express any fact relevant to u in the form of Zs, with P(Z)=P(Z|X), and then the algorithm is the same).

Similarly, a subagent that has seen ¬X will run M'(¬X). Putting these together, the subagent will expect to run M'(X) with probability P(X) and M'(¬X) with probability P(¬X)=1-P(X).

Since M'(X)=M'(¬X), this whole thing is just M'. So if A creates a copy of itself (possibly tweaking the tie-breaking so that M' is selected), then it will achieve its maximum according to ε-precision u-maximising.


Typo?

Yep! Corrected now.

Consider an outcome with utility 4/epsilon, which you are considering incurring a cost of 1 to achieve with probability epsilon/2.

This design seems to essentially be saying:

"Yes, that outcome is good. But it's not so good that you should be willing to pay a cost of 1 to achieve it with probability only epsilon."

But written mathematically, that is precisely the claim that the utility of the outcome is less than 1/epsilon. That's exactly the kind of data that the utility function was intended to represent.

As far as I can tell the only reason to use this kind of design is because you explicitly want intransitive preferences. That is, you want to be able to say "A 50% chance of B is better than A, and a 50% chance of C is better than B, ... but a 0.0001% chance of Z is worse than A."

If you want to have this kind of intransitivity, it seems like you should explicitly engage with why it is confusing rather than trying to build agents that exhibit it without looking obviously crazy.

? After the ultra choice is made, the agent is a standard expected utility maximiser for uMε, so there is no transitivity issue.

And the point of the agent is that we know that odd things happen around very small probabilities and very large utilities. One way of avoiding that is to bound the utility - but the bound is an arbitrary choice. This approach also bounds the utility, but by choosing the bound point as being something reached with very low probability, rather than an a priori choice.

U is supposed to actually reflect your preferences. The fact that you want to make this adjustment demonstrates that either:

1. You don't trust the agent's probability distribution, or
2. You don't actually have the values represented by U.

As far as I can tell you are saying (2), since the "odd things" you refer to don't depend on epistemic errors (and also apply to your own beliefs).

So: if your values aren't represented by U, then why are you building an agent that maximizes U? Instead, consider building an agent that maximizes your values.

The point of changing the utility function isn't to patch up odd behavior, it's to make the utility function represent your beliefs.

Some people genuinely believe in unbounded utility, find no way of rejecting the Pascal Muggers, but desire to do so (if only because of the immense potential damage of being known to be Pascal muggable).

There's also a wide latitude in fixing your beliefs in these extreme cases. People's values are generally solid for everyday events, and much more tentative in these cases. I'd wager that most people who fix their values in the tail would find that living in the worlds they describe is much different from what they expected, and that they would want to change their values if they could.

An unbounded utility function does not literally make you "Pascal's muggable"; there are much better ways to seek infinite utility than to pay a mugger.

But it's even worse than that, since an unbounded utility function will make the utility of every realistic outcome undefined; an unbounded real-valued utility function doesn't represent any complete set of preferences over infinite lotteries. So I agree that someone who describes their preferences with an unbounded utility function needs to clarify what they actually want.

But if people "genuinely believe in" (i.e. "actually have") unbounded utility functions, they would be horrified by the prospect of passing up massive gains just because those gains are improbable. That's precisely what it means to have an unbounded utility function. I don't understand what it means to "actually have" an unbounded utility function but to be happy with that proposed resolution.

So it seems like those people need to think about what their preferences are over infinitely valuable distributions over outcomes; once they have an answer, so that they can actually determine what is recommended by their purported preferences, then they can think about whether they like those recommendations.

What do you mean "they would want to change their values if they could"? That sounds nonsensical. I could round it to "they would find that their values differ from the simple model of their values they had previously put forward." But once I've made that substitution, it's not really clear how this is relevant to the discussion, except as an argument against taking extreme and irreversible actions on the basis of a simple model of your values that looks appealing at the moment.

> An unbounded utility function does not literally make you "Pascal's muggable"; there are much better ways to seek infinite utility than to pay a mugger.

Have you solved that problem, then? Most people I've talked to don't seem to believe it's solved.

> except as an argument against taking extreme and irreversible actions on the basis of a simple model of your values that looks appealing at the moment.

The approach I presented is designed so that you can get as close to your simple model, while reducing the risks of doing so.

> Have you solved that problem, then? Most people I've talked to don't seem to believe it's solved.

You aren't supposed to literally pay the mugger; it's an analogy. Either:

(1) you do something more promising to capture hypothetical massive utility (e.g. this happens if we have a plausible world model and place a finite but massive upper bound on utilities), or (2) you are unable to make a decision because all payoffs are infinite.

? I don't see why the world needs to be sufficiently convenient to allow (1). And the problem resurfaces with huge-but-bounded utilities, so invoking (2) is not enough.

You cited avoiding the "immense potential damage of being known to be Pascal muggable" as a motivating factor for actual humans, suggesting that you were talking about the real world. There might be some damage from being "muggable," but it's not clear why being known to be muggable is a disadvantage, given that here in the real world we don't pay the mugger regardless of our philosophical views.

I agree that you can change the thought experiment to rule out (1). But if you do, it loses all of its intuitive force. Think about it from the perspective of someone in the modified thought experiment:

You are 100% sure there is no other way to get as much utility as the mugger promises at any other time in the future of the universe. But somehow you aren't so sure about the mugger's offer. So this is literally the only possible chance in all of history to get an outcome this good, or even anywhere close. Do you pay then?

"Yes" seems like a plausible answer (even before the mugger opens her mouth). The real question is how you came to have such a bizarre state of knowledge about the world, not why you are taking the mugger seriously once you do!

> but it's not clear why being known to be muggable is a disadvantage, given that here in the real world we don't pay the mugger regardless of our philosophical views.

Being known to be muggable attracts people to give it a try. But if we don't pay the mugger in reality, then we can't be known to be muggable, because we aren't.

> You are 100% sure there is no other way to get as much utility as the mugger promises at any other time in the future of the universe.

It doesn't seem unreasonable to get quasi 100% if the amount the mugger promises is sufficiently high ("all the matter in all the reachable universe dedicated to building a single AI to define the highest number possible - that's how much utility I promise you").

But what do you mean by "genuinely believe in unbounded utility"? Given the way Von Neumann and Morgenstern define numerical utility, unbounded utility basically just means that you have desires that lead you to keep accepting certain bets, no matter how low the probability goes. They talk about this in their work:

> And yet the concept of mathematical expectation has been often questioned, and its legitimateness is certainly dependent upon some hypothesis concerning the nature of an “expectation.” Have we not then begged the question? Do not our postulates introduce, in some oblique way, the hypotheses which bring in the mathematical expectation?
>
> More specifically: May there not exist in an individual a (positive or negative) utility of the mere act of “taking a chance,” of gambling, which the use of the mathematical expectation obliterates?

They go on to say that "we have practically defined numerical utility as being that thing for which the calculus of mathematical expectations is legitimate," and conclude that "concepts like a 'specific utility of gambling' cannot be formulated free of contradiction on this level."

Or in other words, saying that for any given utility, there exists a double utility, just does not mean anything except that there is some object such that you consider a 50% chance of it, and a 50% chance of nothing, equal in value to the first utility. In a similar way, you cannot assert that you have an unbounded utility function in their sense, unless there is some reward such that you are willing to pay \$100 for objective odds of one in a googolplex of getting the reward. If you will not pay \$100 for any reward whatsoever at these odds (as I will not) then your utility function is not unbounded in the Von Neumann Morgenstern sense.

This is just a mathematical fact. If you still want to say your utility function is unbounded despite not accepting any bets of this kind, then you need a new definition of utility.

People don't have utilities; we have desires, preferences, moral sentiments, etc... and we want to (or have to) translate them into utility-equivalent formats. We also have meta-preferences that we want to respect, such as "treat the desires/happiness/value of similar beings similarly". That leads straight to unbounded utility as the first candidate.

So I'm looking at "what utility should we choose" rather than "what utility do we have" (because we don't have any currently).

I agree that we do not objectively have a utility function, but rather the kinds of things that you mention. I am simply saying that the utility function that those things most resemble is a bounded utility function, and people's absolute refusal to do anything for the sake of an extremely small probability proves that fact.

I am not sure that the meta-preference that you mention "leads straight to unbounded utility." However, I agree that understood in a certain way it might lead to that. But if so, it would also lead straight to accepting extremely small probabilities of extremely large rewards. I think that people's desire to avoid the latter is stronger than their desire for the former, if they have the former at all.

I do not have that particular meta-preference because I think it is a mistaken result of a true meta-preference for being logical and reasonable. I think one can be logical and reasonable while preferring benefits that are closer to benefits that are more distant, even when those benefits are similar in themselves.

> I think that people's desire to avoid the latter is stronger than their desire for the former, if they have the former at all.

Yes, which is what my system is set up for. It allows people to respect their meta-preference, up to the extent where mugging and other issues become possible.

An alternative to bounded utility is to suppose that probabilities go to zero faster than utility. In fact, the latter is a generalisation of the former, since the former is equivalent to supposing that the probability is zero for large enough utility.

However, neither "utility is bounded" nor "probabilities go to zero faster than utility" amount to solutions to Pascal's Mugging. They only indicate directions in which a solution might be sought. An actual solution would provide a way to calculate, respectively, the bound or the limiting form of probabilities for large utility. Otherwise, for any proposed instance of Pascal's Mugging, there is a bound large enough, or a rate of diminution of P*U low enough, that you still have to take the bet.

Set the bound too low, or the diminution too fast ("scope insensitivity"), and you pass up some gains that some actual people think extremely worth while, such as humanity expanding across the universe instead of being limited to the Earth. Telling people they shouldn't believe in such value, while being unable to tell them how much value they should believe in isn't very persuasive.

This alternative only works because it asserts that such and such a bet is impossible, e.g. there is a reward that you would pay \$100 for if the odds were one in a googolplex of getting the reward, but in fact the odds for that particular reward are always less than one in a googolplex.

That still requires you to bite the bullet of saying that yes, if the odds were definitely one in a googolplex, I would pay \$100 for that bet.

But for me at least, there is no reward that I would pay \$100 for at those odds. This means that I cannot accept your alternative. And I don't think that there are any other real people who would consistently accept it either, not in real life, regardless of what they say in theory.

> That still requires you to bite the bullet of saying that yes, if the odds were definitely one in a googolplex, I would pay \$100 for that bet.

The idea of P*U tending to zero axiomatically rules out the possibility of being offered that bet, so there is no need to answer the hypothetical. No probabilities at the meta-level.

Or, if you object to the principle of no probabilities at the meta-level, the same objection can be made to bounded utility. This requires you to bite the bullet of saying that yes, if the utility really were that enormous etc.

The same applies to any axiomatic foundation for utility theory that avoids Pascal's Mugging. You can always say, "But what if [circumstance contrary to those axioms]? Then [result those axioms rule out]."

The two responses are not equivalent. The utility in a utility function is subjective in the sense that it represents how much I care about something; and I am saying that there is literally nothing that I care enough about to pay \$100 for a probability of one in a googolplex of accomplishing it. So for example if I knew for an absolute fact that for \$100 I could get that probability of saving 3^^^^^^^^^3 lives, I would not do it. Saying the utility can't be that enormous does not rule out any objective facts: it just says I don't care that much. The only way it could turn out that "if the utility really were that enormous" would be if I started to care that much. And yes, I would pay \$100 if it turned out that I was willing to pay \$100. But I'm not.

Attempting to rule out a probability by axioms, on the other hand, is ruling out objective possibilities, since objective facts in the world cause probabilities. The whole purpose of your axiom is that you are unwilling to pay that \$100, even if the probability really were one in a googolplex. Your probability axiom is simply not your true rejection.

> Saying the utility can't be that enormous does not rule out any objective facts: it just says I don't care that much.

To say you don't care that much is a claim of objective fact. People sometimes discover that they do very much care (or, if you like, change to begin to very much care) about something they did not before. For example, conversion to ethical veganism. You may say that you will never entertain enormous utility, and this claim may be true, but it is still an objective claim.

And how do you even know? No-one can exhibit their utility function, supposing they have one, nor can they choose it.

As I said, I concede that I would pay \$100 for that probability of that result, if I cared enough about that result, but my best estimate of how much I care about that probability of that result is "too little to consider." And I think that is currently the same for every other human being.

(Also, you consistently seem to be implying that "entertaining enormous utility" is something different from being willing to pay a meaningful price for small probability of something: but these are simply identical -- asking whether I might objectively accept an enormous utility assignment is just the same thing as asking whether there might be some principles which would cause me to pay the price for the small probability.)

I have the following obvious question. Suppose, for example, we create an AI which ignores risks below 0.0001 probability and we start to use it as the mind of a self-driving car. Such an AI will often pass other cars in the oncoming lane (or ignore false positives about pedestrians, if we want a more realistic example), but after 10,000 passes it will have a horrific accident, which will happen after just a few years of driving. So it will be unsafe, but not in the same way as a paperclipper. What could be done about it?
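As a back-of-the-envelope check on that figure, assume (an idealisation not stated in the comment) that each manoeuvre independently carries exactly the threshold risk p = 0.0001. Then over n = 10,000 manoeuvres one accident is expected on average, and at least one is likely but not certain:

$$
\begin{aligned}
E[\text{accidents}] &= np = 10^{4} \times 10^{-4} = 1,\\
P(\text{at least one accident}) &= 1 - (1 - p)^{n} = 1 - (1 - 10^{-4})^{10^{4}} \approx 1 - e^{-1} \approx 0.63.
\end{aligned}
$$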

It seems this type of AI might cheat and set ε=0 using the following method:

1. create an "insurance company" subagent (see below for what it is and why creating it has a utility that is >ε).
2. give the insurance company a few utilons (paper clips) to hold as hostages.
3. set a rule that all future subagents will be able to communicate with the insurance company.
4. the insurance company will "buy" all the reward that has a probability < ε from subagents at roughly its fair price (its utility).
5. all subagents won't ignore events that happen with probability < ε, because their very existence has a probability of 1 of increasing their utility (via the insurance company).

Alternatively, for any ε you set, it will be profitable for the AI to create a new version of itself with the same utility function and ε'=ε/2, then give the new AI all the resources it has and commit suicide.

> Alternatively, for any ε you set, it will be profitable for the AI to create a new version of itself with the same utility function and ε'=ε/2, then give the new AI all the resources it has and commit suicide.

This doesn't seem true. A ε/2 AI will take risks, looking for higher utility, that the ε AI wouldn't.