# 3

A putative new idea for AI control; index here.

EDIT: I feel this post is unclear, and will need to be redone again soon.

This post attempts to use the ideas developed about natural categories in order to get high impact from reduced impact AIs.

## Extending niceness/reduced impact

I recently presented the problem of extending AI "niceness" given some fact X, to niceness given ¬X, choosing X to be something pretty significant but not overwhelmingly so - the death of a president. By assumption we had a successfully programmed niceness, but no good definition (this was meant to be "reduced impact" in a slight disguise).

This problem turned out to be much harder than expected. It seems that the only way to do so is to require the AI to define values dependent on a set of various (boolean) random variables Zj that did not include X/¬X. Then as long as the random variables represented natural categories, given X, the niceness should extend.

What did we mean by natural categories? Informally, it means that X should not appear in the definitions of these random variables. For instance, nuclear war is a natural category; "nuclear war XOR X" is not. Actually defining this was quite subtle; diverting through the grue and bleen problem, it seems that we had to define how we update X and the Zj given the evidence we expected to find. This was put in equation as picking Zj's that minimize

• Variance{log[ (P(X∧Z|E)*P(¬X∧¬Z|E)) / (P(X∧¬Z|E)*P(¬X∧Z|E)) ]}

where E is the random variable denoting the evidence we expected to find. Note that if we interchange X and ¬X, the ratio inverts, the log changes sign - but this makes no difference to the variance. So we can equally well talk about extending niceness given X to ¬X, or niceness given ¬X to X.
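As a rough illustration of the quantity above, here is a minimal Python sketch. It assumes (this is not in the post) that the evidence E takes finitely many values, each with a known probability and a known joint table over X and a single boolean Z. The helper names and the numbers are hypothetical. It compares a variable Z that is independent of X under every piece of evidence with the derived variable "Z XOR X", and checks that interchanging X and ¬X leaves the variance unchanged.

```python
import math


def log_odds_ratio(joint):
    """Log of the cross-ratio for one evidence outcome.

    `joint` maps (x, z) booleans to P(X=x and Z=z | E=e).
    """
    numerator = joint[(True, True)] * joint[(False, False)]
    denominator = joint[(True, False)] * joint[(False, True)]
    return math.log(numerator / denominator)


def naturalness_variance(evidence):
    """Variance of the log cross-ratio over a list of (P(E=e), joint) pairs."""
    total = sum(weight for weight, _ in evidence)
    logs = [(weight / total, log_odds_ratio(joint)) for weight, joint in evidence]
    mean = sum(w * value for w, value in logs)
    return sum(w * (value - mean) ** 2 for w, value in logs)


def independent_joint(p_x, p_z):
    """Joint table for independent X and Z with the given marginals."""
    return {
        (x, z): (p_x if x else 1 - p_x) * (p_z if z else 1 - p_z)
        for x in (True, False)
        for z in (True, False)
    }


def xor_with_x(joint):
    """Joint table for (X, Z XOR X), built from the table for (X, Z)."""
    return {(x, z != x): p for (x, z), p in joint.items()}


def swap_x(joint):
    """Joint table with the roles of X and not-X interchanged."""
    return {(not x, z): p for (x, z), p in joint.items()}


# Hypothetical evidence: two equally likely observations that move P(Z)
# from 0.1 to 0.5 while leaving X (probability 0.9) independent of Z.
plain = [(0.5, independent_joint(0.9, 0.1)), (0.5, independent_joint(0.9, 0.5))]
xored = [(w, xor_with_x(j)) for w, j in plain]
swapped = [(w, swap_x(j)) for w, j in xored]

print("Z:", naturalness_variance(plain))
print("Z XOR X:", naturalness_variance(xored))
print("Z XOR X, with X and not-X interchanged:", naturalness_variance(swapped))
```

In this toy setup the plain variable gives a variance of essentially zero, while the XOR-ed variable gives a clearly positive one, because the evidence about Z leaks into the cross-ratio once X is mixed into the definition.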

## Perfect and imperfect extensions

The above definition would work for a "perfectly nice AI". That could be an AI that would be nice, given any combination of estimates of X and Zj. In practice, because we can't consider every edge case, we would only have an "expectedly nice AI". That means that the AI can fail to be nice in certain unusual and unlikely edge cases, in certain strange sets of values of Zj that almost never come up...

...or at least, that almost never come up, given X. Since the "expected niceness" was calibrated given X, such an expectedly nice AI may fail to be nice if ¬X results in a substantial change in the probability of the Zj (see the second failure mode in this post; some of the Zj may be so tightly coupled to the value of X that an expectedly nice AI considers them fixed, and this results in problems if ¬X happens and their values change).

One way of fixing this is to require that the "swing" of the Zj be small upon changing X to ¬X or vice versa. Something like, for all values of {aj}, the ratio P({Zj=aj} | X) / P({Zj=aj} | ¬X) is contained between 100 and 1/100. This means that a reasonably good "expected niceness" calibrated on the Zj will transfer from X to ¬X (though the error may grow). This approach has some other advantages, as we'll see in the next section.

Of course, problems arise if the Zj are defined by smashing together events relatively independent of X with some that are very strongly dependent on X, into the same variable. Thus we'll further require that Zj cannot be decomposed into natural subvariables whose values swing more under changes from X to ¬X.
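The swing bound and the decomposition worry can be sketched together in Python. This is only an illustration under stated assumptions: it looks at one boolean variable at a time rather than a full assignment {Zj=aj}, and the subvariables A and B, their probabilities, and the helper names are all hypothetical. A is a fair coin unrelated to X, B tracks X closely, and the composite Z = A XOR B passes the swing test while hiding a subvariable that swings far more.

```python
def max_swing(p_given_x, p_given_not_x):
    """Largest factor by which any value's probability changes between X and not-X.

    Both arguments map each value of a variable to its probability.
    """
    worst = 1.0
    for value in p_given_x.keys() | p_given_not_x.keys():
        px = p_given_x.get(value, 0.0)
        pnx = p_given_not_x.get(value, 0.0)
        if px == 0.0 and pnx == 0.0:
            continue  # value impossible either way, so no swing to measure
        if px == 0.0 or pnx == 0.0:
            return float("inf")  # possible in one case only: unbounded swing
        worst = max(worst, px / pnx, pnx / px)
    return worst


def bernoulli(p):
    """Distribution of a boolean variable that is True with probability p."""
    return {True: p, False: 1.0 - p}


def xor_distribution(p_a, p_b):
    """Distribution of A XOR B for independent booleans with P(A)=p_a, P(B)=p_b."""
    return bernoulli(p_a * (1.0 - p_b) + (1.0 - p_a) * p_b)


SWING_BOUND = 100.0  # the bound suggested in the text

# Hypothetical subvariables: A ignores X, B is tightly coupled to X.
b_given_x, b_given_not_x = bernoulli(0.999), bernoulli(0.001)

# Composite variable Z = A XOR B, with P(A) = 0.5 in both cases.
z_given_x = xor_distribution(0.5, 0.999)
z_given_not_x = xor_distribution(0.5, 0.001)

swing_z = max_swing(z_given_x, z_given_not_x)
swing_b = max_swing(b_given_x, b_given_not_x)

z_passes_swing_test = swing_z <= SWING_BOUND
z_hides_larger_swing = swing_b > swing_z

print("swing of Z:", swing_z, "passes bound:", z_passes_swing_test)
print("swing of B:", swing_b, "larger than Z's:", z_hides_larger_swing)
```

Here Z on its own looks harmless, since its distribution barely moves between X and ¬X, but its subvariable B breaks the bound. That is the kind of variable the no-decomposition requirement is meant to exclude. Whether A and B count as "natural" subvariables is assumed here, not derived.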

## High impact from reduced impact

With this, we are finally in a position to define a reduced impact AI design that can have a large impact. Imagine that an asteroid is coming to hit the Earth. Two AIs are charged with targeting a laser at it. The laser isn't perfectly effective, and the asteroid isn't perfectly certain to hit the Earth anyway: let's say aiming the laser correctly will reduce the odds of impact by a factor of 10 or so (well within the probability swing restrictions of the previous section).

In the original formulation of this, the AIs were tasked with outputting the x and y coordinates to aim the laser. Here, instead, they'll be tasked with constructing the motivation of a robotic subagent that will move around for a while, see things and interact, before physically aiming the laser in the right direction. This task is harder.

Then the top level AI charged with aiming the y-coordinate is given the following task:

• It is motivated to cause the correct y-coordinate to be outputted.
• If the x-coordinate AI is not turned on (call this event ¬X), it is strongly motivated to have reduced impact.
• It must phrase the robotic subagent's utility function in terms of the Zj that obey all the properties above.
• The evidence E used in the equation governing the naturalness of Zj is taken (for example) to be a random sampling of space at the time both AIs are turned on. Most importantly, this is defined to be a theoretical process that the AI cannot affect physically.

So what will the AI do? It will produce a robot that will try to aim the y-coordinate of the laser correctly, given ¬X, and not expand dangerously. However, we've set things up so that X is virtually certain, so the robot will see another, x-aiming robot wandering around. The chances of this happening, given ¬X, are minuscule, so the swing of the probability from ¬X to X is huge. So, by the restrictions we've put on the Zj, the robot will not be motivated to change its behaviour just because it sees the other robot.

And if the laser is fired, the impact on humanity will, conversely, not be that large, so well within the parameters of the Zj. Thus the "general reduced impact" portion of the robot's motivation will be preserved.

And, of course, the x-robot will have the mirror image motivations, so they will both aim correctly, it seems. This should evade the problem described here.

## Failure modes

There are three ways the approach could fail: one bad way, and two graceful ways. The bad way is if the "no decomposition" requirements (one of them is presented above, the other is in this post) fail. In this case, the niceness or reduced impact are not guaranteed to be extended to ¬X, and the AI's behaviour can become undefined. Note that this is not bad optimisation pressure, just an absence of optimisation pressure - but that could still be potentially dangerous.

The first graceful failure happens if the AI is not able to create a reduced impact sub-agent given the restrictions on the Zj above. In that case, the AI does nothing. The second graceful failure happens if the AI evades our attempts to increase its impact, given ¬X. In that case, it simply becomes a reduced impact AI that does little. Not ideal, but not deadly.

Overall status: I'm not sure the idea is sound, at least not yet. Critiques welcome.


> If the x-coordinate AI is not turned on (call this event ¬X), it is motivated to have reduced impact. This motivation is sufficiently strong that it will not want to have the correct y-coordinate outputted.

> It will produce a robot that will aim to aim the y-coordinate of the laser correctly, given ¬X, and not expand dangerously.

Aren't these bits contradictory? Wouldn't the result be not aiming?

If ¬X happened, the result would be misaiming. But since X happens (almost certainly), it aims correctly.

The programmer expects ¬X, and it must program the bot with things that are X-agnostic, so it is planning to not aim. Then, because the programmed bot can't be X-sensitive, it will act essentially as if ¬X.

If the mission is to do the grue thing given that t1 the grue thing to do would be to press the blue button but the bot presses the green button. Such a solution is not grue-friendly or blue-friendly.

I'm not sure what you're saying. The AI is programmed to be reduced impact, conditional on ¬X. If ¬X happens, then outputting the correct y coordinates is reduced impact, which it will thus do (as it is separately motivated to do that).

So, given ¬X, the AI is motivated to: a) output the correct y coordinate (or cause its subagent to do so), b) have a reduced impact overall.

The whole construction is an attempt to generalise a) and b) to X, even though they are in tension/contradiction with each other in X (because outputting the correct y coordinate will have a high impact).

> If ¬X happens, then outputting the correct y coordinates is reduced impact, which it will thus do (as it is separately motivated to do that).

> If the x-coordinate AI is not turned on (call this event ¬X), it is motivated to have reduced impact. This motivation is sufficiently strong that it will not want to have the correct y-coordinate outputted.

These bits are contradictory. One tells a story where a tie between two low-impact options is broken by an aiming instinct to aim anyway. The other says that the "sit tight" instinct will overwhelm the aiming instinct.

If you want to control what happens in X, drives that are conditioned on ¬X are irrelevant. In my understanding the attempt is to generalise the reduced impact drive by not having it conditioned on X. Then what it does in ¬X cannot be based on the fact that ¬X. But it can't deduce that aiming is low impact even in ¬X, because it must assume that the x-aiming robot could be on, and that would make it a high impact decision. It must use the same decision process in both X and ¬X, and the X decision process can't be based on what it would do if it were allowed to assume that ¬X (that is, you are not allowed to know whether the grue object is currently green or blue, and you can't decide what you would do if it were green based on what you would do if it were blue).

Indeed. I have corrected the top post. Thanks!

I'm also struggling with the above. The first quote says that with event ¬X "it will NOT want to have the correct y-coordinate outputted". The second says the opposite, the robot WILL output "the y-coordinate of the laser correctly, given ¬X".

Slider was correct - I made a mistake. The correct sentence would have been "This motivation is sufficiently strong that it will not want to have the correct y-coordinate outputted (if the correct x-coordinate were also there)", but that got too complicated, so I removed the sentence entirely.

If I have this correct, the y-coordinate AI would like to create a subagent that would do big-impact things to get the right coordinate if and only if X, and the hope is that our Z-restriction will prevent it from doing that?

If so, can it create a subagent with the goal "Create a sub-sub-agent that will do big-impact things to get the right coordinate if and only if X"? If the subagent creates such a sub-sub-agent, it will have achieved its goal equally whether X is true or not.

Except that the whole setup is designed so that "iff X" kind of statements are not possible.

But if Z is 'Created a subsubagent with source code S' for some static string S, the value of X has no bearing on the value of Z, nor is Z decomposable, regardless of what S may be.

Actually, it is decomposable in the way we're thinking of here. Let's assume that S is the code for a superintelligent AI that is motivated to cause a nuclear war iff X.

Then, in the future, if we notice a nuclear war, this is evidence for Z if X happened, but not if ¬X happened.

If that counts as a decomposition, can't we decompose 'Output the correct y-coordinate' the same way? 'The asteroid is destroyed by a laser' is evidence that the y-coordinate was correct iff X happened.

Right, I think I understand this; thanks for prodding on this issue.

You are right; "the asteroid is destroyed by a laser" is evidence the y coordinate was correct iff X happened. So that observation cannot be used as a natural category.

"The asteroid hits the Earth", however, can. This is because of "the ratio P({Zj=aj} | X) / P({Zj=aj} | ¬X) is contained between 100 and 1/100" and "let's say aiming the laser correctly will reduce the odds of impact by a factor of 10".

This is relevant because the AI is motivated to come up with a set of natural categories that allows it to successfully define reduced impact and also extend it. Therefore "The asteroid hits the Earth" is something it will choose to use, rather than "the laser hits the asteroid". So that is likely not filtered.

Where does that leave your Z "Created a subsubagent with source code S"? It will be banned only if the probability "swing" from nuclear war to non-nuclear war caused by S is more than 100.

So the real effect of all this work is to control the probability swing of higher impact events. A probability increase of 100 is not safe for things like nuclear war! It seems that we might be able to have a more complicated setup with a series of AIs that each increase the probability of some event by a small amount, but I'm not sure how to avoid them increasing the probability of Z as well.

EDIT: this is how I would handle it. The "allowed swing" of impact is reduced, to 2, say. However the outcome - deflection of the asteroid - is specifically excluded from this effect. Thus we assume that, in a week's time, someone will either "A: announce the asteroid missed us", or not do so. In our equations, we allow A to be an event excluded from the reduced impact. E.g. P({Zj=aj} | X) / P({Zj=aj} | ¬X) is not constrained between 2 and 1/2; instead, P({Zj=aj} | X, A) / P({Zj=aj} | ¬X, A) is.
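To make the conditioning on A concrete, here is a small Python sketch with made-up numbers. It assumes a single boolean Z (say, some downstream event whose probability depends only on whether the asteroid misses), takes A to be true exactly when the asteroid misses, and uses hypothetical miss probabilities of 0.95 given X and 0.5 given ¬X, matching the "factor of 10" reduction in impact odds. None of these names or values come from the thread.

```python
def make_joint(p_miss, p_z_given_miss, p_z_given_hit):
    """Joint table mapping (z, a) to P(Z=z, A=a), where A means 'asteroid missed'."""
    joint = {}
    for a, p_a in ((True, p_miss), (False, 1.0 - p_miss)):
        p_z = p_z_given_miss if a else p_z_given_hit
        joint[(True, a)] = p_a * p_z
        joint[(False, a)] = p_a * (1.0 - p_z)
    return joint


def marginal_z(joint):
    """P(Z=z), summing the joint table over both values of A."""
    out = {}
    for (z, _), p in joint.items():
        out[z] = out.get(z, 0.0) + p
    return out


def condition_on_a(joint):
    """P(Z=z | A=True), obtained by renormalising the A=True entries."""
    p_a = sum(p for (_, a), p in joint.items() if a)
    return {z: p / p_a for (z, a), p in joint.items() if a}


def within_swing(p_x, p_not_x, bound):
    """True if every value's probability ratio lies between 1/bound and bound."""
    return all(1.0 / bound <= p_x[z] / p_not_x[z] <= bound for z in p_x)


ALLOWED_SWING = 2.0  # the reduced bound suggested in the EDIT

# Hypothetical numbers: Z is likely if the asteroid misses, unlikely if it hits.
joint_given_x = make_joint(p_miss=0.95, p_z_given_miss=0.9, p_z_given_hit=0.01)
joint_given_not_x = make_joint(p_miss=0.5, p_z_given_miss=0.9, p_z_given_hit=0.01)

unconditional_ok = within_swing(
    marginal_z(joint_given_x), marginal_z(joint_given_not_x), ALLOWED_SWING
)
conditional_ok = within_swing(
    condition_on_a(joint_given_x), condition_on_a(joint_given_not_x), ALLOWED_SWING
)

print("within bound without conditioning on A:", unconditional_ok)
print("within bound given A:", conditional_ok)
```

With these numbers the unconditional ratio for Z being false falls outside the bound of 2, purely because the deflection itself changes the odds of impact, while the ratio conditional on A stays at about 1. This only shows the bookkeeping; it says nothing about whether excluding A is safe.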

Possibly... This might be the fatal flaw...

Interesting. Let me think more.