Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Part of the problem with a reduced impact AI is that it will, by definition, only have a reduced impact.

Some of the designs try to get around the problem by allowing a special "output channel" on which impact can be large. But that feels like cheating. Here is a design that accomplishes the same without using that kind of hack.

Imagine there is an asteroid that will hit the Earth, and we have a laser that could destroy it. But we need to aim the laser properly, so we need coordinates. There is a reduced impact AI that is motivated to give the coordinates correctly, but also motivated to have reduced impact - and saving the planet from an asteroid with certainty is not reduced impact.

Now imagine that instead there are two AIs, X and Y. By abuse of notation, let ¬X refer to the event that the output signal from X is scrambled away from the original output.

Then we ask X to give us the x-coordinates for the laser, under the assumption of ¬Y (that AI Y's signal will be scrambled). Similarly, we ask Y to give us the y-coordinates of the laser, under the assumption ¬X.

Then X will reason "since ¬Y, the laser will certainly miss its target, as the y-coordinates will be wrong. Therefore it is reduced impact to output the correct x-coordinates, so I shall." Similarly, Y will output the right y-coordinates, the laser will fire and destroy the asteroid, having a huge impact, hooray!
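To make the reasoning concrete, here is a minimal toy sketch of the two-AI setup. Everything in it is an illustrative assumption rather than part of the design itself: the discretised `GRID`, the `TARGET` coordinates, and the helper names are hypothetical, and "scrambled" is modelled simply as "any value other than the original output". It only checks that each AI, reasoning under its counterfactual assumption, expects the laser to miss, while in reality the combined outputs hit.

```python
GRID = range(10)   # hypothetical discretised coordinate values
TARGET = (3, 7)    # hypothetical true (x, y) coordinates of the asteroid


def laser_hits(x, y):
    """The laser destroys the asteroid only if both coordinates are right."""
    return (x, y) == TARGET


def scrambled(value):
    """All values a scrambled signal could take: anything except the original."""
    return [v for v in GRID if v != value]


def hit_probability_assuming_other_scrambled(own_axis, own_output):
    """Chance of a hit as judged by one AI under its counterfactual assumption.

    own_axis is 0 for X (which assumes not-Y) and 1 for Y (which assumes not-X).
    The other AI's correct coordinate is replaced by each possible scrambled
    value in turn, all treated as equally likely (an assumption of this toy).
    """
    other_axis = 1 - own_axis
    outcomes = []
    for other in scrambled(TARGET[other_axis]):
        coords = [None, None]
        coords[own_axis] = own_output
        coords[other_axis] = other
        outcomes.append(laser_hits(*coords))
    return sum(outcomes) / len(outcomes)


# What each AI expects when it outputs its correct coordinate.
x_belief = hit_probability_assuming_other_scrambled(0, TARGET[0])
y_belief = hit_probability_assuming_other_scrambled(1, TARGET[1])

# What actually happens when neither signal is scrambled.
actual_hit = laser_hits(TARGET[0], TARGET[1])

print(x_belief, y_belief, actual_hit)
```

In this toy model each AI assigns zero probability to a hit, so outputting the correct coordinate looks like reduced impact from its own point of view, even though the two outputs together destroy the asteroid.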

The approach is not fully general yet, because we can have "subagent problems". X could create an agent that behaves nicely given ¬Y (the assumption it was given), but completely crazily given Y (the reality). But it shows how we could get high impact from slight tweaks to reduced impact.


This generalizes nicely. The asteroid problem provides a nice partitioning into two pieces, such that either piece alone has no effect, but the two pieces together have an effect. But most problems won't have such a partition built in.

If we want the answer to a yes/no question, the first instinct would be that no such partitioning is possible: if two AIs each provide less than 1 bit of information, then combining them won't produce a reliable answer. But we can make it work by combining the yes/no question with some other problem, as follows.

Suppose you want the answer to a question Q, which is a yes-or-no question. Then pick a hard problem H, which is an inconsequential yes-or-no question that AIs can solve reliably, but which humans can't, and for which P(H)=0.5. Take two AIs X and Y. The first AI outputs X=xor(Q,H), and believes that the second AI will output a coin flip. The second AI outputs Y=H, and believes that the first AI will output a coin flip. Then the answer can be obtained by combining the two outputs, xor(X,Y).
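A quick sketch can check the two properties this construction relies on. It assumes, beyond what is stated above, that H is independent of Q; the function names are hypothetical. It enumerates all truth values to confirm that xor(X,Y) always recovers Q, and that X's output alone is a fair coin regardless of Q when P(H)=0.5.

```python
from itertools import product


def x_output(q, h):
    """First AI: outputs xor(Q, H)."""
    return q ^ h


def y_output(h):
    """Second AI: outputs H."""
    return h


def combine(x, y):
    """The humans combine the two outputs with xor."""
    return x ^ y


# Property 1: combining both outputs always recovers Q.
recovered = all(
    combine(x_output(q, h), y_output(h)) == q
    for q, h in product([False, True], repeat=2)
)


def prob_x_true_given_q(q):
    """P(X = True | Q = q), averaging over H with P(H) = 0.5."""
    return sum(x_output(q, h) for h in (False, True)) / 2


# Property 2: X alone carries no information about Q.
x_alone_uninformative = prob_x_true_given_q(True) == prob_x_true_given_q(False) == 0.5

print(recovered, x_alone_uninformative)
```

Y's output is just H, so under the independence assumption it says nothing about Q either; only the pair of outputs does.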

Interesting generalisation.

The next step is to allow more interaction between AI and world, while still minimising impact safely…
