These may include many forms of interpersonal deception, which we should assume can be at least as effective as the most skilled human manipulators. It may conduct operations that appear good (and even are good) but also have a side effect of more easily allowing or hiding future bad actions. It may modify itself directly to ignore these reward and punishment inputs, or induce some human to do it. It may suborn the panel by various means. It may determine that a period of even severe disincentive is something that it can get through with its desired self intact. It may be able to subvert the effect of the signals, such as prearranging its mental state such that the punishment signals alter it in a long-term desirable direction (no pain no gain?)

I'm sure that an AGI with a lot more stake and capacity to think about these things could come up with better ideas.

[-]MichaelStJules4y10

At least some of these seem possible to make unlikely. If we can force the AGI to have only a few routes to hack or manipulate the panel, the signal or its effects without a dominating penalty, and strongly reinforce those, then we could avoid worst-case outcomes.

> It may conduct operations that appear good (and even are good) but also have a side effect of more easily allowing or hiding future bad actions.

It can only get away with killing everyone if it manages to hack or manipulate the panel, the signal or its effects.

> It may modify itself directly to ignore these reward and punishment inputs, or induce some human to do it. (...) It may be able to subvert the effect of the signals, such as prearranging its mental state such that the punishment signals alter it in a long-term desirable direction (no pain no gain?)

This was my main worry. Some ideas:

Can we make maximizing the value from the panel's signals its primary goal? The signals can be required inputs to the AGI by design, not just part of the reward/loss/objective function during training, and we don't need to keep updating the AGI once it's out if it actually has maximizing the signals as its primary goal. Maybe we can use digital signatures on the signal that it has to check. It would also check that it has used each signal at most X times (by maintaining a list with counts) or only for Y units of clock or processing time.
Make sure the AGI can't access its own code. Maybe it's encrypted in parts and run homomorphically (although if it's too slow, people won't do this). This doesn't seem super robust, though.
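
To make the signature-plus-count idea concrete, here is a minimal sketch, not a real design. It swaps in an HMAC from Python's standard library for a true digital signature, which is a weaker stand-in: the verifier holds the same secret as the signer, so anything that can read the key can forge signals. An asymmetric scheme, where the verifier holds only a public key, would be closer to what is described above. The names `PANEL_KEY`, `sign_signal` and `SignalGate` are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret held by the panel (assumption for this sketch).
PANEL_KEY = b"hypothetical-panel-secret"


def sign_signal(payload: bytes, key: bytes = PANEL_KEY) -> bytes:
    """Return an authentication tag for a panel signal (HMAC-SHA256 stand-in)."""
    return hmac.new(key, payload, hashlib.sha256).digest()


class SignalGate:
    """Accepts a signal only if its tag verifies and it has been used fewer than max_uses times."""

    def __init__(self, key: bytes, max_uses: int) -> None:
        self.key = key
        self.max_uses = max_uses
        self.use_counts: dict[bytes, int] = {}  # tag -> number of times accepted

    def accept(self, payload: bytes, tag: bytes) -> bool:
        expected = hmac.new(self.key, payload, hashlib.sha256).digest()
        # Constant-time comparison to avoid leaking information through timing.
        if not hmac.compare_digest(expected, tag):
            return False
        used = self.use_counts.get(tag, 0)
        if used >= self.max_uses:
            return False  # signal already used X times
        self.use_counts[tag] = used + 1
        return True
```

This only illustrates the bookkeeping. It does nothing about the harder problem raised above, that the AGI might modify the checking code itself.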

Aside: Does your response here rule out CIRL as a solution in general?

> It may suborn the panel by various means.

We can make sure to choose the panelists well and that there are enough of them, since the AGI needs a minimum number of votes from a size-limited panel (if this is the route it's taking). The AGI would need to get almost all of the panel to succeed, by design, so this would be very risky for it to attempt.
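
As a rough illustration of why a high threshold makes this risky, here is a small sketch. It assumes, unrealistically, that each panelist is compromised independently with the same probability `p`. A skilled manipulator could correlate its attempts, so treat the numbers as intuition only. The function names and parameters are hypothetical.

```python
from math import comb


def panel_approves(votes: list[bool], threshold: int) -> bool:
    """True if at least `threshold` panelists voted to approve."""
    return sum(votes) >= threshold


def prob_at_least(n: int, k: int, p: float) -> float:
    """Probability that at least k of n panelists are compromised,
    assuming each is compromised independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

For example, under these assumptions, `prob_at_least(10, 9, 0.5)` is 11/1024, a little over 1%, even when each individual attempt succeeds half the time.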

If we assume the panelists can't be hidden from the AGI anyway, we can have them meet each other in person (pairwise?) and undergo interrogations to check that they're not compromised.

> It may determine that a period of even severe disincentive is something that it can get through with its desired self intact.

Hmm, if we successfully align its goals with maximizing the signal, then this could still make sense for it to do if it expects to wirehead itself afterwards. But if it does wirehead itself, would that make it much less dangerous? I suppose it might then focus on self-preservation (and spreading?) so it could wirehead indefinitely, and accumulating large amounts of resources is still useful for that.

Maybe we can hardcode the AGI so that later rewards can't make up for earlier failures (e.g. something like what is described at the bottom of this comment), although this might make it useless, so then no one would use such a design.
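
One hypothetical way to get that property, which is my own illustration and not necessarily what the comment referred to above proposes, is to score the AGI by the running minimum of its rewards. A bad period then caps every later score. The function name is an assumption.

```python
def capped_scores(rewards: list[float]) -> list[float]:
    """Running minimum: each step's effective score is the worst reward seen so far,
    so no later reward can raise the score above an earlier failure."""
    scores = []
    worst = float("inf")
    for r in rewards:
        worst = min(worst, r)
        scores.append(worst)
    return scores
```

With rewards `[1.0, 0.2, 0.9]` the effective scores are `[1.0, 0.2, 0.2]`: the recovery to 0.9 earns nothing back. This also shows the uselessness worry, since after one failure the system has no incentive left to improve.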

Hardcode the AGI to need our approval indefinitely?
