I wouldn't be surprised if others have thought about this before and decided it doesn't work, but I haven't seen it discussed. Is there a version of this proposal that's promising? If so, what would it look like? If not, why not?

What I'm imagining is a panel of people and two signals sent to the AGI every n days: a punishment signal that is -1 by default, by design, and a reward signal that is 0 by default. If the panel votes to do so (with e.g. >90% consensus), the punishment is replaced with 0 and the reward is set to +1. The panel should commit to voting this way if and only if the AGI hasn't done anything horrible. I think this could be implemented with (something like) cooperative inverse reinforcement learning.

The AGI should be told that the punishments are being prevented manually and the rewards given manually, so that it knows that killing everyone or taking control is likely to cost it a lot unless it can hack the signals first. Furthermore, if it finds a way to disable the punishment signal without also being able to hack the reward signal, neither signal will ever take a nonzero value again.
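
To make the signal rule concrete, here is a minimal sketch of one period's signals. The function names, the `punishment_disabled` flag and the treatment of an empty vote list (no panel left to vote) are my own illustrative assumptions, not part of any worked-out design.

```python
from typing import Sequence, Tuple


def panel_signals(votes: Sequence[bool], threshold: float = 0.9) -> Tuple[int, int]:
    """Return (punishment, reward) for one period.

    Defaults are punishment = -1 and reward = 0. Only a vote share
    strictly above `threshold` flips them to 0 and +1.
    """
    if not votes:  # assumption: no votes cast means the defaults apply
        return (-1, 0)
    approval = sum(votes) / len(votes)
    if approval > threshold:
        return (0, 1)
    return (-1, 0)


def net_signal(votes: Sequence[bool], threshold: float = 0.9,
               punishment_disabled: bool = False) -> int:
    """Total signal for one period.

    `punishment_disabled` is a hypothetical flag modelling the case where
    the AGI has disabled the punishment channel but cannot forge rewards.
    """
    punishment, reward = panel_signals(votes, threshold)
    if punishment_disabled:
        punishment = 0
    return punishment + reward
```

With ten panelists, nine approvals are not enough under a strict >90% rule, and an AGI that disables the punishment channel and removes the panel ends up at 0 rather than +1.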

Here are some important details that I expect need to be fleshed out:

  1. We make sure that these signals have greater weight than everything else the AGI might care about, but not so much that it leads to paralysis or fanaticism. I had one idea here.
  2. We make sure the signal is secure and too hard for the AGI to hack or otherwise manipulate without alerting us, after which the AGI would be punished and/or stop receiving the panel's rewards.
  3. Successors for the panel must be decided on, so that this arrangement can be maintained indefinitely. We'd like to do this in a way that makes it unlikely that the AGI can get on the panel itself or put someone there to represent itself. Maybe the panelists decide successors.


One of the main concerns about AGI, and especially superintelligent AGI, is that it might indeed "hack" the control mechanisms in various ways.

These may include many forms of interpersonal deception, which we should assume can be at least as effective as that of the most skilled human manipulators. It may conduct operations that appear good (and even are good) but also have a side effect of more easily allowing or hiding future bad actions. It may modify itself directly to ignore these reward and punishment inputs, or induce some human to do it. It may suborn the panel by various means. It may determine that a period of even severe disincentive is something that it can get through with its desired self intact. It may be able to subvert the effect of the signals, for example by prearranging its mental state so that the punishment signals alter it in a direction it finds desirable in the long term (no pain, no gain?).

I'm sure that an AGI with a lot more stake and capacity to think about these things could come up with better ideas.

At least some of these seem possible to make unlikely. If we can limit the AGI to only a few routes by which it could hack or manipulate the panel, the signal or its effects without incurring a dominating penalty, and strongly harden those routes, then we could avoid worst-case outcomes.

"It may conduct operations that appear good (and even are good) but also have a side effect of more easily allowing or hiding future bad actions."

It can only get away with killing everyone if it manages to hack or manipulate the panel, the signal or its effects.

 

"It may modify itself directly to ignore these reward and punishment inputs, or induce some human to do it. (...) It may be able to subvert the effect of the signals, such as prearranging its mental state such that the punishment signals alter it in a long-term desirable direction (no pain no gain?)"

This was my main worry. Some ideas:

  1. Can we make maximizing the value of the panel's signals its primary goal? The signals can be required inputs to the AGI by design, not just part of the reward/loss/objective function during training, and we don't need to keep updating the AGI once it's deployed if maximizing the signals really is its primary goal. Maybe we can put digital signatures on each signal, which the AGI has to verify. It would also check that each signal is used at most X times (by maintaining a list with counts) or for at most Y units of (clock or processing) time.
  2. Make sure the AGI can't access its own code. Maybe it's encrypted in parts and run homomorphically (although if that's too slow, people won't do it). This doesn't seem super robust, though.
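
A rough sketch of the signature-plus-count check from idea 1, using only the Python standard library. Note that this uses HMAC (a shared-secret message authentication code) as a stand-in for a real digital signature: with HMAC the verifier holds the signing key, so an actual design would want an asymmetric scheme where the AGI only ever sees a public key. The class name, the message format and `max_uses` (the "X" above) are hypothetical.

```python
import hashlib
import hmac
from typing import Dict


def sign(key: bytes, period: int, value: int) -> bytes:
    """Tag a (period, value) pair. Stand-in for the panel's signature."""
    msg = f"{period}:{value}".encode()
    return hmac.new(key, msg, hashlib.sha256).digest()


class SignalChecker:
    """Accepts a signal only if its tag verifies and it has not been
    used more than `max_uses` times for that period."""

    def __init__(self, key: bytes, max_uses: int = 1) -> None:
        self._key = key
        self._max_uses = max_uses
        self._uses: Dict[int, int] = {}  # period -> times accepted

    def accept(self, period: int, value: int, tag: bytes) -> bool:
        expected = sign(self._key, period, value)
        # Constant-time comparison to avoid leaking tag bytes via timing.
        if not hmac.compare_digest(expected, tag):
            return False
        if self._uses.get(period, 0) >= self._max_uses:
            return False  # replay: this period's signal is already spent
        self._uses[period] = self._uses.get(period, 0) + 1
        return True
```

Binding the period number into the signed message means an old +1 cannot be replayed for a later period, and the per-period counter covers reuse within a period. None of this helps if the AGI can modify the checker itself, which is what idea 2 is trying to address.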

 

Aside: Does your response here rule out CIRL as a solution in general?

 

"It may suborn the panel by various means."

We can make sure to choose the panelists well and that there are enough of them, since the AGI needs a minimum number of votes from a size-limited panel (if this is the route it's taking). By design, the AGI would need to win over almost all of the panel to succeed, so this would be very risky for it to attempt.

If we assume the panelists can't be hidden from the AGI anyway, we can have them meet each other in person (pairwise?) and undergo interrogations to check that they're not compromised.
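
To get a feel for how much a high consensus threshold helps, here is a toy calculation. It assumes, purely for illustration, that the AGI tries to compromise every panelist, that each attempt succeeds independently with the same probability, and that uncompromised panelists never vote for the reward. Real manipulation would not be independent across panelists, so treat the numbers as an intuition pump only. The function names and parameters are hypothetical.

```python
from math import comb


def votes_needed(panel_size: int, threshold_percent: int = 90) -> int:
    """Smallest vote count strictly above threshold_percent of the panel."""
    return (threshold_percent * panel_size) // 100 + 1


def p_enough_votes(panel_size: int, p_compromise: float,
                   threshold_percent: int = 90) -> float:
    """Binomial tail: chance that at least `votes_needed` panelists
    are compromised, under the independence assumption above."""
    k = votes_needed(panel_size, threshold_percent)
    if k > panel_size:
        return 0.0
    return sum(
        comb(panel_size, i)
        * p_compromise ** i
        * (1 - p_compromise) ** (panel_size - i)
        for i in range(k, panel_size + 1)
    )
```

For a 10-person panel with a strict >90% rule, all 10 votes are needed, so even a 50% per-panelist success rate gives the AGI only a 0.5^10 (about 0.1%) chance under these assumptions.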

 

"It may determine that a period of even severe disincentive is something that it can get through with its desired self intact."

Hmm, if we successfully align its goals with maximizing the signal, then this could still make sense for it to do if it expects to wirehead itself afterwards. But if it does wirehead itself, would that make it much less dangerous? I suppose it might then focus on self-preservation (and spreading?) so it could wirehead indefinitely, and accumulating large amounts of resources would still be useful for that.

Maybe we can hardcode the AGI so that later rewards can't make up for earlier failures (e.g. something like what's described at the bottom of this comment), although this might make it useless, in which case no one would use such a design.
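
As one hypothetical way to get "later rewards can't make up for earlier failures" (not necessarily the scheme in the linked comment), the AGI's objective could stop counting rewards after the first punished period. A minimal sketch, with a made-up function name:

```python
from typing import Iterable


def capped_return(signals: Iterable[int]) -> int:
    """Sum per-period signals, except that once any period is punished
    (a negative signal), later positive signals count as 0."""
    total = 0
    failed = False
    for s in signals:
        if s < 0:
            failed = True
            total += s  # punishments always count
        elif not failed:
            total += s  # rewards count only before the first failure
    return total
```

Under a plain sum, the sequence -1, +1, +1, +1, +1 is worth 3, so enduring one punished period pays off. Under this rule it is worth -1 no matter how many rewards follow, which removes the incentive to "get through" a punishment, but also removes any incentive to behave well after a single failure.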