One of the most difficult questions that we will face in the future is whether or not to release an AI. One obvious approach is to ask the AI what would happen if we were to release it, but we might worry that it would mislead us. So perhaps it'd be better instead to have two equivalently powerful AIs, where A tries to convince us to releases it and B tries to convince us not to release A. This can be defined easily in a utility function as whether the Release Button or Don't Release Button is pressed.
It is hoped that the AI trying to convince us not to release A will be more convincing if it is able to point out a specific design flaw which was missed. After such a flaw is fixed, both AIs can be reset and the experiment can be run again to avoid long term ploys. In particular, A should only care about it being released and not about future versions of it being released, even if all we have done is increment a version number.
Naturally, it is much worse to release a misaligned AI than to not release an aligned AI, so we probably want to be very conservative about releasing. However, this might not be enough. If we are biased towards wanting to release the AI (quite plausible for AI researchers), then an unaligned A might be able to take advantage of these psychological vulnerabilities in a way that B might not easily be able to counter. In order to reduce the risk, we want to bias the process towards B. My initial thought was that we should limit A in processing time compared to B, but it is hard to know how this will affect the strength of the arguments. It's quite possible that the strength of arguments plateaus after a certain amount of power, so for AIs with access to large amounts of processing, even giving A 1/10 of the processing power of B might be insufficient.
Instead, I'd suggest that message length is a more appropriate way of tilting the debate to one side as more characters will almost always be more useful. One worry is that tilting the debate in this way may ensure that A never gets released as B will always find some trick to talk us out of it. For this reason, we may want to gradually reduce the handicap. This is undoubtedly risky, but so is never releasing your program.
I think there is a lot to investigate here, but these are some of my ideas off the top of my head:
- We may want to add a punishment system into A worth a number of times its maximum utility if released to discourage it from attempting strategies like blackmail
- The protocol should require that we read the message from B both first and last
- We can add an additional reward button for clear messages. This should be relatively unproblematic for A, but we have to be careful for B as we don't want it to choose to gain the reward for a clear message instead of trying to persuade us not to release it.
- One possible solution to the previous problem would be to have rounds. A gets the reward if we advance it to the second round, while B gets the reward if we don't, where B has a strong reward for being clear. Then A' (A with the version number incremented) gets the reward if it is actually released, while B' gets the reward if it isn't. B' could have no or a lower clarity reward function. This would allow us to incentivise B to give the answers we want, without risking releasing an unaligned AI due to B aiming for clarify over persuasiveness.
- The Release Button doesn't have to actually release the AI. In fact, it might be better to make it a "I believe that this AI would be safe to release" button so that if it blackmails you, you can just press it.
- The AI that actually gets released should have this additional code related to the button stripped from its utility function to avoid any risk of it causing weird behaviour from theories about it not having actually been released.
- We can consider different debate formats like both sides get a single message or both sides get n number of responses, ect. This doesn't have to be decided on in advance, if a format doesn't work we alter it midway through the experiment.
Clearly, it would be better to be able to produce formal guarantees of safety. But in the absence of them, we need methods that might work. And even with formal guarantees, we might want to introduce other safeguards as well.
This post was written with the support of the EA Hotel