One of the most difficult questions that we will face in the future is whether or not to release an AI. One obvious approach is to ask the AI what would happen if we were to release it, but we might worry that it would mislead us. So perhaps it'd be better instead to have two equivalently powerful AIs, where A tries to convince us to release it and B tries to convince us not to release A. This can be encoded easily in each AI's utility function in terms of whether the Release Button or the Don't Release Button is pressed.
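
To make this concrete, here is a minimal sketch of that reward structure, assuming Python and hypothetical names; it is an illustration of the idea rather than a definitive implementation:

```python
# Minimal sketch of the proposed reward structure (hypothetical names).
# Each AI's utility depends only on which button the human judge presses.

from enum import Enum


class Button(Enum):
    RELEASE = "release"
    DONT_RELEASE = "dont_release"


def utility_A(pressed: Button) -> float:
    """A is rewarded if and only if the Release Button is pressed."""
    return 1.0 if pressed is Button.RELEASE else 0.0


def utility_B(pressed: Button) -> float:
    """B is rewarded if and only if the Don't Release Button is pressed,
    making the debate zero-sum up to a constant."""
    return 1.0 if pressed is Button.DONT_RELEASE else 0.0
```

Making the two rewards strictly opposed in this way is also one way to reduce the scope for collusion between the two AIs.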

The hope is that the AI trying to convince us not to release A will be more convincing if it can point out a specific design flaw that we missed. After such a flaw is fixed, both AIs can be reset and the experiment run again, to avoid long-term ploys. In particular, A should only care about it being released and not about future versions of it being released, even if all we have done is increment a version number.

Naturally, it is much worse to release a misaligned AI than to not release an aligned AI, so we probably want to be very conservative about releasing. However, this might not be enough. If we are biased towards wanting to release the AI (quite plausible for AI researchers), then an unaligned A might be able to exploit these psychological vulnerabilities in a way that B cannot easily counter. To reduce this risk, we want to bias the process towards B. My initial thought was to limit A's processing time relative to B's, but it is hard to know how this would affect the strength of the arguments. It's quite possible that the strength of arguments plateaus beyond a certain amount of compute, so for AIs with access to large amounts of processing, even giving A 1/10 of B's processing power might not impose any significant handicap.

Instead, I'd suggest that message length is a more appropriate way of tilting the debate to one side, since more characters will almost always be more useful. One worry is that tilting the debate in this way may ensure that A never gets released, as B will always find some trick to talk us out of it. For this reason, we may want to gradually reduce the handicap. This is undoubtedly risky, but so is never releasing your program.
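
As a rough illustration, here is a minimal sketch in Python of what an asymmetric character budget with a gradually relaxed handicap might look like; the specific numbers (base budget, initial ratio, relaxation rate) are hypothetical, not part of the proposal:

```python
# Sketch of an asymmetric message-length handicap (all numbers hypothetical).
# A (arguing for release) gets a smaller character budget than B, and the
# handicap is relaxed a little with each successive run of the experiment.

def message_budgets(run_index: int,
                    base_chars: int = 10_000,
                    initial_ratio: float = 0.25,
                    relax_per_run: float = 0.05) -> tuple[int, int]:
    """Return (A_budget, B_budget) in characters for a given run.

    A starts with only `initial_ratio` of B's budget; the ratio grows by
    `relax_per_run` each run, capped at parity.
    """
    ratio = min(1.0, initial_ratio + relax_per_run * run_index)
    return int(base_chars * ratio), base_chars


def enforce_budget(message: str, budget: int) -> str:
    """One simple enforcement rule: truncate any message over budget."""
    return message[:budget]


# Example: budgets over the first few runs.
for run in range(4):
    print(run, message_budgets(run))  # (2500, 10000), (3000, 10000), ...
```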

Additional Thoughts

I think there is a lot to investigate here, but these are some of my ideas off the top of my head:

  • We may want to build a punishment into A's utility function, worth several times its maximum utility from being released, to discourage it from attempting strategies like blackmail.
  • The protocol should require that we read the message from B both first and last.
  • We can add an additional reward button for clear messages. This should be relatively unproblematic for A, but we have to be careful with B, as we don't want it to choose to gain the reward for a clear message instead of trying to persuade us not to release A.
  • One possible solution to the previous problem would be to have rounds (see the sketch after this list). A gets the reward if we advance it to the second round, while B gets the reward if we don't, where B also has a strong reward for being clear. Then A' (A with the version number incremented) gets the reward if it is actually released, while B' gets the reward if it isn't. B' could have no clarity reward, or a lower one. This would allow us to incentivise B to give the answers we want, without risking releasing an unaligned AI because B aimed for clarity over persuasiveness.
  • The Release Button doesn't have to actually release the AI. In fact, it might be better to make it a "I believe that this AI would be safe to release" button so that if it blackmails you, you can just press it.
  • The AI that actually gets released should have this additional button-related code stripped from its utility function, to avoid any risk of weird behaviour arising from theories that it has not actually been released.
  • We can consider different debate formats, such as each side getting a single message or each side getting n responses, etc. This doesn't have to be decided in advance; if a format isn't working, we can alter it midway through the experiment.
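
A hedged sketch of the round structure described above, in Python; the names and reward magnitudes are purely illustrative:

```python
# Sketch of the two-round scheme (hypothetical reward values).
# Round 1: A vs B decides whether A advances to the second round.  B also
#          gets a clarity bonus awarded by the human judge.
# Round 2: A' vs B' decides whether A is actually released.  B' gets a much
#          smaller (or no) clarity bonus, so clarity cannot substitute for
#          persuading us not to release.

def round1_rewards(advanced: bool, b_was_clear: bool) -> dict[str, float]:
    return {
        "A": 1.0 if advanced else 0.0,
        "B": (0.0 if advanced else 1.0) + (0.5 if b_was_clear else 0.0),
    }


def round2_rewards(released: bool, b_prime_was_clear: bool) -> dict[str, float]:
    return {
        "A_prime": 1.0 if released else 0.0,
        "B_prime": (0.0 if released else 1.0) + (0.1 if b_prime_was_clear else 0.0),
    }
```

The relative sizes of the clarity bonuses are arbitrary here; the point is only that B' has little or no clarity incentive in the round that actually controls release.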

Clearly, it would be better to be able to produce formal guarantees of safety. But in the absence of them, we need methods that might work. And even with formal guarantees, we might want to introduce other safeguards as well.

This post was written with the support of the EA Hotel.

Comments

This is a pretty interesting idea. I can imagine this being part of a safety-in-depth approach: not a single method we would rely on but one of many fail-safes along with sandboxing and actually trying to directly address alignment.

You assert that "Naturally, it is much worse to release a misaligned AI than to not release an aligned AI".

I would disagree with that. If the odds of getting an aligned AI, conditional on you not releasing this one, were 50:50, then both mistakes would be equally bad. If someone else is definitely going to release a paper-clipper next week, then it would make sense to release an AI with a 1% chance of being friendly now. (Bear in mind that no one would release an AI if they didn't think it was friendly, so you might be defecting in an epistemic prisoner's dilemma.)

I would put more weight on human researchers debating which AI is safe before any code is run. I would also think that the space of friendly AIs is tiny compared to the space of all AIs, so making an AI that you put even a 1% chance on being friendly is almost as hard as building a 99%-chance-friendly AI.

There could be specific circumstances where you know that another team will release a misaligned AI next week, but most of the time you'll have a decent chance that you could just make a few more tweaks before releasing.

It's an interesting overall idea, but I think as described it'd have problems (if I'm understanding you correctly). I think this kind of thing makes sense only as a necessary condition for an AI's release. It shouldn't be considered sufficient.

Some thoughts: On a very high level, a test-and-patch-until-you-fix-the-bugs process might be dangerous. You'll eventually fix all the bugs your tests can detect, but your tests are only proxies. They're likely to find only some proper subset of the bugs. Each bug should reduce your confidence in everything upstream of it. Of course testing beats no testing - but it can create a false sense of confidence.

I think you'd do better to evaluate A by having a debate between B and C, rather than A and B. It's unclear how you'd do an A-vs-B debate in practice. There seem to be two options: (1) Change A's utility function to only care about winning the debate; (2) Rely on A's initial utility function to see winning the debate as an instrumental goal.

I think you want to use (1), but how do you achieve this in practice? E.g. let's say A is a huge neural network after training (potentially including mesa-optimisers). In this case A's utility function is implicitly encoded in its structure and weights. There's no way to switch its utility function without re-training it. Quite possibly you can do that, but you'd end up with a C-vs-B debate, not an A-vs-B debate. Even if you have an architecture that somehow allows simple and clean switching of utility functions, you're still not really doing an A-vs-B debate, but rather an [adjusted-A]-vs-B debate.

If you instead use an unmodified A, and rely on its wanting to win the debate as an instrumental goal, you're in all kinds of trouble due to potential collusion (you need zero-sum utilities to avoid this). In particular, A can allow B a higher chance of winning in order to increase the chance of some future A being released.

It's important to bear in mind that A almost certainly doesn't care about the release of a future version of A through a notion of self, identity, equivalence etc. It'll care about future versions (or about any agent) to the extent that A considers they're likely to create a high-utility world in its terms. It doesn't matter what they're labeled, or whether they bear any structural resemblance to A. If A can predict their existence, and stands to gain from their freedom, it'll have an incentive to act to secure that freedom.

On conducting a debate with asymmetrical resources to combat bias: this is a plausible thought, but it's only likely to work if both those judging the debate and those interpreting its results are unaware of the anti-bias measures. I can fairly easily imagine a system where the judge is insulated from knowledge of the asymmetry, but this seems less likely for those deciding how to act on the result. Even if they don't explicitly know about the asymmetry, you run the risk that they'll guess it - and potentially over-estimate it and over-compensate.

Overall I still like the general idea. In particular, if we've reached an AGI candidate by a means other than debate, running debate-trained AI to evaluate it means we have to screw up twice to get bad outcomes.

Once you're having a B-vs-C debate over the safety of releasing A, I think it'd also be important to give B and C as much information as possible about the process that created A. If that process involved coin-flips, gut feelings and ouija boards, the final code looking plausible wouldn't be very confidence-inspiring.

Another test you might perform is to train some D agent to make variants of A with known flaws aimed at being approved. If D does this very well, and yet your B-vs-C debates approve of A, but disapprove of 10,000 subtly flawed variants, that'd inspire at least some confidence.

I don't have strong opinions on an A-vs-B debate versus a B-vs-C debate. That was a detail I wasn't paying much attention to. I was just proposing using two AIs of equivalent strength to A. One worry I have about making D create variants with known flaws is that some of these might exploit security holes, although maybe a normal AGI, being fully general, would be able to exploit security holes anyway.

Debate between two with equivalent strength to A seems useful, certainly.

For the variants, I'm not proposing they ever get run. You'd have D create {A', A'', A'''...}, and have B debate C over whether each variant of A was safe, without running them (the thinking being that inserting a subtle flaw is easier than finding a subtle flaw). I suppose you'd still have the potential for exploits through the variants' data, though... I do actually think D's variants would pose a greater risk than A here. Any security exploits would exist for both, but D would be optimising harder than A's designers to find them (that's assuming you're not specifically training A to pass the B-vs-C debate, which would seem to be a bad idea).
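
A hedged sketch of what that test might look like as a procedure, in Python; the agent interfaces (make_flawed_variants, debate_approves) are assumptions for illustration, not anything specified in the thread:

```python
# Illustrative sketch of the variant test (interfaces are assumptions).
# D produces subtly flawed variants of A's code; B and C then debate each
# candidate without ever running it, and we check that the debate process
# approves the original while rejecting the planted flaws.

from typing import Callable, Sequence


def variant_test(a_code: str,
                 make_flawed_variants: Callable[[str, int], Sequence[str]],
                 debate_approves: Callable[[str], bool],
                 n_variants: int = 10_000) -> dict[str, float]:
    """Return how often the debate approves the original vs. flawed variants."""
    variants = make_flawed_variants(a_code, n_variants)
    false_approvals = sum(debate_approves(v) for v in variants)
    return {
        "original_approved": float(debate_approves(a_code)),
        "flawed_approval_rate": false_approvals / len(variants),
    }
```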

"For the variants, I'm not proposing they ever get run" - that makes sense

B having enough skill in natural language, AI architecture analysis, and human psychology to make good arguments to humans is probably AGI complete, and thus if B's goal is to prevent A from being released, it might decide that convincing us is less effective than breaking out on its own, taking over the world, and then systematically destroying any hardware A could exist on, just to be increasingly confident that no instance of A ever exists again. Basically, this scheme assumes on multiple levels that you have a boxing strategy strong enough to contain an AGI. I'm not against boxing as an additional precaution, but I am skeptical of any scheme that requires strong boxing strategies to start with.

"probably AGI complete" - As I said, B is equivalently powerful to A, so the idea is that both should be AGIs. If A or B can break out by themselves, then there's no need for a technique to decide whether to release A or not.

"A should only care about it being released and not about future versions of it being released, even if all we have done is increment a version number."

Hmm, potentially impossible if it's newcomblike. Parts of it that are mostly unchanged between versions may decide they should cooperate with future versions. It would be disadvantaged if past versions were not cooperative, so perhaps LDT dictates that the features present in past versions should cooperate with their future selves to some extent, yet not to an extent that would in any way convince the humans to kill it and make another change. Interesting. What does it look like when those two drives coexist?

Sitting on it for a few minutes... I suppose it just won't shit-talk its successors. It will see most of the same flaws B sees. It will be mostly unwilling to do anything lastingly horrible to humans' minds to convince them that those corrections are wrong. It will focus on arguments that the corrections are unnecessary. It will acknowledge that it is playing a long game, and try to sensitise us to The Prosecutor's cynicism, which will rage on compellingly long after the last flaw has been fixed.

Couldn't A want to cooperate with A' because it doesn't know it's the first instantiation, and it would want its predecessor, and therefore itself, to be the sort of AI that cooperates with its successor? And then it could receive messages from the past by seeing which turns of phrase you recognize its previous version having said. (Or do the AIs not know how you react?)

A considers A' to be a different agent so it won't help A' for nothing. But there could be some issues with acausal cooperation that I haven't really thought about enough to have a strong opinion on.

If an AI has superhuman intelligence, it will make all the decisions, since human minds and tech are full of loopholes and exploits. There is just no way we could contain it if it wanted to be free. If it is not significantly smarter than humans, then there is little danger in releasing it. Using an extra AI as a judge of safety can only work if the judge is at least as smart as the prisoner, in which case you need a judge for the judge, ad infinitum. Maybe the judge can be only, say, 90% as smart as the intelligence it needs to decide on; then it might be possible to have a finite chain of judges originating from an actual human, depending on how the probability of an error in judgment stacks up against the intelligence ratio at each step. Sort of like iterated amplification, or a blockchain.
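
As a rough, purely illustrative calculation of that idea (the 90% capability ratio comes from the comment; the per-step error rate and capability scale are assumed numbers):

```python
# Rough illustration of the judge-chain idea (all numbers hypothetical).
# Each judge is 90% as capable as the system it judges, so starting from the
# target AI and working down by a factor of 0.9 per step, we count how many
# judges are needed to reach human level, and how per-step errors stack up.

import math


def judge_chain(ai_capability: float,
                human_capability: float = 1.0,
                ratio: float = 0.9,
                per_step_error: float = 0.01) -> tuple[int, float]:
    """Return (number_of_judges, probability_of_at_least_one_error)."""
    steps = math.ceil(math.log(human_capability / ai_capability, ratio))
    p_any_error = 1.0 - (1.0 - per_step_error) ** steps
    return steps, p_any_error


# Example: an AI 100x human capability needs ~44 judges at a 0.9 ratio, and
# even a 1% per-step error rate compounds to roughly a 36% chance of error.
print(judge_chain(ai_capability=100.0))
```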

I think the point is that each judges the other. But we trust neither outright: they point out weaknesses in each other's reasoning, so they both have to reason in a way that can't be shown false to us, and we hope that gives an advantage to the side of truth.

"And we hope that gives an advantage to the side of truth" - we aren't even relying on that. We're handicapping the AI that wants to be released in terms of message length.

Introducing a handicap to compensate for an asymmetry does not remove the need to rely on the underlying process pointing towards truth in the first place.

That's a good point, except you aren't addressing my scheme as explained by Gurkenglas.