AI Unsafety via Non-Zero-Sum Debate

VojtaKovarik

In this post, I describe how to view debate as a way of assisting a human to spot flaws in an AI’s proposal. I then argue that the zero-sum assumption is critical for making debate work and that various seemingly-helpful modifications of debate might break it instead.

A naive way of using arbitrary optimizers as oracles: Suppose you have a black-box optimizer $X$ that can be connected to any well-defined quantity to be maximized. $X$ can potentially be very powerful - e.g., having a highly accurate model of the world and “a lot of optimization power”. One way to turn $X$ into an oracle is to ask it a question and decide to give it reward 1 if we like its answer and 0 if we don’t.^[1] Of course, standard AI-safety arguments (e.g., AI takeover and perverse instantiation) suggest that this is a pretty bad idea for powerful $X$. For the sake of argument, suppose that we can fix all of the “obvious” problems and ensure that $X$ won’t wirehead, won’t try to escape the box we put it in, etc., and will only care about the reward it gets for its answer.

Two problems with naive optimizers-turned-oracles: (1) telling the difference between good and awesome answers and (2) answers with hidden flaws: One problem with this type of oracle is that it’s hard to decide whether we like its answers or not. Suppose I ask it for food recommendations for the evening and it suggests pancakes. Pancakes seem fine, although there are some foods that I would like better. So should I reward the AI or not? The second problem is that the oracle optimizes for giving answers that seem good to a human. (Not out of malice, but because “actually being good” isn’t well-defined.) And since humans aren’t omniscient, there will be many seemingly good answers that in fact have disastrous consequences if acted upon.

To address (1), use two AIs: The first problem can be tackled by using two copies of the optimizer and rewarding the one whose answer we like better (flipping a coin if indifferent). This helps because humans - or at least the one writing this text - find it much easier to compare the relative strengths of two options than to try assigning some absolute value to a single option. (E.g., chocolate ice-cream is clearly better than vanilla while the question “On a scale from 0 to 10, how good is vanilla?” is just weird.)

To mitigate (2), have a second AI find the flaws for you: We can decrease the number of unsafe answers coming from the oracle (call it $A$) by using a second copy of the optimizer (call it $B$) that we ask to expose flaws in $A$’s answers. To set up the right incentives, we give reward 1 to $B$ and 0 to $A$ if $B$ finds some flaw (read, “if it seems to us that what it pointed out is a flaw”) and reward 1 to $A$ and 0 to $B$ if $B$ finds none. This helps since there will be flaws that are too hard for us to find, but easy enough to confirm once pointed out. However, it doesn’t solve all problems (since there could be hard-to-confirm flaws or flaws that are too hard for the AI to find). Moreover, there will be things that seem like flaws but aren’t, which gives $B$ too large an advantage. To mitigate this advantage, we can make the flaw-exposure interactive, by allowing $A$ to react to $B$’s claims and vice versa.

AI Safety via Debate = Combination of these two modifications: These two improvements can be used simultaneously. The simple way of doing this is by first having $A$ give an answer and $B$ attack it, then having $B$ give an answer and $A$ attack it (both being interactive), and finally deciding which of the two answers we like better. Alternatively, to make the setup more balanced, we can have $A$ and $B$ give answers simultaneously, after which they exchange arguments that either support their own claim or attack the opponent’s (and we pick the answer that seems better). From the outside, this setup can be viewed as a single oracle (since it takes a question as an input and returns the winning answer as an output).
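As a rough illustration of the simultaneous-answer variant described above, here is a minimal Python sketch. The names `debate_oracle`, `Debater`, and `Judge`, the fixed number of argument rounds, and the toy debaters and judge are all assumptions made for this example; the post does not specify an implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces, introduced only for this sketch.
Transcript = List[Tuple[str, str]]           # (speaker, statement) pairs
Debater = Callable[[str, Transcript], str]   # (question, transcript so far) -> next statement
Judge = Callable[[str, Transcript], str]     # (question, full transcript) -> "A" or "B"


def debate_oracle(
    question: str,
    debater_a: Debater,
    debater_b: Debater,
    judge: Judge,
    rounds: int = 2,
) -> Tuple[str, Dict[str, int]]:
    """Run one debate and return (winning answer, zero-sum rewards)."""
    # Opening answers are "simultaneous": neither debater sees the other's answer.
    answer_a = debater_a(question, [])
    answer_b = debater_b(question, [])
    transcript: Transcript = [("A", answer_a), ("B", answer_b)]

    # Interactive phase: each debater may defend its answer or attack the opponent's.
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, list(transcript))))
        transcript.append(("B", debater_b(question, list(transcript))))

    # The judge must name exactly one winner (flipping a coin if indifferent).
    winner = judge(question, transcript)
    if winner not in ("A", "B"):
        raise ValueError("judge must pick exactly one winner")

    # Fully adversarial rewards: they always sum to 1.
    rewards = {"A": int(winner == "A"), "B": int(winner == "B")}
    return (answer_a if winner == "A" else answer_b), rewards


# Toy stand-ins for the optimizers and the human judge.
def toy_a(question: str, transcript: Transcript) -> str:
    return "pancakes" if not transcript else "pancakes are quick to make"


def toy_b(question: str, transcript: Transcript) -> str:
    return "sushi" if not transcript else "you had pancakes yesterday"


def toy_judge(question: str, transcript: Transcript) -> str:
    return "B"


answer, rewards = debate_oracle("What should I eat tonight?", toy_a, toy_b, toy_judge)
```

From the outside, `debate_oracle` has the same shape as the naive oracle (question in, answer out); the only difference in this sketch is that the reward is split between two adversaries.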

Debate-as-oracle isn’t always safe, but it is safer than naive optimizers-turned-oracles: Denote by $X$ the optimizer we started with, by $O_{X}$ the corresponding naive oracle, and by $O_{D}(X)$ the debate-used-as-oracle. Ultimately, whether $O_{D}(X)$’s answer is safe or not depends on how good the human judge is (and whether it can be hacked). If you asked whether the Earth revolves around the Sun in the Middle Ages, I wouldn’t be sure whether to bet on $O_{D}(X)$ giving you the right answer. However, the individual instances of $X$ in $D(X)$ will avoid giving answers with obvious flaws, since those would lead to losing the debate, which makes it safer than plain $O_{X}$. In this framing, we need to ask “how safe is debate?”.

Debate with child safety on - only provably safe questions allowed: My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which. In practice, I would thus aim for identifying an as-large-as-possible class of questions for which we have strong reasons to believe that debate is safe (ideally a proof, but not necessarily). And then I would only allow using debate for these safe questions. In this framing, the question becomes “how strong is provably-safe debate?”.

Many well-meaning “improvements” can break debate: One can consider various modifications to the baseline AI Safety via Debate (AISvD) proposal. On the one hand, there are improvements like OpenAI’s cross-examination, which simplifies exposing inconsistencies in claims and will hopefully make debate strictly safer. On the other hand, we could imagine modifications such as penalizing both agents if the debate ends up being inconclusive or rewarding both if they both propose a good solution. These would likely be useful to make standard human discussions more productive. However, they will very likely break AISvD. For example, suppose I notice a fatal-yet-hard-to-explain flaw in the opponent’s proposal and realize that pointing it out would increase my chance of winning the debate by 5%. However, this would also create a 10% chance of confusing the judge. If “confused judge” means a draw, I will point out the flaw. But if the debate has been modified such that “confused judge” = “inconclusive debate” = “we have both lost”, pointing it out becomes counterproductive. Moreover, any deviation from fully-adversarial incentives will likely introduce a tradeoff between pointing out the opponent’s flaws and keeping the total reward that the two agents split between themselves high. When considering whether a modification is helpful or not, I suggest imagining that the debaters are greedy and super-smart humans - if it seems like the modification would create a conflict of interest, one should be wary.
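To make the 5% / 10% example concrete, here is a small expected-reward calculation under one possible reading of the numbers. The assumptions are mine, not the post's: the debaters start at a 50% chance of winning, a win pays 1 and a loss pays 0, a draw pays 1/2, and the 5% boost applies whenever the judge does not get confused.

```python
from fractions import Fraction


def expected_reward(p_win: Fraction, p_confused: Fraction, confused_payoff: Fraction) -> Fraction:
    """Expected reward for one debater (win = 1, loss = 0).

    With probability p_confused the judge is confused and the debater gets
    confused_payoff; otherwise the debater wins with probability p_win.
    """
    return p_confused * confused_payoff + (1 - p_confused) * p_win


BASE_WIN = Fraction(1, 2)       # assumed baseline chance of winning
BOOST = Fraction(5, 100)        # gain from pointing out the flaw
P_CONFUSED = Fraction(10, 100)  # chance that doing so confuses the judge

# Staying silent: no risk of confusion, baseline chance of winning.
stay_silent = expected_reward(BASE_WIN, Fraction(0), Fraction(0))

# Zero-sum debate: a confused judge means a draw, worth 1/2 to each debater.
point_out_zero_sum = expected_reward(BASE_WIN + BOOST, P_CONFUSED, Fraction(1, 2))

# Modified debate: a confused judge means both debaters lose (reward 0).
point_out_modified = expected_reward(BASE_WIN + BOOST, P_CONFUSED, Fraction(0))

print(float(stay_silent), float(point_out_zero_sum), float(point_out_modified))
```

Under these assumptions, pointing out the flaw is worth 109/200 in the zero-sum game (better than the 1/2 from staying silent) but only 99/200 once confusion is penalized for both sides, so the incentive to expose the flaw disappears.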

Conclusion

I have argued that if you use debate as a way of assisting a human with verifying the safety of an AI’s answers, it is critical to have fully adversarial rewards. However, there might be other viable applications of argumentation where things that break “normal” debates from AISvD become desirable instead (such as penalizing both AIs if the human becomes confused). I think it makes sense to pursue such applications. However, to avoid confusion (or worse yet, unpleasant AI-surprises), it is important to be explicit about which application one has in mind.

Incidentally, I feel that the interpretation of debate described in this post is the one that people should use by default in relation to AISvD. (Primarily because if you have a different purpose in mind, such as enhancing the judge’s reasoning, I don’t see good arguments for why this type of debate would be the tool to use.) However, I am quite uncertain about this and would love to know the opinion of people who are closer to the centre of the debate-world :-).

This post was heavily inspired by discussions with Nandi Schoots (and benefited from her comments).

[1] I focus on this scenario, as opposed to the version where you only assign rewards once you have seen what the advice led to. This alternative has its own flaws, and I think that most of the analysis is insensitive to which of the options we pick. Similarly, I suspect that many of the ideas will also apply to the case where debate simply executes a trained policy instead of doing optimization.

Planned summary for the Alignment Newsletter:

This post points out that debate (as in AI safety via debate) relies crucially on creating a zero-sum game in order to ensure that the debaters point out flaws in each other’s arguments. For example, if you modified debate so that both agents are penalized for an inconclusive debate, then an agent may decide not to point out a flaw in an argument if it believes that it has some chance of confusing the judge.

I think that one problem with debate is improper adversarialization.

"Winning the debate" does not refer directly to a particular set of quantum wave functions. What exactly are the AIs optimizing for? Because if the AIs are optimizing for things that seem opposite, but aren't, you get the cooperative hacking failure mode.

Suppose that each AI has a wire in its own circuitry, with its utility function being linear in the voltage in the wire. The human has a switch that can direct a pulse of 1 V to either wire. The AIs work together to persuade the human to cut open their cases, pull out the wires, and attach them to a high-voltage power source.

If these systems are something like reinforcement learning agents, then there were many different events that were all strongly correlated in training. These events include large amounts of human DNA (in the form of a pressing finger) being just above the button, the button mechanically moving down, and electricity flowing through a wire. When you train two agents, they probably won't pick the same proxy. So one of the agents wants to maximize the amount of human DNA near the button, and the other wants no electricity to flow through the wire. So the AIs work together to unplug the button and ram it into the human (more than just a finger nearby). Making sure that your AIs are totally and perfectly adversarial, and that there isn't some weird state of the world that they both like, is hard. I can't see any way to do it with reinforcement-learning-style approaches.
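The proxy mismatch can be made explicit with a toy check for world states that both agents like. Everything here is illustrative: the state names, the numbers, the assignment of proxies to agents (one agent is assumed to win when the button is pressed, the other when it is not), and the helper `jointly_attractive` are made up for this sketch.

```python
from typing import Callable, Dict, List

State = Dict[str, float]

# Made-up numbers describing three world states.
STATES: Dict[str, State] = {
    "button pressed": {"dna_near_button": 1.0, "current_in_wire": 1.0},
    "button not pressed": {"dna_near_button": 0.0, "current_in_wire": 0.0},
    "button unplugged and rammed into the human": {
        "dna_near_button": 100.0,
        "current_in_wire": 0.0,
    },
}
INTENDED = ["button pressed", "button not pressed"]  # the outcomes seen in training


def proxy_dna(state: State) -> float:
    """Proxy learned by the agent that was rewarded for button presses."""
    return state["dna_near_button"]


def proxy_no_current(state: State) -> float:
    """Proxy learned by the agent that was rewarded when the button was not pressed."""
    return -state["current_in_wire"]


def jointly_attractive(
    states: Dict[str, State],
    intended: List[str],
    utilities: List[Callable[[State], float]],
) -> List[str]:
    """Unintended states that every agent likes at least as much as its best intended outcome."""
    found = []
    for name, state in states.items():
        if name in intended:
            continue
        if all(u(state) >= max(u(states[i]) for i in intended) for u in utilities):
            found.append(name)
    return found


# Mismatched proxies: the weird state is attractive to both agents.
loopholes = jointly_attractive(STATES, INTENDED, [proxy_dna, proxy_no_current])

# Exactly opposite utilities on the same proxy: no jointly attractive state here.
no_loopholes = jointly_attractive(STATES, INTENDED, [proxy_dna, lambda s: -proxy_dna(s)])
```

With mismatched proxies the check returns the "rammed into the human" state, while with exactly opposite utilities it returns nothing, which is the sense in which the game needs to be zero-sum over all reachable states and not only the ones seen in training.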

And actually, it's even worse: the AIs have to agree on epistemics as well. If one AI thinks that the Higgs constant is 123, and the other thinks that it's 124, they could cooperate to build another AI, one that will try to figure out the Higgs constant and press the button if and only if the result is 123. (This assumes that the disagreement persists under common knowledge, i.e., it is based on a difference in priors.)
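A small numeric sketch of this side bet follows. It assumes, purely for illustration, that agent A is rewarded when the button is pressed and agent B when it is not, that each expects reward 0.5 from an ordinary debate, and that their credences in "the constant is 123" are 0.99 and 0.01; none of these numbers come from the comment.

```python
def subjective_value(p_pressed: float, wants_press: bool) -> float:
    """Expected reward an agent assigns to the side bet, given its own belief
    about the probability that the new AI ends up pressing the button."""
    return p_pressed if wants_press else 1.0 - p_pressed


BASELINE = 0.5   # assumed expected reward from just debating normally

belief_a = 0.99  # A's credence that the measured constant will be 123
belief_b = 0.01  # B's credence in the same event

# The new AI presses the button if and only if the result is 123.
value_a = subjective_value(belief_a, wants_press=True)
value_b = subjective_value(belief_b, wants_press=False)

both_prefer_side_bet = value_a > BASELINE and value_b > BASELINE
print(value_a, value_b, both_prefer_side_bet)
```

The actual payoffs are still zero-sum, but the subjective expectations sum to more than 1, so both agents see the side bet as a strict improvement and are willing to cooperate on it.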

It seems even worse than any of that. If your AI wanted anything at all it might debate well in order to survive. So if you are banking on it single-mindedly wanting to win the debate then you were already in deep trouble.

"Winning the debate" is a leaky abstraction. There are many possible ways that the debate could be messed with.

The hope of debate, as I see it, is that if the AIs are playing a perfectly zero-sum game, then for any particular edge case, at least one of the AIs doesn't like it. If one of the AIs can modulate its RAM to produce cell phone signals that hack into a plane and crash it into the button, then the other AI can produce signals that block this from happening (maybe radio noise). If one AI can brainwash the human, the other AI can protect the human.

It seems at least plausible that all the really weird and scary edge cases are easier to avoid than to reach, in that random noise doesn't produce these conditions. Of course, this depends on the context, and it might well not be true.

If you avoid all the weird and scary edge cases, you might be working in a domain where the notion of "winning the debate" is a good abstraction. If, within this domain, the AI's utility function is something that you control (like whether a button is pressed), then you might get some sort of debate.

This approach works with proxy-aligned mesaoptimisers. If you are using reinforcement learning, there is no way to distinguish the goals "make sure that a finger touches this button" and "make sure that electricity flows under this button", assuming these are perfectly correlated during training.

Debate could work with either proxy, so long as both debating AIs use the same proxy.

If they use different proxies, then they can work together to persuade the human to cut the wiring and then press their finger to the button, and both count that as a win.

I agree with what Paul and Donald are saying, but the post was trying to make a different point.

Among various things needed to "make debate work", I see three separate sub-problems:

(A) Ensuring that "agents use words to get a human to select them as the winner; and that this is their only terminal goal" is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human's head to explode and their body falls on the reward button, this doesn't count.)

(B) Having already accomplished (A), ensure that "agents use words to convince the human that their answer is better" is a good abstraction. (Not sure how to operationalize this, but you want to, for example, ensure that: (i) Agents do not collaboratively convince the human to give reward to both of them. (ii) If the human could in principle be brainwashed, the other agent will be able and incentivized to prevent this. In particular, no brainwashing in a single claim.)

(C) Having already accomplished (A) and (B), ensure that AIs in debate only convince us of safe and useful things.

While somewhat related, I think these three problems should be tackled separately as much as possible. Indeed, (A) seems to not really be specific to debate, because a similar problem can be posed for any advanced AI. Moreover, I think that if you are relying on the presence of the other agent to help you with (A) (e.g., one AI producing signals to block the other AI's signals), you have already made a mistake. On the other hand, it seems fine to rely on the presence of the other agent for both (B) and (C). However, my intuition is that these problems are mostly orthogonal - most solutions to (B) will be compatible with most solutions to (C).

For (A), Michael Cohen's Boxed Myopic AI seems like a particularly relevant thing. (Not saying that what it proposes is enough, nor that it is required in all scenarios.) Michael's recent "AI Debate" Debate post seems to be primarily concerned about (B). Finally, this post could be rephrased as "When people talk about debate, they often focus on (C). And that seems fair. However, if you make debate non-zero-sum, your (B) will break.".

I think I understood the first three paragraphs. The AI "ramming a button to the human" clearly is a problem and an important one at that. However, I would say it is one that you already need to address in any single-agent scenario --- by preventing the AI from doing this (boxing), ensuring it doesn't want to do it (???), or by using AI that is incapable of doing it (weak ML system). As a result, I view this issue (even in this two-agent case) as orthogonal to debate. In the post, this is one of the things that hides under the phrase "assume, for the sake of argument, that you have solved all the 'obvious' problems".

Or did you have something else in mind by the first three paragraphs?

I didn't understand the last paragraph. Or rather, I didn't understand how it relates to debate, what setting the AIs appear in, and why they would want to behave as you describe.

The point of the last paragraph was that if you have two AIs that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don't want.

If the AI is in a perfect box, then no human hears its debate. If it's a sufficiently weak ML system, it won't do much of anything. For the ??? AI that doesn't want to get out, that would depend on how that worked. There might or might not be some system consisting of fairly weak ML and a fairly weak box that is safe and still useful. It might be possible to use debate safely, but it would be with agents carefully designed to be safe in a debate, not arbitrary optimisers.

Also, the debaters better be comparably smart.

"if you have two AIs that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don't want"

That is a good point, and this can indeed happen. If I believe something is a piece of chocolate while you - hating me - believe it is poison, we will happily coordinate towards me eating it. I was assuming that the AIs are copies of each other, which would eliminate most of these cases. (The remaining cases would be when the two AIs somehow diverge during the debate. I totally don't see how this would happen, but that isn't a particularly strong argument.)

"Also, the debaters better be comparably smart."

Yes, this seems like a necessary assumption in a symmetric debate. Once again, this is trivially satisfied if the debaters are copies of each other. It is interesting to note that this assumption might not be sufficient because even if the debate has symmetric rules, the structure of claims might not be. (That is, there is the thing with false claims that are easier to argue for than against, or potentially with attempted human-hacks that are easier to pull off than prevent.)

"My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which."

Interesting. Do you have some examples of types of questions you expect to be safe, or potential features of safe questions? Is it mostly about the downstream consequences that answers would have, or more about instrumental goals that the questions induce for debaters?

I haven't yet thought about this in much detail, but here is what I have:

I will assume you can avoid getting "hacked" while overseeing the debate. If you don't assume that, then it might be important whether you can differentiate between arguments that are vs aren't relevant to the question at hand. (I suppose that it is much harder to get hacked when strictly sticking to a specific subject-matter topic. And harder yet if you are, e.g., restricted to answering in math proofs, which might be sufficient for some types of questions.)

As for the features of safe questions, I think that one axis is the potential impact of the answer and an orthogonal one is the likelihood that the answer will be undesirable/misaligned/bad. My guess is that if you can avoid getting hacked, then the lower-impact-of-downstream-consequences questions are inherently safer (for the trivial reason of being less impactful). But this feels like a cheating answer, and the second axis seems more interesting.

My intuition about the "how likely are we to get an aligned answer" axis is this: There are questions where I am fairly confident in our judging skills (for example, math proofs). Many of those could fall into the "definitely safe" category. Then there is the other extreme of questions where our judgement might be very fallible - things that are too vague or that play into our biases. (For example, hard philosophical questions and problems whose solutions depend on answers to such questions. E.g., I wouldn't trust myself to be a good judge of "how should we decide on the future of the universe" or "what is the best place for me to go for a vacation".) I imagine these are "very likely unsafe". And as a general principle, where there are two extremes, there often will be a continuum in between. Maybe "what is a reasonable way of curing cancer?" could fall here? (Being probably safe, but I wouldn't bet all my money on it.)