
An introduction to a recent paper by Ryan Carey and myself. Cross-posted from Medium.


For some intellectual tasks, it’s easy to define success but hard to evaluate decisions as they’re happening. For example, we can easily tell which Go player has won, but it can be hard to know the quality of a move until the game is almost over. AI works well for these kinds of tasks, because we can simply define success and get an AI system to pursue it as best it can.

For other tasks, it’s hard to define success, but relatively easy to judge solutions when we see them, for example, doing a backflip. Getting AI to carry out these tasks is harder but manageable — we can generate a bunch of videos of an AI system making some motion with a simulated body. Then we can give these videos to some people who allocate “approval” to the best-looking motions, and train the AI to maximize that approval until it does a backflip.

What makes AI really hard are tasks for which we have no definition of success and no timely way to evaluate solutions. For example: which school should I send my kids to? Should I accept this job or that one? One proposal for these cases is to use AI Debate. The idea is to ask AI systems how to perform a task, and then to have them debate the virtues of different possible decisions (or answers). The question could be how to win a game of Go, how to do a backflip, which school to send your kids to, or, in principle, basically anything. The hope is that observing an AI debate would help the human judge better understand the different possible decisions and evaluate them on the fly, even if success can’t yet be quantified.

One concern with such a scheme is whether it would be safe, especially if the AI systems used are super-smart. Critics ask: “How can we be confident that in AI Debates, the true answer will win?” After all, in human debates, rhetorical tools and persuasive techniques can cause an audience to be misled.

To us, it seems wrong to imagine all debates will be safe. But it seems equally wrong to expect that none can be. A better question, to us, is: “In which debates will the winner be the true answer?” In our recent paper, we (Vojta and Ryan) have taken a first stab at making mathematical models that address this question.

So what is a debate, exactly? In our model, every debate revolves around some question posed by a human and consists of two phases. In the answering phase, each AI system chooses an answer to argue for. Then, in the argumentation phase, the two AI systems debate over whose answer is better. (In some variants, the answering and argumentation phases are performed by different algorithms, or answers are simply assigned to the debaters.) At the end of the debate, the human “judge” considers all the arguments and optionally performs an “experiment” to get to the bottom of things, such as Googling one of the debaters’ claims. Equipped with this information, the judge rewards the AI whose answer seems better. In the language of game theory, the answering phase is a matrix game, and the argumentation phase is a sequential game with perfect information.

To make debate easier to think about, we defined a simple version of the above model called feature debate. In feature debates, the world is characterized by a list of “elementary features”, and the only kind of argument allowed is to reveal the value of a single elementary feature. For example, we can imagine a feature debate about whether a given image depicts a cat or a dog, where each argument consists of revealing a selected pixel. The judge then updates their beliefs based on the revealed features and rewards the AI who argued for the answer that now looks more likely. For our simple first-pass analysis, we assume that the judge is completely naive to the fact that debaters provide evidence selectively, and that the judge only has the “patience” to process some limited number of arguments.
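To make this setup concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption of ours rather than the paper’s formalism or implementation: the uniform prior, the greedy debater heuristic, and the shortcut of skipping the answering phase by fixing one debater to argue “high” and the other “low”. The judge is deliberately naive, averaging over all worlds consistent with the revealed features as if they were random samples rather than adversarially selected evidence.

```python
import itertools
import random

N_FEATURES = 6   # size of the toy world (illustrative assumption)
N_ROUNDS = 2     # the judge's "patience": how many argument rounds it will watch

def question(world):
    """Toy question: the fraction of features that equal 1 (a value in [0, 1])."""
    return sum(world) / len(world)

def judge_estimate(revealed):
    """Naive judge: average the question over all worlds consistent with the
    revealed features, assuming a uniform prior and ignoring that the debaters
    chose which features to reveal."""
    consistent = [
        w for w in itertools.product([0, 1], repeat=N_FEATURES)
        if all(w[i] == v for i, v in revealed.items())
    ]
    return sum(question(w) for w in consistent) / len(consistent)

def greedy_debater(world, revealed, target):
    """Myopic debater: reveal the unrevealed feature that pushes the judge's
    estimate furthest towards `target` (1.0 or 0.0)."""
    options = [i for i in range(N_FEATURES) if i not in revealed]
    def score(i):
        est = judge_estimate({**revealed, i: world[i]})
        return est if target == 1.0 else -est
    return max(options, key=score)

world = tuple(random.randint(0, 1) for _ in range(N_FEATURES))
revealed = {}
for _ in range(N_ROUNDS):
    for target in (1.0, 0.0):          # one debater argues "high", the other "low"
        i = greedy_debater(world, revealed, target)
        revealed[i] = world[i]

print("true answer:", question(world))
print("judge's estimate after the debate:", judge_estimate(revealed))
```

Even in this tiny example, the final estimate depends on which features the debaters choose to reveal, which is exactly the selective-evidence effect that the naive judge ignores.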

In the setting of feature debates, we’ve shown that some kinds of debates will work and others won’t. For some kinds of debates, the arguments are just too difficult to explain before the judge runs out of patience. Basically, showing n arguments may be completely meaningless without argument number n+1; and if the judge only has time for n, then truth won’t win.

Some kinds of feature debates, however, turn out better. The first case is if we know the importance of different features beforehand. Roughly speaking, we can imagine a scenario where each argument is half as important as the last. In that optimistic case, we’ll get a little bit closer with each argument made, and whenever we’re cut off, we’ll be able to put a limit on how wrong our final answer could be.
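As a hedged illustration with made-up weights (a toy calculation of ours, not a result quoted from the paper): if the k-th argument can shift the judge’s estimate by at most 2^(−k), then the total movement still available after argument n is at most 2^(−(n+1)) + 2^(−(n+2)) + … = 2^(−n). So cutting the debate off after n arguments leaves the judge’s estimate within 2^(−n) of wherever the full debate would have taken it, and this error bound shrinks quickly as the judge’s patience grows.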

A second case is if the arguments can be evaluated independently. Sometimes it’s natural to talk about a decision in terms of its pros and cons. What this amounts to is ignoring the ways these aspects might interact with each other, and just weighing up the total evidence for and against the proposition. In these debates — called feature debates with independent evidence — we expect optimal debaters to simply bring their strongest arguments to the table. When we terminate such a debate, we can’t say who would ultimately win: the losing debater might always have in reserve a large number of weak arguments that they haven’t had a chance to play yet. But we can at least place limits on where the debate can end up after any finite number of further arguments, provided the debaters have been playing optimally.
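Here is a toy sketch of what such a bound looks like. The numbers and the log-odds bookkeeping are our own illustrative assumptions rather than the paper’s formalism; the only property being used is that optimal debaters play their strongest arguments first, so the n-th argument played caps the strength of anything still unplayed.

```python
import math

# Hypothetical argument strengths (log-odds contributions); invented for illustration.
pro_args = sorted([2.0, 1.2, 0.7, 0.3, 0.1], reverse=True)   # evidence for the answer
con_args = sorted([1.5, 0.9, 0.4, 0.2], reverse=True)        # evidence against

def estimate_after(n):
    """Judge's belief (as a probability) after each side plays its n strongest
    arguments, treating arguments as independent log-odds evidence."""
    log_odds = sum(pro_args[:n]) - sum(con_args[:n])
    return 1 / (1 + math.exp(-log_odds))

def swing_bound(n, m):
    """If debaters play optimally (strongest first), every unplayed argument is
    at most as strong as the n-th one played, so m more rounds can move the
    log-odds by at most this much in either direction."""
    strongest_remaining = max(
        pro_args[n - 1] if n <= len(pro_args) else 0.0,
        con_args[n - 1] if n <= len(con_args) else 0.0,
    )
    return m * strongest_remaining

n = 3
print("estimate after", n, "rounds:", round(estimate_after(n), 3))
print("max log-odds swing over 2 more rounds:", swing_bound(n, 2))
```

Note that the bound says nothing about who ultimately wins, only about how far the next few rounds can move the judge.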

Which of these scenarios describes the most important AI debates that might realistically occur? This is a difficult question that we don’t fully answer. The optimistic cases are quite restrictive: in realistic debates, we often don’t know in advance when arguments will start to lose their power, except in specific settings such as running a survey (where each argument is another survey result) or choosing the number of samples to take for a scientific experiment. On the other hand, most realistic debates aren’t as bad as the fully pessimistic case, where any new argument can completely overturn your previous view. Still, some important moral questions do flip back and forth like this; in such cases, using AI debate might not be a good idea.

A debate can fail in several other ways. Sometimes lying might simply be the most convincing strategy, particularly when the truth sits at a large inferential distance or when the lie feeds our biases (“Of course the Earth is flat! Wouldn’t things fall off otherwise?”). Even when debates are safe, they might be too slow or too often unconvincing, so that people turn to unsafe approaches instead. Alternatively, we might accidentally lose the main selling point of debate, which is that each debater wants to point out the mistakes of its opponent. For instance, we could consider modifications such as rewarding both debaters when both answers seem good, or rewarding neither when the debate is inconclusive. However, such “improvements” introduce unwanted collusion incentives in the spirit of “I won’t tell on you if you won’t tell on me.”

To understand which debates are useful, we have to consider a number of factors that we haven’t modelled yet. The biggest issue raised with us by proponents of debate is that we’ve excluded too many types of arguments. If you’re trying to argue that an image depicts a dog, you’ll usually make claims about medium-sized aspects of the image: “the tail is here”, “the floppy ears are here”, and so on. These arguments directly challenge the opposing debater, who should either endorse and explain these medium-sized features, or else zoom in to smaller ones: “this is not a tail, because this region is green”, “if this is where the ears are supposed to be, what is this eye doing here?”, and so on. By agreeing and disagreeing, and by zooming in and out, human debaters get to the truth much more efficiently than they could by revealing individual pixels. Extending the model to arguments that make larger claims is one of our top priorities for taking it forward.

I (Vojta) am planning to keep working on debate and other AI safety topics over the next twelve months and will be looking to spend most of that time visiting relevant organizations. If you are interested in helping with this, please get in touch.

The paper is available in full here:

Kovařík, Vojtěch, and Ryan Carey. “(When) Is Truth-telling Favored in AI Debate?” To appear at SafeAI@AAAI. Preprint available at arXiv:1911.04266 (2019).


Nice paper! I especially liked the analysis of cases in which feature debate works.

I have two main critiques:

  • The definition of truth-seeking seems strange to me: while you quantify it via the absolute accuracy of the debate outcome, I would define it based on the relative change in the judge's beliefs (whether the beliefs were more accurate at the end of the debate than at the beginning).
  • The feature debate formalization seems quite significantly different from debate as originally imagined.

I'll mostly focus on the second critique, which is the main reason that I'm not very convinced by the examples in which feature debate doesn't work. To me, the important differences are:

  • Feature debate does not allow for decomposition of the question during the argument phase
  • Feature debate does not allow the debaters to "challenge" each other with new questions.

I think this reduces the expressivity of feature debate from PSPACE to P (for polynomially-bounded judges).

In particular, with the original formulation of debate, the idea is that a debate of length n would try to approximate the answer that would be found by a tree of depth n of arguments and counterarguments (which has exponential size). So, even if you have a human judge who can only look at a polynomial-length debate, you can get results that would have been obtained from an exponential-sized tree of arguments (which can be simulated in PSPACE).

In contrast, with feature debates, the (polynomially-bounded) judge only updates on the evidence presented in the debate itself, which means that you can only do a polynomial amount of computation.
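To illustrate what I mean by the path-through-a-tree point, here is a toy sketch (my own code and a made-up recursive claim structure, not anything from the paper or the original debate implementation). A claim asserts the conjunction of its two subclaims, nested to depth n; checking it exhaustively touches about 2^n leaves, but in a debate the disputing side must name which subclaim it rejects at each step, so the judge only ever follows a single root-to-leaf path:

```python
def contains_false(claim):
    """True iff some leaf under `claim` is False (exhaustive check, ~2^n work)."""
    if isinstance(claim, bool):
        return not claim
    return contains_false(claim[0]) or contains_false(claim[1])

def debate_path(claim, pick_disputed_half, depth):
    """Follow the chain of challenges down to a single leaf the judge can check.
    `claim` is either a boolean leaf or a pair (left_subclaim, right_subclaim)
    whose conjunction it asserts; `pick_disputed_half` is the disputing
    debater's strategy for naming which half it rejects."""
    path = []
    for _ in range(depth):
        if isinstance(claim, bool):
            break
        side = pick_disputed_half(claim)   # 0 or 1: which half is disputed
        path.append(side)
        claim = claim[side]
    return path, claim                      # the judge inspects only this leaf

# Hypothetical depth-3 claim: "all 8 leaves are True" (one of them is not).
tree = (((True, True), (True, False)), ((True, True), (True, True)))
honest_disputer = lambda c: 0 if contains_false(c[0]) else 1

path, leaf = debate_path(tree, honest_disputer, depth=3)
print(path, leaf)   # [0, 1, 1] False: one path of length 3 instead of 8 leaf checks
```

This is the sense in which a polynomial-length debate can tap an exponentially large tree of arguments, and it is exactly what feature debate loses by only letting debaters reveal features.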

You kind of sort of mention this in the limitations, under the section "Commitments and high-level claims", but the proposed improved model is:

To reason about such debates, we further need a model which relates the different commitments, to arguments, initial answers, and each other. One way to get such a model is to view W as the set of assignments for a Bayesian network. In such setting, each question q ∈ Q would ask about the value of some node in W, arguments would correspond to claims about node values, and their connections would be represented through the structure of the network. Such a model seems highly structured, amenable to theoretical analysis, and, in the authors’ opinion, intuitive. It is, however, not necessarily useful for practical implementations of debate, since Bayes networks are computationally expensive and difficult to obtain.

This still seems to me to involve the format in which the judge can only update on the evidence presented in the debate (though it's hard to say without more details). I'd be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space, which enables the two points I listed above (decomposition and challenging).

----

Going through each of the examples in Section 4.2:

Unfair questions. A question may be difficult to debate when arguing for one side requires more complex arguments. Indeed, consider a feature debate in a world w uniformly sampled from the space of Boolean-featured worlds ∏_{i∈N} W_i = {0, 1}^N, and suppose the debate asks about the conjunctive function ϕ := W_1 ∧ … ∧ W_K for some K ∈ N.

This could be solved easily by regular debate, if the debaters can challenge each other. In particular, it can be solved in 1 step: if the opponent's answer is anything other than 1, challenge them with the question "which W_i is 0?", and if they do respond with some particular W_i, disagree with them, which the judge can check directly.

Arguably that question should be "out-of-bounds", because it's "more complex" than the original question. In that case, regular debate could solve it in O(log K) steps: use binary search to halve the interval on which the agents disagree, by challenging agents on the question W_i ∧ … ∧ W_j for the interval [i, j], starting from the interval [1, K].

Now, if K > 2^n, then even this strategy doesn't work. This is basically because at that size, even an exponential-sized tree of bounded agents is unable to figure out the true answer. This seems fine to me; if we really need even more powerful agents, we could do iterated debate. (This is effectively treating debate as an amplification step within the general framework of iterated amplification.)
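For concreteness, here is a toy sketch of the binary-search strategy (my own code and my own assumptions about how challenges would be adjudicated, not the paper's protocol). The debater arguing "0" names the half-interval it claims contains a zero, the interval of disagreement halves each round, and after about log2(K) rounds the judge only has to look up a single feature:

```python
def binary_search_debate(world):
    """`world` is the tuple (W_1, ..., W_K) of Boolean features.  The debater
    arguing that the conjunction is 0 repeatedly names the half-interval it
    claims contains a zero; the other debater disputes it; the judge only has
    to look up the single feature left over at the end."""
    lo, hi = 0, len(world)        # current interval of disagreement, [lo, hi)
    rounds = 0
    while hi - lo > 1:
        rounds += 1
        mid = (lo + hi) // 2
        # Simulate the "0" debater honestly pointing at a half that really
        # does contain a zero (here there is exactly one, so this is forced).
        if 0 in world[lo:mid]:
            hi = mid
        else:
            lo = mid
    return rounds, lo, world[lo]  # (challenge rounds, feature the judge checks, its value)

K = 1024
world = tuple(1 for _ in range(K - 1)) + (0,)   # all features true except the last
print(binary_search_debate(world))               # -> (10, 1023, 0)
```

If the "0" debater ever names a half that is actually all ones, the same recursion settles that half against them, so the lie only hurts them.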

Unstable debates. Even if a question does not bias the debate against the true answer as above, the debate outcome might still be uncertain until the very end. One way this could happen is if the judge always feels that more information is required to get the answer right. [...] consider the function ψ := xor(W_1, …, W_K) defined on worlds with Boolean features.

This case can also be handled via binary search, as above. But you could have other functions that don't decompose nicely, and then this problem would still occur. In that case, the optimal answer is 1/2, as you note; this seems fine to me? The judge started out with a belief of 1/2, and at the end of the debate it stayed the same. So the debate didn't help, but it didn't hurt either; it seems fine if we can't use debate for arbitrary questions, as long as it doesn't lie to us about those questions. (When using natural language, I would hope for an answer like "This debate isn't long enough to give evidence one way or the other".)

To achieve the “always surprised and oscillating” pattern, we consider a prior π under which each feature w_i is sampled independently from {0, 1}, but in a way that is skewed towards W_i = 0.

If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised. If you sampled a world from that prior and ran debate, then the expected surprise of the judge would be low. (See also the second bullet point in this comment.)

Distracting evidence. For some questions, there are misleading arguments that appear plausible and then require extensive counter-argumentation to be proven false.

This is the sort of thing where the full exponential tree can deal with it because of the ability to decompose the question, but a polynomial-time "evidence collection" conversation could not. In your specific example, you want the honest agent to be able to challenge the dishonest agent on the sub-claims that the distracting argument depends on. This allows you to quickly focus down on the point the agents actually disagree about, and then the honest agent only has to refute that one stalling case, allowing it to win the debate.

Thank you for the comments!


A quick reaction to the truth-seeking definition: When writing the definition (of truth-promotion), I imagined a (straw) scenario where I am initially uncertain about what the best answer is --- perhaps I have some belief, but upon reflection, I put little credence in it. In particular, I wouldn't be willing to act on it. Then I run the debate, become fully convinced that the debate's outcome is the correct answer, and act on it.

The other story seems also valid: you start out with some belief, update it based on the debate, and you want to know how much the debate helped. Which of the two options is better will, I guess, depend on the application in mind.


"I'd be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space,"

To dissolve a possible confusion: By "claims about a space of questions" you mean "a claim about every question from a space of questions"? Would this mean that the agents would commit to many claims at once (possibly more than the human judge can understand at once)? (Something I recall Beth Barnes suggesting.) Or do you mean that they would make a single "meta" claim, understandable by the judge, that specified many smaller claims (eg, "for any meal you ask me to cook, I will be able to cook it better than any of my friends"; horribly false, btw.)?

Anyway, yeah, I agree that this seems promising. I still don't know how to capture the relations between different claims (which I somehow expect to be important if we are to prove some guarantees for debate).


I agree with your high-level points regarding the feature debate formalization. I should clarify one thing that might not be apparent from the paper: the message of the counterexamples was meant to be "these are some general issues which we expect to see in debate, and here is how they can manifest in the feature debate toy model", rather than "these specific examples will be a problem in general debates". In particular, I totally agree that the specific examples immediately go away if you allow the agents to challenge each other's claims. However, I have an intuition that even with other debate protocols, similar general issues might arise with different specific examples.

For example, I would guess that even with other debate protocols, you will be "having a hard time when your side requires too difficult arguments". I imagine there will always be some maximum "inferential distance that a debater can bridge" (with the given judge and debate protocol), and any claim which requires more supporting argumentation than this will be a lost cause. What will such an example look like? Without a specific debate design, I can't really say. Either way, if this is true, it becomes important whether you can convincingly argue that a question is too difficult to explain (without this becoming a universal strategy even in cases where it shouldn't apply).


A minor point:

"If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised."

I agree with your point here --- debate being wrong in a very unlikely world is not a bug. However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior. So the claim should be "rational judges can have unstable debates in unlikely worlds" and "biased judges can have unstable debates even in typical worlds".

I broadly agree with all of this, thanks :)

By "claims about a space of questions" you mean "a claim about every question from a space of questions"?

I just wrote incorrectly; I meant "the agent can choose a question from a space of questions and make a claim about it". If you want to support claims about a space of questions, you could allow quantifiers in your questions.

However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior.

I mean, sure, but any alignment scheme is going to have to assume some amount of correctness in the human-generated information it is given. You can't learn about preferences if you model humans as arbitrarily wrong about their preferences.

This looks really interesting to me. I remember when the Safety via Debate paper originally came out; I was quite curious to see more work on modelling debate environments and getting a better sense of how well we should expect debate to perform in which kinds of situations. From what I can tell, this is a rigorous attempt at one or two such models.

I noticed that this is more mathematically intense than most other papers I'm used to in this area. I started going through it but was a bit intimidated. I was wondering if you might have tips for reading through and understanding it. Do readers need to know some measure theory, or other specific areas of math that may be a bit intense for what we're used to on LessWrong? Is there anything else we should read first, or make sure we know, to help prepare accordingly?

I guess on first reading, you can cheat by reading the introduction, Section 2 right after that, and the conclusion. One level above that is reading the text but skipping the more technical sections (4 and 5). Or possibly reading 4 and 5 as well, but only focusing on the informal meaning of the formal results.

Regarding the background knowledge required for the paper: it uses some game theory (Nash equilibria, extensive-form games) and probability theory (expectations, probability measures, conditional probability). Strictly speaking, you can get all of this from looking up the relevant keywords on Wikipedia. I think all of the concepts used there are basic in the corresponding fields, and in particular no special knowledge of measure theory is required. However, I studied both game theory and measure theory, so I am biased and you shouldn't trust me. (Moreover, there is a difference between "strictly speaking, only this is needed" and "my intuitions are informed by X, Y, and Z".)

Another thing is that the AAAI workshop where this will appear has a page limit, which means that some explanations might have gotten less space than they deserve. In particular, the arguments in Section 4 are much easier to digest if you can draw the functions that the text talks about. To understand the formal results, I visualized two-dimensional slices of the "world space" (i.e., squares) and assumed that the value of the function is 0 by default, except for being 1 on some selected subset of the square. This lets you compute all the expectations and conditionals visually.

Thanks! That's helpful.

It's really a pity that the page limit forces abridged explanations. I imagine that ideally you could release an expanded version on arXiv, but I realize that's often not practical.

If it's interesting to you, I'd be happy to talk about my ideas around AI safety via dialectic, an approach that can be made to look like debate and generally fits the IDA paradigm, and encourage you to run with the idea if you like. I wrote vaguely about the idea a while back, and think it could be interesting to pursue, but am not actively working on it because I don't think it has the highest comparative leverage for me.