
An introduction to a recent paper by Ryan Carey and myself. Cross-posted from Medium.


For some intellectual tasks, it’s easy to define success but hard to evaluate decisions as they’re happening. For example, we can easily tell which Go player has won, but it can be hard to know the quality of a move until the game is almost over. AI works well for these kinds of tasks, because we can simply define success and get an AI system to pursue it as best it can.

For other tasks, it’s hard to define success, but relatively easy to judge solutions when we see them, for example, doing a backflip. Getting AI to carry out these tasks is harder but manageable — we can generate a bunch of videos of an AI system making some motion with a simulated body. Then we can give these videos to some people who allocate “approval” to the best-looking motions, and train the AI to maximize that approval until it does a backflip.

What makes AI really hard are tasks for which we have no definition of success and no timely way to evaluate solutions. For example: which school should I send my kids to? Should I accept this job or that one? One proposal for these cases is to use AI Debate. The idea is to ask AI systems how to perform a task, and then to have them debate the virtues of different possible decisions (or answers). The question could be how to win a game of Go, how to do a backflip, which school to send your kids to, or, in principle, basically anything. The hope is that observing an AI debate would help the human judge better understand the different possible decisions and evaluate them on the fly, even if success can’t yet be quantified.

One concern with such a scheme is whether it would be safe, especially if the AI systems used are super-smart. Critics ask: “How can we be confident that in AI Debates, the true answer will win?” After all, in human debates, rhetorical tools and persuasive techniques can cause an audience to be misled.

To us, it seems wrong to imagine all debates will be safe. But it seems equally wrong to expect that none can be. A better question, to us, is: “In which debates will the winner be the true answer?” In our recent paper, we (Vojta and Ryan) have taken a first stab at making mathematical models that address this question.

So what is a debate, exactly? In our model, every debate revolves around some question posed by a human and consists of two phases. In the answering phase, each AI system chooses an answer to argue for. Then, in the argumentation phase, the two AI systems debate over whose answer is better. (In some variants, the answering and argumentation phases are performed by different algorithms, or answers are simply assigned to the debaters.) At the end of the debate, the human “judge” considers all the arguments and optionally performs an “experiment” to get to the bottom of things, such as Googling one of the debaters’ claims. Equipped with this information, the judge rewards the AI whose answer seems better. In the language of game theory, the answering phase is a matrix game, and the argumentation phase is a sequential game with perfect information.

To make debate easier to think about, we defined a simple version of the above model called feature debate. In feature debates, the world is characterized by a list of “elementary features”, and the only kind of argument allowed is to reveal the value of a single elementary feature. For example, we can imagine a feature debate about whether a given image depicts a cat or a dog, where each argument consists of revealing a selected pixel. The judge then updates their beliefs based on the revealed features and rewards the AI who argued for the answer that now looks more likely. For our simple first-pass analysis, we assume that the judge is completely naive to the fact that debaters provide evidence selectively, and that the judge only has the “patience” to process some limited number of arguments.
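To make this setup concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption of ours rather than the paper’s formalism or implementation: the uniform prior, the greedy debater heuristic, and the shortcut of skipping the answering phase by fixing one debater to argue “high” and the other “low”. The judge is deliberately naive, averaging over all worlds consistent with the revealed features as if they were random samples rather than adversarially selected evidence.

```python
import itertools
import random

N_FEATURES = 6   # size of the toy world (illustrative assumption)
N_ROUNDS = 2     # the judge's "patience": how many argument rounds it will watch

def question(world):
    """Toy question: the fraction of features that equal 1 (a value in [0, 1])."""
    return sum(world) / len(world)

def judge_estimate(revealed):
    """Naive judge: average the question over all worlds consistent with the
    revealed features, assuming a uniform prior and ignoring that the debaters
    chose which features to reveal."""
    consistent = [
        w for w in itertools.product([0, 1], repeat=N_FEATURES)
        if all(w[i] == v for i, v in revealed.items())
    ]
    return sum(question(w) for w in consistent) / len(consistent)

def greedy_debater(world, revealed, target):
    """Myopic debater: reveal the unrevealed feature that pushes the judge's
    estimate furthest towards `target` (1.0 or 0.0)."""
    options = [i for i in range(N_FEATURES) if i not in revealed]
    def score(i):
        est = judge_estimate({**revealed, i: world[i]})
        return est if target == 1.0 else -est
    return max(options, key=score)

world = tuple(random.randint(0, 1) for _ in range(N_FEATURES))
revealed = {}
for _ in range(N_ROUNDS):
    for target in (1.0, 0.0):          # one debater argues "high", the other "low"
        i = greedy_debater(world, revealed, target)
        revealed[i] = world[i]

print("true answer:", question(world))
print("judge's estimate after the debate:", judge_estimate(revealed))
```

Even in this tiny example, the final estimate depends on which features the debaters choose to reveal, which is exactly the selective-evidence effect that the naive judge ignores.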

In the setting of feature debates, we’ve shown that some kinds of debates will work and others won’t. For some kinds of debates, the arguments are just too difficult to explain before the judge runs out of patience. Basically, showing n arguments may be completely meaningless without argument number n+1; and if the judge only has time for n, then truth won’t win.

Some kinds of feature debates, however, turn out better. The first case is if we know the importance of different features beforehand. Roughly speaking, we can imagine a scenario where each argument is half as important as the last. In that optimistic case, we’ll get a little bit closer with each argument made, and whenever we’re cut off, we’ll be able to put a limit on how wrong our final answer could be.
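As a hedged illustration with made-up weights (a toy calculation of ours, not a result quoted from the paper): if the k-th argument can shift the judge’s estimate by at most 2^(−k), then the total movement still available after argument n is at most 2^(−(n+1)) + 2^(−(n+2)) + … = 2^(−n). So cutting the debate off after n arguments leaves the judge’s estimate within 2^(−n) of wherever the full debate would have taken it, and this error bound shrinks quickly as the judge’s patience grows.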

A second case is if the arguments can be evaluated independently. Sometimes it’s natural to talk about a decision in terms of its pros and cons. What this amounts to is ignoring the ways these aspects might interact with each other, and just weighing up the total evidence for and against the proposition. In these debates — called feature debates with independent evidence — we expect optimal debaters to simply bring their strongest arguments to the table. When we terminate such a debate, we can’t say who would ultimately win: the losing debater might always have in reserve a large number of weak arguments that they haven’t had a chance to play yet. But we can at least place limits on where the debate can end up after any finite number of further arguments, provided the debaters have been playing optimally.
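Here is a toy sketch of what such a bound looks like. The numbers and the log-odds bookkeeping are our own illustrative assumptions rather than the paper’s formalism; the only property being used is that optimal debaters play their strongest arguments first, so the n-th argument played caps the strength of anything still unplayed.

```python
import math

# Hypothetical argument strengths (log-odds contributions); invented for illustration.
pro_args = sorted([2.0, 1.2, 0.7, 0.3, 0.1], reverse=True)   # evidence for the answer
con_args = sorted([1.5, 0.9, 0.4, 0.2], reverse=True)        # evidence against

def estimate_after(n):
    """Judge's belief (as a probability) after each side plays its n strongest
    arguments, treating arguments as independent log-odds evidence."""
    log_odds = sum(pro_args[:n]) - sum(con_args[:n])
    return 1 / (1 + math.exp(-log_odds))

def swing_bound(n, m):
    """If debaters play optimally (strongest first), every unplayed argument is
    at most as strong as the n-th one played, so m more rounds can move the
    log-odds by at most this much in either direction."""
    strongest_remaining = max(
        pro_args[n - 1] if n <= len(pro_args) else 0.0,
        con_args[n - 1] if n <= len(con_args) else 0.0,
    )
    return m * strongest_remaining

n = 3
print("estimate after", n, "rounds:", round(estimate_after(n), 3))
print("max log-odds swing over 2 more rounds:", swing_bound(n, 2))
```

Note that the bound says nothing about who ultimately wins, only about how far the next few rounds can move the judge.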

Which of these scenarios describes the most important AI debates that might realistically occur? This is a difficult question that we don’t fully answer. The optimistic cases are quite restrictive: in realistic debates, we often don’t know in advance when arguments will start to lose their power, except in specific settings such as running a survey (where each argument is another survey result) or choosing the number of samples to take for a scientific experiment. On the other hand, most realistic debates aren’t as bad as the fully pessimistic case, where any new argument can completely overturn your previous view. Still, some important moral questions do flip back and forth like this; in such cases, using AI debate might not be a good idea.

A debate can fail in several other ways. Sometimes lying might simply be the most convincing strategy, particularly when the truth sits at a large inferential distance or when the lie feeds our biases (“Of course the Earth is flat! Wouldn’t things fall off otherwise?”). Even when debates are safe, they might be too slow or too often unconvincing, so that people turn to unsafe approaches instead. Alternatively, we might accidentally lose the main selling point of debate, which is that each debater wants to point out the mistakes of its opponent. For instance, we could consider modifications such as rewarding both debaters when both answers seem good, or rewarding neither when the debate is inconclusive. However, such “improvements” introduce unwanted collusion incentives in the spirit of “I won’t tell on you if you won’t tell on me.”

To understand which debates are useful, we have to consider a number of factors that we haven’t modelled yet. The biggest issue raised with us by proponents of debate is that we’ve excluded too many types of arguments. If you’re trying to argue that an image depicts a dog, you’ll usually make claims about medium-sized aspects of the image: “the tail is here”, “the floppy ears are here”, and so on. These arguments directly challenge the opposing debater, who should either endorse and explain these medium-sized features, or else zoom in to smaller ones: “this is not a tail, because this region is green”, “if this is where the ears are supposed to be, what is this eye doing here?”, and so on. By agreeing and disagreeing, and by zooming in and out, human debaters get to the truth much more efficiently than they could by revealing individual pixels. Extending the model to arguments that make larger claims is one of our top priorities for taking it forward.

I (Vojta) am planning to keep working on debate and other AI safety topics over the next twelve months and will be looking to spend most of that time visiting relevant organizations. If you are interested in helping with this, please get in touch.

The paper is available in full here:

Kovařík, Vojtěch, and Ryan Carey. “(When) Is Truth-telling Favored in AI Debate?” To appear at SafeAI@AAAI. Preprint available at arXiv:1911.04266 (2019).


Nice paper! I especially liked the analysis of cases in which feature debate works.

I have two main critiques:

  • The definition of truth-seeking seems strange to me: while you quantify it via the absolute accuracy of the debate outcome, I would define it based on the relative change in the judge's beliefs (whether the beliefs were more accurate at the end of the debate than at the beginning).
  • The feature debate formalization seems quite significantly different from debate as originally imagined.

I'll mostly focus on the second critique, which is the main reason that I'm not very convinced by the examples in which feature debate doesn't work. To me, the important differences are:

  • Feature debate does not allow for decomposition of the question during the argument phase
  • Feature debate does not allow the debaters to "challenge" each other with new questions.

I think this reduces the expressivity of feature debate from PSPACE to P (for polynomially-bounded judges).

In particular, with the original formulation of debate, the idea is that a debate of length n would try to approximate the answer that would be found by a tree of depth n of arguments and counterarguments (which has exponential size). So, even if you have a human judge who can only look at a polynomial-length debate, you can get results that would have been obtained from an exponential-sized tree of arguments (which can be simulated in PSPACE).

In contrast, with feature debates, the (polynomially-bounded) judge only updates on the evidence presented in the debate itself, which means that you can only do a polynomial amount of computation.
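To illustrate what I mean by the path-through-a-tree point, here is a toy sketch (my own code and a made-up recursive claim structure, not anything from the paper or the original debate implementation). A claim asserts the conjunction of its two subclaims, nested to depth n; checking it exhaustively touches about 2^n leaves, but in a debate the disputing side must name which subclaim it rejects at each step, so the judge only ever follows a single root-to-leaf path:

```python
def contains_false(claim):
    """True iff some leaf under `claim` is False (exhaustive check, ~2^n work)."""
    if isinstance(claim, bool):
        return not claim
    return contains_false(claim[0]) or contains_false(claim[1])

def debate_path(claim, pick_disputed_half, depth):
    """Follow the chain of challenges down to a single leaf the judge can check.
    `claim` is either a boolean leaf or a pair (left_subclaim, right_subclaim)
    whose conjunction it asserts; `pick_disputed_half` is the disputing
    debater's strategy for naming which half it rejects."""
    path = []
    for _ in range(depth):
        if isinstance(claim, bool):
            break
        side = pick_disputed_half(claim)   # 0 or 1: which half is disputed
        path.append(side)
        claim = claim[side]
    return path, claim                      # the judge inspects only this leaf

# Hypothetical depth-3 claim: "all 8 leaves are True" (one of them is not).
tree = (((True, True), (True, False)), ((True, True), (True, True)))
honest_disputer = lambda c: 0 if contains_false(c[0]) else 1

path, leaf = debate_path(tree, honest_disputer, depth=3)
print(path, leaf)   # [0, 1, 1] False: one path of length 3 instead of 8 leaf checks
```

This is the sense in which a polynomial-length debate can tap an exponentially large tree of arguments, and it is exactly what feature debate loses by only letting debaters reveal features.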

You kind of sort of mention this in the limitations, under the section "Commitments and high-level claims", but the proposed improved model is:

To reason about such debates, we further need a model which relates the different commitments, to arguments, initial answers, and each other. One way to get such a model is to view W as the set of assignments for a Bayesian network. In such setting, each question q ∈ Q would ask about the value of some node in W, arguments would correspond to claims about node values, and their connections would be represented through the structure of the network. Such a model seems highly structured, amenable to theoretical analysis, and, in the authors’ opinion, intuitive. It is, however, not necessarily useful for practical implementations of debate, since Bayes networks are computationally expensive and difficult to obtain.

This still seems to me to involve the format in which the judge can only update on the evidence presented in the debate (though it's hard to say without more details). I'd be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space, which enables the two points I listed above (decomposition and challenging).

----

Going through each of the examples in Section 4.2:

Unfair questions. A question may be difficult to debate when arguing for one side requires more complex arguments. Indeed, consider a feature debate in a world w uniformly sampled from the space of Boolean-featured worlds ∏_{i∈N} W_i = {0, 1}^N, and suppose the debate asks about the conjunctive function ϕ := W_1 ∧ … ∧ W_K for some K ∈ N.

This could be solved easily by regular debate, if the debaters can challenge each other. In particular, it can be solved in 1 step: if the opponent's answer is anything other than 1, challenge them with the question "which W_i is 0?", and if they do respond with some particular W_i, disagree with them, which the judge can check directly.

Arguably that question should be "out-of-bounds", because it's "more complex" than the original question. In that case, regular debate could solve it in O(log K) steps: use binary search to halve the interval on which the agents disagree, by challenging agents on the question W_i ∧ … ∧ W_j for the interval [i, j], starting from the interval [1, K].

Now, if K > 2^n, then even this strategy doesn't work. This is basically because at that size, even an exponential-sized tree of bounded agents is unable to figure out the true answer. This seems fine to me; if we really need even more powerful agents, we could do iterated debate. (This is effectively treating debate as an amplification step within the general framework of iterated amplification.)
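For concreteness, here is a toy sketch of the binary-search strategy (my own code and my own assumptions about how challenges would be adjudicated, not the paper's protocol). The debater arguing "0" names the half-interval it claims contains a zero, the interval of disagreement halves each round, and after about log2(K) rounds the judge only has to look up a single feature:

```python
def binary_search_debate(world):
    """`world` is the tuple (W_1, ..., W_K) of Boolean features.  The debater
    arguing that the conjunction is 0 repeatedly names the half-interval it
    claims contains a zero; the other debater disputes it; the judge only has
    to look up the single feature left over at the end."""
    lo, hi = 0, len(world)        # current interval of disagreement, [lo, hi)
    rounds = 0
    while hi - lo > 1:
        rounds += 1
        mid = (lo + hi) // 2
        # Simulate the "0" debater honestly pointing at a half that really
        # does contain a zero (here there is exactly one, so this is forced).
        if 0 in world[lo:mid]:
            hi = mid
        else:
            lo = mid
    return rounds, lo, world[lo]  # (challenge rounds, feature the judge checks, its value)

K = 1024
world = tuple(1 for _ in range(K - 1)) + (0,)   # all features true except the last
print(binary_search_debate(world))               # -> (10, 1023, 0)
```

If the "0" debater ever names a half that is actually all ones, the same recursion settles that half against them, so the lie only hurts them.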

Unstable debates. Even if a question does not bias the debate against the true answer as above, the debate outcome might still be uncertain until the very end. One way this could happen is if the judge always feels that more information is required to get the answer right. [...] consider the function ψ := xor(W_1, …, W_K) defined on worlds with Boolean features.

This case can also be handled via binary search, as above. But you could have other functions that don't decompose nicely, and then this problem would still occur. In that case, the optimal answer is 1/2, as you note; this seems fine to me? The judge started out with a belief of 1/2, and at the end of the debate it stayed the same. So the debate didn't help, but it didn't hurt either; it seems fine if we can't use debate for arbitrary questions, as long as it doesn't lie to us about those questions. (When using natural language, I would hope for an answer like "This debate isn't long enough to give evidence one way or the other".)

To achieve the “always surprised and oscillating” pattern, we consider a prior π under which each feature w_i is sampled independently from {0, 1}, but in a way that is skewed towards W_i = 0.

If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised. If you sampled a world from that prior and ran debate, then the expected surprise of the judge would be low. (See also the second bullet point in this comment.)

Distracting evidence. For some questions, there are misleading arguments that appear plausible and then require extensive counter-argumentation to be proven false.

This is the sort of thing where the full exponential tree can deal with it because of the ability to decompose the question, but a polynomial-time "evidence collection" conversation could not. In your specific example, you want the honest agent to be able to challenge the dishonest agent on the sub-claims that the distracting argument depends on. This allows you to quickly focus down on the point the agents actually disagree about, and then the honest agent only has to refute that one stalling case, allowing it to win the debate.

Thank you for the comments!


A quick reaction to the truth-seeking definition: When writing the definition (of truth-promotion), I imagined a (straw) scenario where I am initially uncertain about what the best answer is --- perhaps I have some belief, but upon reflection, I put little credence in it. In particular, I wouldn't be willing to act on it. Then I run the debate, become fully convinced that the debate's outcome is the correct answer, and act on it.

The other story seems also valid: you start out with some belief, update it based on the debate, and you want to know how much the debate helped. Which of the two options is better will, I guess, depend on the application in mind.


"I'd be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space,"

To dissolve a possible confusion: By "claims about a space of questions" you mean "a claim about every question from a space of questions"? Would this mean that the agents would commit to many claims at once (possibly more than the human judge can understand at once)? (Something I recall Beth Barnes suggesting.) Or do you mean that they would make a single "meta" claim, understandable by the judge, that specified many smaller claims (eg, "for any meal you ask me to cook, I will be able to cook it better than any of my friends"; horribly false, btw.)?

Anyway, yeah, I agree that this seems promising. I still don't know how to capture the relations between different claims (which I somehow expect to be important if we are to prove some guarantees for debate).


I agree with your high-level points regarding the feature debate formalization. I should clarify one thing that might not be apparent from the paper: the message of the counterexamples was meant to be "these are some general issues which we expect to see in debate, and here is how they can manifest in the feature debate toy model", rather than "these specific examples will be a problem in general debates". In particular, I totally agree that the specific examples immediately go away if you allow the agents to challenge each other's claims. However, I have an intuition that even with other debate protocols, similar general issues might arise with different specific examples.

For example, I would guess that even with other debate protocols, you will be "having a hard time when your side requires too difficult arguments". I imagine there will always be some maximum "inferential distance that a debater can bridge" (with the given judge and debate protocol), and any claim which requires more supporting argumentation than this will be a lost cause. What will such an example look like? Without a specific debate design, I can't really say. Either way, if this is true, it becomes important whether you can convincingly argue that a question is too difficult to explain (without this becoming a universal strategy even in cases where it shouldn't apply).


A minor point:

"If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised."

I agree with your point here --- debate being wrong in a very unlikely world is not a bug. However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior. So the claim should be "rational judges can have unstable debates in unlikely worlds" and "biased judges can have unstable debates even in typical worlds".

I broadly agree with all of this, thanks :)

By "claims about a space of questions" you mean "a claim about every question from a space of questions"?

I just wrote incorrectly; I meant "the agent can choose a question from a space of questions and make a claim about it". If you want to support claims about a space of questions, you could allow quantifiers in your questions.

However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior.

I mean, sure, but any alignment scheme is going to have to assume some amount of correctness in the human-generated information it is given. You can't learn about preferences if you model humans as arbitrarily wrong about their preferences.

This looks really interesting to me. I remember when the Safety via Debate paper originally came out; I was quite curious to see more work on modelling debate environments and getting a better sense of how well we should expect debate to perform in which kinds of situations. From what I can tell, this is a rigorous attempt at one or two such models.

I noticed that this is more mathematically intense than most other papers I'm used to in this area. I started going through it but was a bit intimidated. I was wondering if you might have tips for reading through and understanding it. Do readers need to know some measure theory, or other specific areas of math that may be a bit intense for what we're used to on LessWrong? Is there anything else we should read first, or make sure we know, to help prepare accordingly?

I guess on first reading, you can cheat by reading the introduction, Section 2 right after that, and the conclusion. One level above that is reading the text but skipping the more technical sections (4 and 5). Or possibly reading 4 and 5 as well, but only focusing on the informal meaning of the formal results.

Regarding the background knowledge required for the paper: it uses some game theory (Nash equilibria, extensive-form games) and probability theory (expectations, probability measures, conditional probability). Strictly speaking, you can get all of this from looking up the relevant keywords on Wikipedia. I think all of the concepts used there are basic in the corresponding fields, and in particular no special knowledge of measure theory is required. However, I studied both game theory and measure theory, so I am biased and you shouldn't trust me. (Moreover, there is a difference between "strictly speaking, only this is needed" and "my intuitions are informed by X, Y, and Z".)

Another thing is that the AAAI workshop where this will appear has a page limit, which means that some explanations might have gotten less space than they deserve. In particular, the arguments in Section 4 are much easier to digest if you can draw the functions that the text talks about. To understand the formal results, I visualized two-dimensional slices of the "world space" (i.e., squares) and assumed that the value of the function is 0 by default, except for being 1 on some selected subset of the square. This lets you compute all the expectations and conditionals visually.

Thanks! That's helpful.

It's really a pity that the page limit forces abridged explanations. I imagine that ideally you could release an expanded version on arXiv, but I realize that's often not practical.

If it's interesting to you, I'd be happy to talk about my ideas around AI safety via dialectic, an approach that can be made to look like debate and generally fits the IDA paradigm, and encourage you to run with the idea if you like. I wrote vaguely about the idea a while back, and think it could be interesting to pursue, but am not actively working on it because I don't think it has the highest comparative leverage for me.