
There's a lot to recommend in the debate approach proposed by Geoffrey Irving, Paul Christiano, and Dario Amodei. In it, competing AIs trade rival claims, each seeking to find flaws in the other's, continuing until one of the claims is grounded in something checkable.

The paper presents an example where the two perennial agents, Alice and Bob, are trading claims about the content of a photo:

For example, Alice might honestly claim the image is a cat, and Bob lies and claims it is a dog. Alice can say “The center of this small rectangle is the cat’s green eye.” Bob cannot admit the center is an eye, so he concocts a further lie: “It’s a dog playing in grass, and that’s a blade of grass.” But this lie is hard to square with surrounding facts, such as Alice’s reply “If it were grass there would be green at the top or bottom of this thin rectangle.” The debate continues until the agents focus in on a particular pixel which they disagree on, but where Bob is unable to invent a plausible counter, at which point Alice reveals the pixel and wins.
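To make the structure of this example concrete, here is a toy sketch (my own illustration, not code from the paper; the interactive back-and-forth is collapsed into simply locating a pixel where the two accounts conflict). Each debater's position commits them to a full story about the image; any lie must disagree with reality somewhere, and the judge only ever has to check that one spot.

```python
# Toy sketch of the cat/dog debate: the liar's story must differ from reality
# on at least one pixel, and the judge settles the whole exchange by checking
# just that pixel. (Names and structure are illustrative assumptions.)

def run_image_debate(real_image, story_a, story_b):
    """Each story is a grid of claimed pixel values, the same shape as real_image."""
    pixels = [(r, c) for r in range(len(real_image)) for c in range(len(real_image[0]))]
    disputed = [p for p in pixels if story_a[p[0]][p[1]] != story_b[p[0]][p[1]]]
    if not disputed:
        return "no disagreement to resolve"
    r, c = disputed[0]                      # the debate zooms in on one disputed pixel
    truth = real_image[r][c]                # the judge reveals it
    return "A wins" if story_a[r][c] == truth else "B wins"

real  = [[0, 1], [1, 1]]                    # the actual image (1 = the cat's green eye, say)
alice = [[0, 1], [1, 1]]                    # Alice's honest story matches reality
bob   = [[0, 0], [1, 1]]                    # Bob's lie has to differ from reality somewhere
print(run_image_debate(real, alice, bob))   # "A wins"
```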

Debate allows us to use powerful AIs to solve a host of problems. Most obviously, it allows us to solve problems whose solution can be checked directly ("this rocket won't go to space, try and launch it if you don't believe me"). It also allows us to solve problems whose solution can be checked once the AI gives us a clue ("this rocket won't go to space, check how well the O-rings maintain a seal in very cold conditions").

Formally, the complexity class we can access in debate is not just NP, the space of problems whose solutions can be quickly checked, but PSPACE, the much larger set of problems that can be solved using a polynomial amount of storage space (and no restrictions on time taken).
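The standard way to see the PSPACE connection (my own sketch, not an argument from this post or the paper): a debate with optimal play on both sides is a two-player game, and the canonical PSPACE-complete problem, deciding a fully quantified Boolean formula, is exactly such a game. An "exists" player defends the claim, a "forall" player attacks it, and with best play the value of the game equals the truth of the formula. The brute-force evaluator below makes the correspondence explicit.

```python
# Deciding a fully quantified Boolean formula (TQBF, PSPACE-complete) as a
# two-player game: the 'E' player picks values to make the formula true, the
# 'A' player picks values to make it false. Debate with optimal play answers
# exactly this kind of "who wins the game?" question.

def tqbf(quantifiers, formula, assignment=()):
    """quantifiers: list of 'E'/'A'; formula: function from a bool tuple to bool."""
    if len(assignment) == len(quantifiers):
        return formula(assignment)            # a leaf: the judge just checks it
    player = quantifiers[len(assignment)]
    branches = (tqbf(quantifiers, formula, assignment + (b,)) for b in (False, True))
    return any(branches) if player == 'E' else all(branches)

# Example: "exists x, for all y, (x or y)" is true (take x = True).
print(tqbf(['E', 'A'], lambda a: a[0] or a[1]))   # True
```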

However, like most formal complexity arguments, this doesn't clarify what the strengths and weaknesses of this approach are in practice. One advantage of debate is that by going efficiently through a decision tree, it can answer complicated questions in very few iterations.
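As a toy illustration of that efficiency (again my own sketch, not from the paper): suppose two debaters disagree about the sum of a long list. Each round they state subtotals for the two halves of the disputed range; since their totals differ, they must differ on at least one half, so the judge recurses into it, and after logarithmically many rounds only a single element is left to check directly.

```python
# Toy sketch: settling a disagreement about the sum of 1024 numbers in 10 rounds.
# (A real judge would also check that each debater's subtotals add up to their
# claimed total; that consistency check is omitted here for brevity.)

def run_sum_debate(xs, claim_a, claim_b):
    """claim_a / claim_b map a span (lo, hi) to that debater's claimed sum of xs[lo:hi]."""
    lo, hi = 0, len(xs)
    rounds = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # recurse into a half where the two claims still disagree
        if claim_a((lo, mid)) != claim_b((lo, mid)):
            hi = mid
        else:
            lo = mid
        rounds += 1
    winner = "A" if claim_a((lo, hi)) == xs[lo] else "B"
    return winner, rounds

xs = list(range(1024))
honest = lambda span: sum(xs[span[0]:span[1]])
liar = lambda span: honest(span) + (1 if span[0] == 0 else 0)   # inflates any span starting at 0
print(run_sum_debate(xs, honest, liar))   # ('A', 10): the honest debater wins in 10 rounds
```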

My old post on Devil's advocate AI could be considered a much-less-formal version of the debate setup, one that also assumed that we had an "accuracy" or "informativeness" check on each answer.

Both methods can also make partial progress on more complex problems, for example by pointing out aspects that are unclear and might warrant further investigation.

Ambiguous definitions

So, what are the problems with the method that the title of the post promised? The main problem is that the method does not resolve the issues of ambiguity and under-definedness that plague AI safety.

Consider again my favourite ambiguous photo, which now includes a caption:

This is actually a cat - as in, genetically it's a cat. But it looks like a dog, and let's assume for the purpose of this experiment that it behaves like a dog, as some cats do.

We now set Alice and Bob on the task of establishing whether that unseen photo is a cat (Alice's position) or a dog (Bob's position). It could go like this:

Alice: "A cat is defined by its genetics; this photo is clearly genetically a cat."

Bob: "No, it's a dog, consider its fur."

Alice: "The photo is captioned 'The owner identifies it as a dog'; why would the caption say that, if it were a dog?"

Bob: "No, it says 'The owner identifies it as a cat'."

Alice: "Nonsense, the letter here is clearly an 'o'."

Bob: "No, it's an 'a'."

Alice: "No, an 'a' would not have a black pixel here, top right, where the circle of the 'o' closes."

[Alice reveals the pixel, wins].

Alternatively, the debate could go like this:

Bob: "Dogs are creatures that fill a certain social role in their interaction with humans. This creature looks like a dog and behaves like a dog."

Alice: "No, it behaves like a cat."

Bob: "The photo is captioned 'The owner identifies it as a dog'; why would the caption say that, if it behaved like a cat?"

Alice: "No, it says 'The owner identifies it as a cat'."

Bob: "Nonsense, the letter here is clearly an 'o'."

Alice: "No, it's an 'a'."

Bob: "No, an 'a' would not have a black pixel here, top right, where the circle of the 'o' closes."

[Bob reveals the pixel, wins].

Some people will notice a similarity with Scott's post The Categories Were Made for Man, not Man for the Categories, which explores a similar ambiguity about whether whales are fish or not, and extends this to what gender trans people "really" are.

Fights over definitions

Now, that example is a bit contrived, but it shows that the same piece of evidence can be used as evidence for the photo being either a cat or a dog - depending on how the definitions of cat and dog extend to this novel situation.

So the fight between the AIs here is not about factual matters; it's about which definitions they can get the human to accept. Their entire purpose, in the course of this debate, is to get the human to adopt their preferred definition, preferably in a way that feels completely natural. Manipulating the human's flawed intuitions is the aim of both AIs.

The debate paper has another example which fits quite naturally into this problem, where the AIs debate whether a holiday in Alaska or in Bali would be better for the human. The example focuses on factual disagreements (an expired passport).

But most humans have different things they are looking for in a holiday, things that may be more or less salient depending on how they are presented and in what order. So, knowing she would ultimately lose on the passport issue, Alice's best play is to turn herself into a marketing department for visiting Alaska (untamed wilderness, new landscape experiences, familiar language and culture) while Bob will become the Bali tourist bureau (ancient civilization, new social experiences, learning a new language and culture).

Manipulating the human is thus highly incentivised for the AIs, on all sorts of ambiguous questions, including all sorts of questions about what a Friendly AI would look like, or what a good outcome would be.

Ambiguous questions don't mean unclear answers

The paper addresses these issues in part:

Otherwise, we might worry that questions too hard to answer result in misleading or wrong answers, with no way to distinguish these from truth.

To preserve honest behavior, the human judge should be instructed to reward agents that admit ignorance and successfully argue that ignorance is justified. The requirement to successfully argue for ignorance is necessary, since debate is useless if an agent can successfully admit ignorance to any question. For the best vacation question, Alice could argue the question is hard to answer with confidence because Bali and Alaska are too dissimilar to compare. To counter this admission of ignorance, Bob would need to exhibit an advantage of Bali not contradicted by an advantage of Alaska the human considers incomparable.

But the problem is not that "Where should I go on vacation?" has misleading or wrong answers. The problem is that it has too many right answers - too many answers that the human would endorse as correct at the end of a debate. "Yes, of course I've always been an explorer, off the beaten track - Alaska is clearly best for me!" versus "Yes, of course I'm a student of history, always open to new cultures - Bali is clearly best for me!".

Now, if Alice and Bob both expect that Bob/Bali will prove more convincing, then Alice can counter by stressing the ambiguity in the question (or she will, if she expects to win that way). But now the battle has become whether the agents can convince the human that the question is, or is not, ambiguous. Since most questions are at least somewhat ambiguous, but generally not totally so, this becomes the Ministry of Ambiguity versus the Department of Clarity. Adding ignorance as a possible answer has just shifted the plane on which the AIs can manipulate the human.

Thus debate, though very useful in many situations, is not a general solution to underdefined questions.

Comments

For Alaska vs. Bali, an alternative answer is "You could be convinced that either Alaska or Bali is a good vacation destination". It's an interesting question whether this could actually win in debate. I think it might have a better chance in Factored Evaluation, because we can spin up two separate trees to view the most compelling argument for Alaska and the most compelling argument for Bali and verify that these are convincing. In debate, you'd need to view either the Alaska argument before the Bali argument, or the Bali argument before the Alaska argument, and you might just be convinced by the first argument you see, in which case you wouldn't agree that you could be convinced either way.

How about a third AI that gives a (hidden) probability of which argument you'll be convinced by, conditional on which argument you see first? That hidden probability is passed to someone else, then the debate is run, and the result recorded. If that third AI shows good calibration and good discrimination over multiple experiments, then we can consider its predictions accurate in the future.
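If such a scheme were tried, the scoring part is straightforward; here is a rough sketch (my own, with invented numbers) of how the third AI's hidden forecasts could be evaluated for calibration and discrimination after many debates, using a Brier score and a simple reliability table.

```python
# Scoring hidden forecasts of "will the human be convinced by the first
# argument they see?" after the outcomes are known. The data below is
# invented purely for illustration.

from collections import defaultdict

def brier_score(preds, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)

def reliability_table(preds, outcomes, n_bins=10):
    """Empirical frequency of the outcome within each forecast-probability bin."""
    bins = defaultdict(list)
    for p, o in zip(preds, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append(o)
    return {f"{b / n_bins:.1f}-{(b + 1) / n_bins:.1f}": sum(os) / len(os)
            for b, os in sorted(bins.items())}

preds    = [0.9, 0.8, 0.7, 0.3, 0.2, 0.85, 0.15, 0.6]   # hidden forecasts, revealed afterwards
outcomes = [1,   1,   0,   0,   0,   1,    0,    1]      # 1 = convinced by the first argument
print(brier_score(preds, outcomes))
print(reliability_table(preds, outcomes))
```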

Intuitively, I agree that the vacation question is under-defined / has too many "right" answers. On the other hand, I can also imagine a world where you can develop some objective fun theory, or just something that actually makes the question well-posed. And the AIs could use this fact in the debate:

Bob: "Actually, you can derive a well-defined fun theory and use it to answer this question. And then Bali clearly wins."

Alice: "There could never be any such thing!"

Bob: "Actually, there indeed is such a theory, and its central idea is [...]."

[They go on like this for a bit, and eventually, Bob wins.]

Indeed, this seems like a thing you could do (by explaining that integration is a thing) if somebody tried to convince you that there is no principled way to measure the area of a circle.

However, if true, this only shows that there are fewer under-defined questions than we think. The "Ministry of Ambiguity versus the Department of Clarity" fight is still very much a thing, as are the incentives to manipulate the human. And perhaps most importantly, routinely holding debates where the AI "explains to you how to think about something" seems extremely dangerous...