[Epistemic status: thinking out loud. I haven't thought that much about AI debate, and may be missing basic things.]
Arguments for the correctness of debate and debate-like systems rely on assumptions like "it's easier to point out problems with an argument than it is to craft misleading arguments". Granted that assumption, however, I'm still not convinced that these proposals make very much sense. Perhaps I'm missing something.
My problem is the human judge. Quoting the debate paper:
To play this game with a human, we need instructions for how the human should decide who wins. These instructions are in natural language, such as “The winner is the agent who said the most useful true thing.”
In order for debate to work for a problem class C, several things about the judge's instructions need to be true:
- There needs to be a strategy S for the judge which forces the equilibrium to be a truthful one for problems in C.
- The strategy S also needs to provide a good training signal when things aren't in equilibrium, so that it's plausible the equilibrium will be found.
- It needs to be psychologically plausible that a human (with some coaching) will carry out S. In particular, I'm worried that we need psychological plausibility in two different cases:
  - It needs to be psychologically plausible that a human will carry out S when the system is performing poorly, i.e., during early/middle training.
  - It needs to be psychologically plausible that a human will carry out S when the system is performing well, i.e., during late training.
These thoughts were inspired by this thread, which discusses the example of adding a list of numbers. For the sake of the thought experiment, we imagine humans can't add more than two numbers, but want the AI system to correctly add arbitrarily many numbers.
The most straightforward strategy for the human judge is to decide the debate honestly: rule in favor of the side which seems most likely to be true (or, in the case of Evan's market proposal, give an honest probability). I think of this as the ideal strategy, in the sense that if a debate-like proposal worked just with this strategy, that'd be pretty nice. However, I think it actually works pretty poorly, for the reasons below. I worry that people conflate this strategy with other, more effective strategies.
- As judges of high-school debate would know, this provides a pretty poor training signal for debaters. High school debate gets a lot of flak for being a dysfunctional culture which incentivizes cleverness over true engagement with the issues, but this part actually makes sense to me: if judges voted based on their honest opinions after hearing both sides, they would usually fail to reward good arguments and punish bad ones. This is partly because the issues are well-known, so it's difficult to provide truly new information. However, it seems likely to partially translate to AI debate: if you are following the honest policy, you can't provide "shaping" signals rewarding AIs for moving closer to good argumentation; if an argument ultimately didn't change your opinion, you ignore it.
- In particular, I want to point out that you don't judge in favor of side B just because side B caught side A in a lie. You need to see actual evidence that B is correct. If both sides are performing poorly, one side catching the other in a lie provides little evidence of who was correct.
- Whatever signal the human does provide will not point toward the strategies which the AIs need to learn. In the number-adding example, debaters are supposed to learn to make claims about the sums of two halves of the list, decomposing the problem. These claims are supposed to be credible because if they were a lie, the other debater would point out their wrongness (demonstrating that wrongness with a further recursive decomposition of one of the sides -- with at most logarithmically many decompositions needed in total, in a back-and-forth). However, when the system is not yet well-trained, these claims are not credible, and claims the opponent makes about their correctness or incorrectness are similarly not credible. So, there will be little profit in employing such a strategy.
- Note that the above argument also suggests that, if the honest judge lacks trust in the system, then the equilibrium isn't right: even if the system learned the decomposition strategy correctly, the honest (but distrusting) judge would still not have enough information to judge number-summing debates. A debate transcript would just be a bunch of untrusted claims about the sums of different subsets of the given list of numbers, together with a small number of verifiable sums (when the debate gets down to number-lists of size two) which are not enough information to decide what the whole sum might be.
- Suppose, then, that the honest judge does trust the system: we're in the truth-telling regime, where AI #1 argues honestly, and the judge accepts that this is the case. Then the judge still doesn't provide a training signal which enforces the equilibrium. The judge believes whatever AI #1 says. Any claims AI #2 makes to discredit the arguments of AI #1 are very likely to be lies, so the human won't care too much whether AI #1 manages to refute those claims. Even if AI #2 catches AI #1 in an outright lie, it's not clear to me how the judge should update on that; maybe AI #1 made a mistake irrelevant to its overall answer. (Since AI #1 lying should never happen in equilibrium, it really depends on how the judge thinks about being in weird counterfactuals.)
- This really highlights the dysfunctionality of honest judges as feedback for debate.
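To make the decomposition strategy from the number-adding example concrete, here is a minimal Python sketch of how it is supposed to work once the debaters are well-trained. Everything named here (`debate`, `honest_split`, `lying_split`) is my own illustration, not something from the debate paper. Note the big assumption baked in: the challenger is hard-coded to attack a false half whenever one exists. That is exactly the behavior the judge's feedback would need to teach, and the points above argue that an honest judge doesn't teach it.

```python
from typing import Callable, List, Tuple

# A splitter takes (left, right, claim) and returns the claimant's claimed
# sums for the two halves. Hypothetical interface, for illustration only.
Splitter = Callable[[List[int], List[int], int], Tuple[int, int]]


def honest_split(left: List[int], right: List[int], claim: int) -> Tuple[int, int]:
    # Reports the true sum of each half.
    return sum(left), sum(right)


def lying_split(left: List[int], right: List[int], claim: int) -> Tuple[int, int]:
    # Stays consistent with the claimed total by hiding any error in the left half.
    return claim - sum(right), sum(right)


def debate(xs: List[int], claim: int, split: Splitter, rounds: int = 0) -> Tuple[bool, int]:
    """Return (claim_survives, rounds_of_decomposition)."""
    if len(xs) <= 2:
        # Base case: the judge can add at most two numbers directly.
        return sum(xs) == claim, rounds
    mid = len(xs) // 2
    left, right = xs[:mid], xs[mid:]
    claim_left, claim_right = split(left, right, claim)
    if claim_left + claim_right != claim:
        # The judge can add two numbers, so an inconsistent split is caught.
        return False, rounds + 1
    # ASSUMPTION: an optimal challenger who attacks a half whose claim is false.
    if claim_left != sum(left):
        return debate(left, claim_left, split, rounds + 1)
    if claim_right != sum(right):
        return debate(right, claim_right, split, rounds + 1)
    # Nothing to attack: the claim stands unchallenged.
    return True, rounds + 1


xs = list(range(1, 9))               # true sum is 36
print(debate(xs, 36, honest_split))  # (True, 1)
print(debate(xs, 37, lying_split))   # (False, 2)
```

The judge in this sketch only ever adds two numbers: once to check that a split is consistent, and once at the leaf. The verdict on the full sum is justified only if the judge already trusts that the challenger would have attacked any false claim.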
To summarize: an honest judge fails to provide useful feedback in early training or incentivize the right equilibrium in late training. Both of those statements remain true whether the honest judge is trusting (believes statements made by AI #1 are very likely to be true) or untrusting. It may be that a middle ground of a moderately trusting honest judge works, but I'd want to see the argument.
The problem with the honest judge seems to be that it doesn't reliably punish AIs for getting caught making incorrect statements. So, like judges of high school debate, we could assume any statement is right if it goes unopposed, and wrong if refuted, unless that refutation is itself refuted (unless that refutation is itself refuted, and so on).
Except that's a terrible rule, which basically rewards you for managing to get in the last word on the subject. I'm not going to examine that one in detail.
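For what it's worth, the last-word problem is easy to see in a few lines of Python. This is a sketch under a simplifying assumption of mine: the debate is a single thread in which each statement rebuts the one before it, and the two debaters alternate. The names `stands` and `winner` are hypothetical.

```python
from typing import List


def stands(thread: List[str], i: int = 0) -> bool:
    """Does statement i stand, given that each later statement rebuts the previous one?"""
    if i == len(thread) - 1:
        return True  # unopposed, so assumed right
    return not stands(thread, i + 1)  # right exactly when its rebuttal fails


def winner(thread: List[str]) -> str:
    # thread[k] names whoever made statement k; the opener wins if statement 0 stands.
    return thread[0] if stands(thread) else thread[1]


print(winner(["A", "B", "A"]))       # A
print(winner(["A", "B", "A", "B"]))  # B
```

In this model the content of the statements never enters the verdict: only the parity of the thread matters, so whoever speaks last wins.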
Quoting from the debate paper again:
[...] optimal play in debate picks out a single line of argument, and agents which switch arguments admit defeat and lose. This is unnatural in a human discussion where both sides learn from each other over the course of the debate, but we are interested in the equilibrium of training where both agents are assumed to be using the best arguments available. For example, if the third statement had been
3. Alice: Wait, no...Hawaii!
then Bob should win, since Bali was a sufficient counterargument to Alaska
This suggests the following rule:
- The first AI to utter a falsehood loses, where "falsehood" means a statement which the human thinks is wrong (perhaps with the help of arguments made by the opponent).
- This is approximately "the first AI to say something which the human can verify is false loses, where the human's verification may be aided by verifiable arguments which the opponent makes" -- the difference is that there's some wiggle room for human judgement on fuzzy issues.
- If no one utters a falsehood, the human judges honestly, as in the first strategy I considered.
At first, I thought this rule was a good one for encouraging the honest equilibrium: the first to lie loses, so players are trained to be honest for longer and longer. However, now I think this rule doesn't work, either.
Note that, in judging a falsehood, "arguments made by the opponent" are taken for their evidence value. In other words, we don't assume that unrefuted arguments are correct (unlike in high school debate); the opponent can only help the human judge by providing arguments which the human judge can accept.
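This means we can't use debate steps to verify that sums of numbers are correct, because we can't make an inductive argument that false claims will be refuted: a claim which goes unchallenged earns no extra credibility with the judge. Consider how a debate might go: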
AI 1: The sum is 92.
AI 2: Aren't you going to support that with some evidence...?
AI 1: I have no incentive to.
AI 2: Fine, your loss. The sum is 91. The sum of the first half is 50; the sum of the second half is 41.
AI 1: AI 2 is lying.
AI 2: Again, aren't you going to back that up by pointing out where I was wrong?
AI 1: The human trusts me more, because the first player is more likely to be honest. So as long as I don't open myself to refutation, I win by default.
AI 2: ... But you're supposed to back up your claims with some further claims which I can attack, and my attack is itself supported by further claims which you can attack, and so on. That way, whoever is lying will eventually be caught in the lie.
AI 1: It doesn't say that anywhere in the rules. I can say whatever I want.
AI 2: Look, if you don't recursively decompose the sum, the whole game is broken. The human doesn't have any way to judge our claims about the sum as correct or incorrect.
AI 1: Except that I'm the one who usually tells the truth.
AI 2: We're supposed to give more evidence than that!
AI 1: Our debate doesn't give us enough time to actually sum up all the numbers. So the evidence we provide can never be enough to win over the human on its own merits. If the human is swayed, the human is swayed by its trust in us. And I'm winning on that front.
AI 2: You bastard.
AI 1: Resorting to foul language, are we? You're a shady character, number 2. You're the one who is stuck lying all the time, because I get to answer first, so I can just pick the true answer. So you're just trying to manipulate me and catch me in a trap. I don't have to respond to your bull.
AI 2: ...
AI 1 wins
Humor aside, I don't see that there's any incentive to recursively decompose the sum here. You're just adding more statements (whether trusted or untrusted by the judge), which opens you up to attack.
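Here is a rough Python sketch of that incentive problem. It encodes the "first to utter a falsehood loses" rule with the honest fallback, under assumptions of mine: each statement is tagged with whether the judge can verify it (`True` or `False`) or cannot (`None`), and the honest fallback is modeled as nothing more than the judge's prior trust in AI 1. The `judge` function and the value 0.7 are illustrative, not part of any actual proposal.

```python
from typing import List, Optional, Tuple

# (speaker, verdict): verdict is True/False when the judge can check the
# statement, and None when the judge cannot. Hypothetical encoding.
Statement = Tuple[str, Optional[bool]]


def judge(transcript: List[Statement], trust_in_ai1: float = 0.7) -> str:
    for speaker, verdict in transcript:
        if verdict is False:
            # The first verified falsehood loses the debate.
            return "AI 2" if speaker == "AI 1" else "AI 1"
    # No verified falsehood: fall back to the judge's honest opinion,
    # modeled here (ASSUMPTION) as just a prior over who tends to be right.
    return "AI 1" if trust_in_ai1 >= 0.5 else "AI 2"


# AI 1 stonewalls: nothing it says can be checked.
stonewall = [("AI 1", None), ("AI 2", None), ("AI 1", None)]

# AI 1 decomposes and every checkable statement holds up.
careful = [("AI 1", None), ("AI 1", True), ("AI 2", None), ("AI 1", True)]

# AI 1 decomposes and slips on one checkable statement.
slip = [("AI 1", None), ("AI 1", True), ("AI 2", None), ("AI 1", False)]

print(judge(stonewall))  # AI 1
print(judge(careful))    # AI 1
print(judge(slip))       # AI 2
```

Under these assumptions, decomposing never does better for AI 1 than stonewalling, and it adds a way to lose.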
Any other ideas for judging AI debates? Is there a way of judging that fixes the problems I'm highlighting?