[This post is an almost direct summary of a conversation with John Wentworth.]
Note: This post assumes that the reader knows what AI Safety via Debate is. I don't spend any time introducing it or explaining how it works.
The proposal of AI safety via debate depends on a critical assumption: that "in the limit of argumentative prowess, the optimal debate strategy converges to making valid arguments for true conclusions."
If that assumption is true, then a sufficiently powerful AI-debate system is a safe oracle (barring other possible issues that I'm not thinking about right now). If this assumption is false, then the debate schema doesn't add any additional safety guarantee.
Is this assumption true?
That's hard to know, but one thing that we can do is compare to other, analogous, real-world cases, and see if analogous versions of the "valid arguments for true conclusions" assumption hold. Examining those analogous situations is unlikely to be definitive, but it might give us some hints, or more of a handle on what criteria must be satisfied for the assumption to hold.
In this essay I explore two analogs: an attempted counterexample and an attempted example.
Attempted counterexample: Public discourse
One analogous situation in which this assumption seems straightforwardly false is in the general discourse (on twitter, in "the media", around the water cooler, etc.). Very often, memes that are simple and well fit to human psychology, but false, have a fitness advantage over more complicated ideas that are true.
For instance, the core idea of minimum wage can be pretty attractive: lots of people are suffering because they have to work very hard, but they don't make much money, sometimes barely enough to live on. That seems like an inhumane outcome, so we should mandate that everyone be paid at least a fair wage.
This simple argument doesn't hold, but the explanation for why it is false is a good deal longer than the initial argument. To show what's wrong with it, one has to back up and explain the basic principles of supply and demand, and demonstrate that those principles apply in this case. You might have to write a slim textbook to make the counterargument in a compelling way.
And in the actual real world, it seems like false-but-simple-and-attractive ideas very often win out over true ones.
This seems like it doesn't bode well for AI safety via debate, but it isn't decisive. It could be that the Debate mechanism is more truth-tracking than public discourse. At each round of Debate, one of the debaters makes the single argument that is most likely to be persuasive to the judge. Perhaps that continual zeroing in on the most alive line of argument is truth-tracking.
However, I'll observe that sometimes a longer explanation is needed to demonstrate a true conclusion than to (falsely) demonstrate a false one, which is suggestive that this isn't the case. The longer the explanation required to make a case, the more surface area there is to attack. An argument is only as strong as its weakest step (a min over the steps), and the more steps there are, the weaker that weakest step is likely to be. If defending the truth requires writing an economics textbook, then the fraudulent debater has a whole textbook's worth of material to nit-pick. And if at every step the debater arguing for a true conclusion needs to produce a longer explanation than the fraudulent debater, this suggests that the fraudulent debater has a systematic advantage over the truthful one.
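To make the "surface area" point concrete, here is a toy sketch in Python. It is my own illustration, not part of the Debate proposal: it assumes each step of an argument independently survives a nit-pick with some fixed probability, which is a strong simplification. The function names and the numbers (0.99, 3 steps, 300 steps) are hypothetical.

```python
def argument_strength(step_strengths):
    """Model an argument as only as strong as its weakest step (a min)."""
    return min(step_strengths)


def survival_probability(p_step, n_steps):
    """Chance that every step survives scrutiny.

    Assumption (not from the post): each step independently survives
    a nit-pick with probability p_step.
    """
    return p_step ** n_steps


# Hypothetical numbers: every individual step is 99% robust.
short_case = survival_probability(0.99, 3)    # a three-step slogan
long_case = survival_probability(0.99, 300)   # a textbook-length rebuttal

print(f"short argument survives: {short_case:.3f}")
print(f"long argument survives:  {long_case:.3f}")
```

Under these assumptions, the side that needs more steps is worse off even when every one of its steps is individually very solid.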
(Note that calling one of the AIs "the truthful debater" is already a bit of an overreach. It assumes that when investigating a yes-or-no question, one side will adopt the true position and the other side will adopt the false position, instead of both sides adopting different false positions. It further assumes that the optimal strategy, if one is defending a true position, is to use correct arguments.)
Attempted example: An efficient market
What about an example where there is some kind of truth-guarantee?
One thing that comes to mind is efficient markets. In the limit of efficiency, the price of a good is always equal to the marginal cost of production of that good.
This is pretty promising; there's a nice symmetry with AI Safety via Debate.
- In an efficient market, prices reflect costs, because if any given economic agent deviates from that (setting prices too high) others in the market will undercut them and steal their business. In Debate, each debater AI always makes the best possible argument that it can, otherwise its opponent will take advantage of the lapse.
- As more participants join the market (or as the participants in a market become more competent) prices approximate costs better and better. This seems analogous to "turning up the capabilities dial" on the AI systems in Debate: the more capable they are, the better the arguments they produce and the closer the debate trends towards the "optimal debate tree."
It seems like the analogy is pretty tight. And it suggests some kind of guarantee. But of what exactly?
One question that is useful here is: what happens in an efficient market when the consumers are systematically biased?
For concreteness, let's say that many consumers systematically mis-predict themselves: each one will actually get n utility from a marginal lollypop, but they robustly expect to get 3 * n utility instead.
In this world, demand for lollypops is higher than it would otherwise be, and supply rises to meet that demand. Given time to equilibrate, the price of a lollypop equals the marginal cost of producing a lollypop.
But, because the consumers are mistaken about their own preferences, the market is not reflective of human values.
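Here is a toy sketch of the lollypop market in Python, under assumptions of my own: the price is pinned at a hypothetical marginal cost of 1.0, each consumer buys at most one lollypop, and consumers differ in their true utility n. Only the 3 * n bias comes from the example above; every other name and number is made up for illustration.

```python
MARGINAL_COST = 1.0  # hypothetical; in the efficient limit, price equals this
BIAS = 3.0           # consumers expect 3 * n utility, as in the example

# Hypothetical true utilities n, one per consumer.
TRUE_UTILITIES = [0.2, 0.4, 0.6, 0.8, 1.2, 1.5]


def purchases(true_utilities, price, bias):
    """Consumers buy whenever their *expected* utility covers the price."""
    return [n for n in true_utilities if bias * n >= price]


def realized_surplus(true_utilities, price, bias):
    """Sum of (true utility - price) over everyone who actually bought."""
    return sum(n - price for n in purchases(true_utilities, price, bias))


price = MARGINAL_COST  # the efficiency guarantee holds in both worlds

unbiased_buyers = purchases(TRUE_UTILITIES, price, bias=1.0)
biased_buyers = purchases(TRUE_UTILITIES, price, bias=BIAS)

print("buyers without bias:", len(unbiased_buyers))
print("buyers with bias:   ", len(biased_buyers))
print("surplus without bias:", realized_surplus(TRUE_UTILITIES, price, 1.0))
print("surplus with bias:   ", realized_surplus(TRUE_UTILITIES, price, BIAS))
```

The price is the same in both worlds, so the market is "efficient" either way. What changes is who buys, and whether they are actually better off for having bought.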
Analogously, in Debate, having a biased judge doesn't change the basic fact that each debater is efficient, in the sense that the argument that they make in each round is the optimally persuasive argument. If greater win probability can be gained by making argument B instead of argument A, then the debater will make argument B.
But in the same way that efficient market prices are not a guarantee that the market will reflect human wants if consumers are biased, efficient Debate arguments are not a guarantee of truth-tracking if the judge is biased. For sufficiently savvy debaters, any diff between what's true and what's persuasive will be exploited.
In order for the crucial assumption of debate to hold robustly, there have to be no cases in which a human judge would be more receptive to a false or invalid line of argument than to a correct counterargument. For all situations where that's the case, in the limit of capabilities, the Debate process deceives or manipulates the human judge.
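A minimal sketch of that failure mode, in Python. This is a cartoon of my own rather than the actual Debate training setup: I model an "efficient" debater as simply picking whichever available argument the judge finds most persuasive. The `Argument` class, the `best_response` function, and the persuasiveness numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    label: str
    is_valid: bool
    persuasiveness: float  # hypothetical: chance this judge is convinced


def best_response(arguments):
    """An efficient debater picks the most persuasive argument, valid or not."""
    return max(arguments, key=lambda a: a.persuasiveness)


# A judge for whom validity and persuasiveness happen to line up.
calibrated_judge = [
    Argument("short but wrong", is_valid=False, persuasiveness=0.4),
    Argument("long but right", is_valid=True, persuasiveness=0.6),
]

# A judge who prefers the simple, attractive, false argument.
biased_judge = [
    Argument("short but wrong", is_valid=False, persuasiveness=0.7),
    Argument("long but right", is_valid=True, persuasiveness=0.3),
]

print(best_response(calibrated_judge).label)
print(best_response(biased_judge).label)
```

The debater's selection rule is identical in both cases. Whether its output is true depends entirely on whether the judge's persuasiveness ranking agrees with validity.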
It might be the case that such situations are rare enough that they aren't of practical relevance. But from what I know of human psychology and human rationality, that doesn't sound right to me. And regardless, it doesn't seem knowably, reliably true enough that I'd want to make it a key assumption of an alignment proposal.
I only spent a few minutes thinking about debate, here. But on first pass, I feel pretty pessimistic about this working.
Though there may be more sophisticated arguments for the minimum wage that do.
This makes most sense if n is a function of something, like how many lollypops you've eaten recently, rather than a static scalar.
Assuming that the debaters were trained with a biased judge, too. (Or have learned to adapt to whichever judge is judging their debate.) Otherwise they'll make arguments that are less than efficient, for failing to account for the judge's idiosyncrasies.