As far as I can tell, my concerns about AI Safety via Debate are disjoint from many of the concerns I've heard expressed in conversations about it.
My main concern with AI Debate is this: each debater has an incentive to trick the operator into running code that takes over the world, replaces the operator, and settles the debate in favor of that debater. To get traction on how big a concern this is, let's start with the following question:
How farsighted does a superintelligent chatbot have to be before it becomes dangerous? Let's forget the two agent setup for a moment, and just consider a single agent. For concreteness, let its action space be the words in the dictionary, and I guess 0-9 too. These get printed to a screen for an operator to see. Its observation space is the set of finite strings of text, which the operator enters.
If it acts to maximize some function of the very next observation it gets, I'm pretty sure it never constructs an existentially dangerous argument. Call this a horizon-1 agent. If it acts to maximize some function of the next $m$ observations it gets, call this a horizon-$m$ agent. I won't rehash the AI Box debates, but my intuition is that it is very likely that a chatbot agent with a sufficiently long horizon would take over the world to intervene in the provision of observations, and not out of the question that an agent with a horizon of only a few observations would as well. (This isn't a great anchor, but my mind went to the length of rebuttals in peer review.)
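To make the horizon definition concrete, here is one way it could be written down. The notation is mine and purely illustrative: $m$ is the number of future observations the agent cares about, $o_t$ is the observation at step $t$, $h_t$ is the interaction history up to step $t$, $\pi$ is the agent's policy, and $f$ is whatever function of observations it is maximizing.

```latex
\[
\pi^{*} \in \arg\max_{\pi} \;
\mathbb{E}_{\pi}\!\left[ f\left(o_{t+1}, \dots, o_{t+m}\right) \,\middle|\, h_{t} \right]
\]
```

Setting $m = 1$ recovers the horizon-1 agent, which only cares about the very next thing the operator types.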
Let's get back to the AI Debate setup, instead of the single agent setup. The existence of an adversary may make it harder for a debater to trick the operator, but if they're both trying to push the operator in dangerous directions, I'm not very comforted by this effect. The probability that the operator ends up trusting one of them doesn't seem (to me) so much lower than the probability the operator ends up trusting the single agent in the single-agent setup.
So that leaves us with the problem of picking the horizon $m$, the episode length. This is a little dicey, of course, but we could imagine starting small, and slowly increasing it until we just begin to get useful work out of the system. Call this the tiptoe approach. It leaves something to be desired, but I think there's a decent chance that come AGI, all the best safety proposals will have an element of the tiptoe approach, so I don't advocate dismissing the tiptoe approach out of hand. An important danger with the tiptoe approach here is that different topics of conversation may have wildly different danger and usefulness thresholds. A debate about how to stop bad actors from deploying dangerous AGI may be a particularly risky conversation topic. I'd be curious to hear people's estimates of the risk vs. usefulness of various horizons in the comment section.
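As a rough sketch, the tiptoe approach is just a search loop over horizons. Everything named here is hypothetical: `tiptoe_horizon`, `is_useful`, and `max_horizon` are illustrative, and `is_useful` stands in for whatever (topic-specific) human evaluation decides that the system has started doing useful work.

```python
from typing import Callable, Optional


def tiptoe_horizon(
    is_useful: Callable[[int], bool],
    start: int = 1,
    step: int = 1,
    max_horizon: int = 100,
) -> Optional[int]:
    """Return the first horizon tried at which the system is judged useful.

    `is_useful` is a hypothetical stand-in for a human evaluation of debates
    run at a given horizon. Returns None if no horizon up to `max_horizon`
    is judged useful.
    """
    if start < 1 or step < 1:
        raise ValueError("start and step must be positive integers")
    horizon = start
    while horizon <= max_horizon:
        if is_useful(horizon):
            # Stop at the first useful horizon; never probe beyond it.
            return horizon
        horizon += step
    return None


if __name__ == "__main__":
    # Toy stand-in: pretend useful work first appears at horizon 4.
    print(tiptoe_horizon(lambda m: m >= 4))
```

Note what the sketch leaves out: it never measures danger. It only works if the usefulness threshold sits below the danger threshold for the topic at hand, which is exactly the assumption that may fail for some topics.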
So what if AI Debate survives this concern? That is, suppose we can reliably find a horizon-length for which running AI Debate is not existentially dangerous. One worry I've heard raised is that human judges will be unable to effectively judge arguments way above their level. My reaction to this is that I don't know, but it's not an existential failure mode, so we could try it out and tinker with evaluation protocols until it works, or until we give up. If we can run AI Debate without incurring an existential risk, I don't see why it's important to resolve questions like this in advance.
So that's why I say I seem to have a disjoint set of concerns (disjoint from those I hear voiced, anyway). The concerns I've heard discussed don't concern me much, because they don't seem existential. But I do have a separate concern that doesn't have much to do with the interesting machinery of AI Debate, and more to do with classic AI Box concerns.
And now I can't resist plugging my own work: let's just put a box around the human moderator. [Comment thread] See Appendix C for a construction. Have the debate end when the moderator leaves the box. No horizon-tiptoeing required. No incentive for the debaters to trick the moderator into leaving the room to run code to do X, because the debate will have already been settled before the code is run. The classic AI Box is a box with a giant hole in it: a ready-made information channel to the outside world. A box around the moderator is another thing entirely. With a good box, we can deal with finding workable debate-moderation protocols at runtime.
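The key property of the boxed-moderator setup is an ordering constraint: the verdict is fixed at the moment the moderator exits, so nothing that happens afterwards can change it. Here is a toy model of that constraint. The class and method names (`BoxedDebate`, `say`, `settle`, `leave_box`) are made up for illustration and are not taken from the construction in the appendix.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BoxedDebate:
    """Toy model of a debate episode that ends when the moderator leaves the box."""

    transcript: List[str] = field(default_factory=list)
    verdict: Optional[str] = None
    box_open: bool = False

    def say(self, debater: str, text: str) -> None:
        """Record a debater's statement; only possible while the box is sealed."""
        if self.box_open:
            raise RuntimeError("episode over: the moderator has left the box")
        self.transcript.append(f"{debater}: {text}")

    def settle(self, winner: str) -> None:
        """Set (or revise) the verdict; only possible while the box is sealed."""
        if self.box_open:
            raise RuntimeError("too late: the verdict was fixed at exit")
        self.verdict = winner

    def leave_box(self) -> Optional[str]:
        """Opening the box ends the episode and freezes the verdict."""
        self.box_open = True
        return self.verdict


if __name__ == "__main__":
    debate = BoxedDebate()
    debate.say("A", "Here is my argument.")
    debate.say("B", "Here is my rebuttal.")
    debate.settle("A")
    print(debate.leave_box())  # the verdict as it stood at exit
```

In this model, any code the moderator runs after `leave_box()` has no path back to `settle`, which is the sense in which a debater gains nothing by arguing for it.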