I think there is an important lack of clarity and shared understanding regarding how people intend to use AI-Safety-via-Debate-style approaches. So I think it would be helpful if there were some people --- who either (i) work on Debate or (ii) believe that Debate is promising --- who could give their answers to the following three questions:

  1. What is, according to you, the purpose of AI Debate?
    What problem is it supposed to be solving?
  2. How do you intend AI Debate to be used?
    (E.g., to generate training data for imitation learning? During deployment, to generate solutions to tasks? During deployment, to check answers produced by other models?)
  3. Do you think that AI Debate is a reasonably promising approach for this?

Disclaimers: Please don't answer 1-2 without also commenting on 3. Also, note that this isn't an "I didn't understand the relevant papers, please explain" question -- I studied those and still have this question. Further clarifications in a comment.


I was previously unaware of Section 4.2 of the Scalable AI Safety via Doubly-Efficient Debate paper and, hurray, it does give an answer to (2). (Thanks for mentioning it, @niplav!) That still leaves (1) unanswered, or at least not answered clearly enough, imo. Also, I am curious to what extent other people who find debate promising consider this paper's answer to (2) to be the answer to (2).

For what it's worth, none of the other results I know about were helpful for understanding (1) and (2). (The things I know about are the original AI Safety via Debate paper, follow-up reports by OpenAI, the single- and two-step debate papers, the Anthropic 2023 post, the Khan et al. (2024) paper, and some more LW posts, including mine.) I can of course make some guesses regarding plausible answers to (1) and (2). But most of these papers are primarily concerned with exploring the properties of debates, not with explaining where debate fits in the process of producing an AI (and what problem it aims to address).

My non-answer to (2) would be that debate could be used in all of these ways, and the central problem it's trying to solve is sort of orthogonal to how exactly it's being used. (Also, the best way to use it might depend on the context.)

What debate is trying to do is let you evaluate plans/actions/outputs that an unassisted human couldn't evaluate correctly (in any reasonable amount of time). You might want to use that to train a reward model (replacing humans in RLHF) and then train a policy; this would most likely be necessary if you want low cost at inference time. But it also seems plausible that you'd use it at runtime if inference costs aren't a huge bottleneck and you'd rather get some performance or safety boost from avoiding distillation steps.
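To make those two usage modes concrete, here is a minimal sketch in Python. Everything in it (`run_debate`, the `pro`/`con` debaters, the `judge`, the threshold) is hypothetical scaffolding for illustration, not the protocol from any of the papers under discussion:

```python
from typing import Callable, List, Tuple

# A transcript is a list of (speaker, message) pairs.
Transcript = List[Tuple[str, str]]
Agent = Callable[[Transcript], str]
Judge = Callable[[Transcript], float]

def run_debate(question: str, answer: str,
               pro: Agent, con: Agent, judge: Judge,
               rounds: int = 3) -> float:
    """Two debaters argue for/against `answer`; the judge scores the
    finished transcript, e.g. as a probability that the answer is correct."""
    transcript: Transcript = [("question", question), ("answer", answer)]
    for _ in range(rounds):
        transcript.append(("pro", pro(transcript)))  # defends the answer
        transcript.append(("con", con(transcript)))  # attacks the answer
    return judge(transcript)

# Toy stand-ins so the sketch runs; in practice these would be LLMs
# (and possibly a human judge reading the transcript).
pro: Agent = lambda t: "argument for the answer"
con: Agent = lambda t: "argument against the answer"
judge: Judge = lambda t: 0.5

# Usage mode (a): training-time reward signal, replacing human raters in RLHF.
def debate_reward(question: str, answer: str) -> float:
    return run_debate(question, answer, pro, con, judge)

# Usage mode (b): deployment-time check on another model's output.
def checked_answer(question: str, policy: Callable[[str], str],
                   threshold: float = 0.9) -> str:
    answer = policy(question)
    if run_debate(question, answer, pro, con, judge) < threshold:
        raise RuntimeError("answer did not survive the debate check")
    return answer
```

The point of the sketch is that the evaluation procedure itself is the same in both modes; only where its output gets consumed (reward model vs. runtime filter) differs.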

I think the problem of "How can we evaluate outputs that a single human can't feasibly evaluate?" is pretty reasonable to study independently, agnostic to how you'll use this evaluation procedure. The main variable is how efficient the evaluation procedure needs to be, and I could imagine advantages to directly looking for a highly efficient procedure. But right now, it makes sense to me to basically split up the problem into "find any tractable procedure at all" (e.g., debate) and "if necessary, distill it into a more efficient model safely."

It's worth noting that in cases where you care about average-case performance, you can always distill the behavior back into the model. So, in my view, average-case usage is always equivalent to generating training or reward data.
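As a toy illustration of that equivalence: run the expensive evaluation offline, keep only the outputs it approves, and you are left with ordinary fine-tuning data. The `score` callable below is a hypothetical stand-in for any debate-based evaluation (e.g., the `debate_reward` sketch in the comment above):

```python
from typing import Callable, List, Tuple

def build_distillation_set(questions: List[str],
                           policy: Callable[[str], str],
                           score: Callable[[str, str], float],
                           threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Run an expensive evaluation offline and keep only approved answers.
    The result is ordinary supervised fine-tuning data, so a runtime use
    of debate collapses into a training-data use for average-case purposes."""
    data = []
    for q in questions:
        answer = policy(q)
        if score(q, answer) >= threshold:  # e.g., score = debate_reward
            data.append((q, answer))
    return data
```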

I do agree that debate could be used in all of these ways. But at the same time, I think generality often leads to ambiguity, and to papers not describing any such application in detail. And that in turn makes it difficult to critique debate-based approaches. (Both because it is unclear what one is critiquing, and because it makes it too easy to accidentally dismiss the critiques using the motte-and-bailey fallacy.)

Relevant here is Geoffrey Irving's AXRP podcast appearance. (If anyone has already linked this, I missed it.)

I think Daniel Filan does a good job there both in clarifying debate and in questioning its utility (or at least the role of debate-as-solution-to-fundamental-alignment-subproblems). I don't specifically remember satisfying answers to your (1)/(2)/(3), but figured it's worth pointing at regardless.

My impression was that people stopped seriously working on debate a few years ago.

ETA: I was wrong

I don't think this is true: there's this paper and this post from November 2023, and these two papers from April and October 2022, respectively.

(Note that I've only read the single-turn paper.)

The original people kind of did, but new people started, and Geoffrey Irving continued (or got back to) working on it.

Further disclaimer: Feel free to answer even if you don't find debate promising, but note that I am primarily interested in hearing from people who do actively work on it, or find it promising --- or at least from people who have a very good model of specific such people.

Motivation behind the question: People often mention Debate as a promising alignment technique. For example, the AI Safety Fundamentals curriculum features it quite prominently. But I think there is a lack of consensus on the question "as far as the proposal is concerned, how is Debate actually meant to be used?" (For example, do we apply it during deployment, as a way of checking the safety of solutions proposed by other systems? Or do we use it during deployment, to generate solutions? Or do we use it to generate training data?) And as far as I know, of all the existing work, only the Nov 2023 paper addresses my questions, and even it only answers (2). But I am not sure to what extent the answer given there is canonical. So I am interested in knowing the opinions of people who currently endorse Debate.

Illustrating what I mean by the questions: If I were to answer the questions 1-3 for RLHF, I could for example say that:
(1) RLHF is meant for turning a neural network trained for next-token prediction into, for example, an agent that acts as a chatbot and gives helpful, honest, and lawsuit-less answers.
(2) RLHF is used for generating training (or fine-tuning) data (or signal).
(3) Seems pretty good for this purpose, for roughly <=human-level AIs.