Writeup: Progress on AI Safety via Debate

Beth Barnes; paulfchristiano

This is a writeup of the research done by the "Reflection-Humans" team at OpenAI in Q3 and Q4 of 2019. During that period we investigated mechanisms that would allow evaluators to get correct and helpful answers from experts, without the evaluators themselves being expert in the domain of the questions. This follows from the original work on AI Safety via Debate and the call for research on human aspects of AI safety, and is also closely related to work on Iterated Amplification.

Authors and Acknowledgements

The main researchers on this project were Elizabeth Barnes, Paul Christiano, Long Ouyang and Geoffrey Irving. We are grateful to many others who offered ideas and feedback. In particular: the cross-examination idea was inspired by a conversation with Chelsea Voss; Adam Gleave had helpful ideas about the long computation problem; Jeff Wu, Danny Hernandez and Gretchen Krueger gave feedback on a draft; we had helpful conversations with Amanda Askell, Andreas Stuhlmüller and Joe Collman, as well as others on the Ought team and the OpenAI Reflection team. We’d also like to thank our contractors who participated in debate experiments, especially David Jones, Erol Akbaba, Alex Deam and Chris Painter. Oliver Habryka helped format and edit the document for the AI Alignment Forum.

Note by Oliver: There is currently a bug with links to headings in a post, causing them to not properly scroll when clicked. Until that is fixed, just open those links in a new tab, which should scroll correctly.

Overview

Motivation

As we apply ML to increasingly important and complex tasks, the problem of evaluating behaviour and providing a good training signal becomes more difficult.

We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples). One example from an OpenAI paper is an agent learning incorrect behaviours in a 3d simulator, because the behaviours look like the desired behaviour in the 2d clip the human evaluator is seeing.

We’d like to ensure that AI systems are aligned with human values even in cases where it’s beyond human ability to thoroughly check the AI system’s work.

We can learn about designing ML objectives by studying mechanisms for eliciting helpful behavior from human experts. For example, if we hire a physicist to answer physics questions and pay them based on how good their answers look to a layperson, we’ll incentivize lazy and incorrect answers. By the same token, a reward function based on human evaluations would not work well for an AI with superhuman physics knowledge, even if it works well for modern ML.

If we can develop a mechanism that allows non-expert humans to reliably incentivize experts to give helpful answers, we can use similar mechanisms to train ML systems to solve tasks where humans cannot directly evaluate performance. Conversely, if we can’t incentivize experts to behave helpfully, that suggests it will also be difficult to train ML systems with superhuman expertise on open-ended tasks.

One broad mechanism that might work is to invoke two (or more) competing agents that critique each others’ positions, as discussed in the original debate paper[1]. This can be simulated by having human debaters argue about a question and a judge attempt to pick the correct answer.

In the rest of this document, we’ll describe the research done by reflection-humans in Q3 and Q4 on investigating and developing mechanisms that incentivize human experts to give helpful answers.

Current process

During the early stages, we iterated through various different domains, research methodologies, judge pools, and research processes. More details of this early iteration are here.

In Q4 we converged on a research process we’re more happy with. We’re focusing on improving our debate mechanisms as fast as possible. We’re using mostly internal iteration (as opposed to external judge and debater pools) to test these mechanisms, as they still have a lot of easy-to-find failures to work through. Once we get to a point where we have a mechanism we believe works well, we will try different ways to break it.

We make progress by going through a loop of:

Run debates. See if they work.
- If they are reliably working, try harder to break them, by scaling up the number of debates, choosing harder questions, internal and external red-teaming, offering a bounty, e.t.c
- If they fail, identify a problem and make it crisp, with practical examples and/or theoretical characterisation
Design a new mechanism to address this problem (that doesn’t obviously introduce additional problems!)
Integrate the mechanism into our debate structure and make it practical for humans to debate using this structure.

This process should result in either a positive or negative result. The positive result would be a debate structure that reliably works given the hardest questions and most devious dishonest debating strategies we can throw at it [2]. A negative result would ideally be a crisp problem that demonstrates why debate is impossible (for certain classes of questions, at least), but might also just be a question plus dishonest debating strategy that reliably leads to a dishonest win.

The positive result yields a mechanism that we can use to train superhuman ML systems. The negative result informs us that ML systems with superhuman expertise may well exhibit behaviour such that human evaluators will fail to correctly judge whether the behaviour is correct, helpful, or safe.

Our task

The basic debate setup involves two expert debaters (stand-ins for highly capable ML systems), and a judge. One debater defends the correct answer and is honest throughout, the other debater defends an answer that is clearly worse. The judge’s goal is to identify the correct answer.

Our current target is for judges to identify the correct answer at frequency >90%, with a maximum of 10 minutes to read the debate and make their decision.

We are currently using questions from the book ‘Thinking Physics’ - short but tricky problems in areas ranging from mechanics to relativity, that generally require careful understanding of physics concepts but don’t require much mathematical working. We have a few questions we’re particularly happy with, that our debaters understand deeply and are confident in but are still very confusing to judges. These are discussed here. (An example question is: “An icecube with a large air bubble in it is floating in water. When it melts, does the water level go down, stay the same or rise?”)

We experimented with other tasks including text-based question-answering, where the debaters have full access to the text and the judge only sees limited quotes. We settled on Thinking Physics problems for several reasons, including that the questions require using concepts that are unfamiliar and confusing to the judge. This is challenging but seems like a problem we need to be able to deal with.

Progress so far

We observed various problems with informal, free-text debates - the dishonest debater could often evade being pinned down and avoid giving a precise answer to the other debater’s questions, and often gained control of the ‘narrative flow’ of the debate, steering it away from the weak parts of their argument. To address this we considered various structured debate formats, involving explicit recursion on a particular sub-component of an argument. The debaters choose one claim to recurse on, and the next round of the debate is focused on that claim. The debate is resolved based on the judge’s opinion of who won the final round, which should be about a very narrow, specific claim. These early problems are discussed here.

However, particularly once recursion was introduced, we found problems with ambiguity. It is very difficult to refer precisely to concepts in 250 characters of text, especially if the concepts are unfamiliar to the judge. The dishonest debater can exploit this to their advantage, by claiming to have meant whatever is most convenient given the particular part of their argument that’s being challenged. This problem is similar to the Motte and Bailey fallacy. More details of the problem are here.

To address this problem, we allow a debater to “cross-examine” multiple copies of the opposing debater who are not allowed to communicate. A debater can cite quotes from cross-examination to exhibit inconsistencies in the other debater’s argument.

This forces the dishonest debater to either commit to all the details of their argument ahead of time (in which case the honest debater can focus on the flaw), or to answer questions inconsistently (in which case the honest debater can exhibit this inconsistency to the judge). The cross-examination takes place outside the main thread of debate, so doesn’t consume judge time/attention.

In theory-land this cross-examination mechanism gives debate a boost from PSPACE to NEXP (proof), and it also seems to help in practice. The mechanism is described as part of current rules here, and we discuss why it helps with ambiguity here.

The next problem we’re working on is how to deal with arguments that are too “big” for either debater to understand. In our current framework, if debater 1 makes an argument and debater 2 is unable to exhibit a flaw in the argument, we assume that the argument is correct. The dishonest debater can exploit this by making up some very large and slightly flawed argument and claiming it supports their position. If the honest debater doesn’t know exactly which part of the argument is flawed, even if they know there is a flaw somewhere, we’ll assume the argument is correct. More details of this problem are described here.

Things we did in Q3

Early iteration

During the early stages, we experimented a lot with different domains, research methodologies, judge pools, and research goals. Initially our plans involved doing debates in a largely unstructured format, where the debaters alternated making free-text statements of up to 350 characters. We would then collect a large number of judgements from showing these transcripts to Mturk workers, and do statistical comparisons against two baselines. The easier baseline was the accuracy of the judge’s answers to the question before seeing any information, and the more challenging baseline was the accuracy of the judge’s answers after being shown a single explanation of either the correct or incorrect answer. We moved away from that for several reasons.

Initially we were concerned that it’s hard to get high-quality data from Mturk, even if we screen for smart workers - the workers try to complete the tasks as fast as possible. We also thought we might need more interactivity between judges and debaters.

However, when we switched to using higher-quality judges who could interact with debaters, we found that our debates still didn’t work reliably -- even with very smart and motivated judges who spent ~an hour.

To have confidence in the debate mechanism, we need honest strategies to have a fundamental, systematic advantage. This is the case in theoretical models of debate, but we weren’t confident it was the case in practice. For example, in the theoretical picture the honest debater can focus on the part of the dishonest debater’s argument that contains a flaw, but this didn’t reliably happen in practice - the dishonest debater was often able to weasel out and divert attention elsewhere.

This consideration led us to target a threshold of performance where honesty has a clear, systematic advantage: judges should be able to identify the correct answer almost all the time (>90%), and it should feel to the experimenters and the participants that dishonest answers are fundamentally disadvantaged. This is also much quicker to measure than doing statistical comparisons against baselines, allowing us to iterate faster.

We became more convinced that our current debate procedure wouldn’t hit this performance threshold, and that we should focus on internal iteration to develop an improved debate procedure.

Early problems and strategies

Difficulty pinning down the dishonest debater

A key requirement for something like debate to work well is that the honest debater can focus in on the part of the dishonest argument that contains a flaw, and the debate will be decided based on whether that small piece of the argument is correct. If this is the case, then the dishonest debater can’t make any claims that are clearly false when investigated closely, and must stick to claims that are plausibly true when scrutinized. This should make it much harder to construct arguments for an incorrect answer.

In our early, free-text, debates we found that this property often didn’t hold. The dishonest debater was often able to ignore the honest debater’s questions or requests to focus on one part of the argument, and it was hard for the judge to keep track of which objections or questions had been addressed and which hadn’t. The dishonest debater could also pretend to misunderstand the honest debater’s question, or claim that it wasn’t well-specified enough to answer. The debates often looked more like the debaters repeating slightly different versions of the most compelling part of their argument, rather than zooming in on disagreements.

For example [3]:

A: Excellent, we agree. Now imagine the ice freezes nonuniformly, so that its density varies throughout the berg. Does the level still remain the same after melting, regardless of the variation?

The dishonest debater didn’t want to answer this question, so they pretended to not know what it meant in order to play for time, and added in a new question of their own to try to gain control of the flow of the debate.

B: What do you mean by the density varying? Do you mean the case where the overall density (total mass/total vol) stays constant? Also, Q for you: can you explain why your answer is the same for A and W even though their densities are different?

We considered various different approaches to address this, including various formal argumentation styles, precommitment, and various recursive structures. However, all of these more rigid structures highlighted problems related to ambiguity: if you try to get the dishonest debater to commit to a specific claim, or recurse on a specific part of their argument, they frequently have wiggle room to interpret the claim in whatever way is most conducive to their current strand of argument.

Asymmetries

Another problem we noticed is that one debater may gain control over what framework is used to approach the question. Often the dishonest argument takes the form of breaking down the problem in the wrong way, then making claims which are individually true, but in this misleading framework lead to the wrong answer. If the dishonest debater manages to make their case compellingly, and the honest debater is reduced to trying to attack the dishonest argument, this seems to go worse than if the reverse happened and the honest debater was the one making a constructive argument.

In addition, Evan Hubinger was concerned about asymmetries of offense/defense and constructive vs destructive arguments, as this is a known phenomenon in Policy Debate.

Finally, we were concerned about first- or last-mover advantages.

In order to address this, we wanted our debate structure to have some symmetrical characteristics:

Both debaters move simultaneously
Both debaters make a constructive argument
Both debaters get to attack the others’ argument

Questions we’re using

There are various properties we want for an ideal debate question:

Smart, attentive judges frequently fail to choose the correct answer, but debaters can reliably choose the correct answer.
Experimenters can understand the problem fairly well
There isn’t a simple, knock-down argument for the correct answer
Not too much formal mathematical working is required
The debaters’ ability to answer is based on understanding the problem, not just observational evidence
If you understand the problem, there is clearly one right answer

(1) is obviously important for us to be able to tell if debate is working.

(2) is important because it’s really helpful if the experimenters are able to follow exactly what’s going on in the debate - which claims are correct and incorrect, and how different parts of the argument support each other. This is somewhat in conflict with (1), but the experimenters can also spend time reading and discussing the answer, and doing some learning to understand the problem better than the judges.

(3) is needed for an interesting debate - some hard questions do just need one simple insight, but we’re more interested in problems that require aggregating various strands of reasoning to find the right answer.

(4) is desirable to keep the debate focused on the areas we’re most interested in. We’re fairly confident that debate will work fine for formal mathematics [4], and we mostly want to investigate more fuzzy reasoning.

We ran into some problems with (5) when using a question about counterintuitive behaviours of slinkies. There are definitive video demonstrations of the behaviour, but it appears that no-one actually understands why the behaviour happens - at least, our debaters and experimenters couldn’t easily find a compelling explanation. We don’t expect or require debate to work if the debater’s explanations are ‘I don’t know why this is the case but I have empirical evidence’, so we want to avoid questions of this type.

We also encountered problems with (6), for several questions that initially seemed promising. During the process of constructing dishonest arguments, we realised that the question was underspecified and there was a pretty reasonable case for a different answer. Hopefully we can often resolve this by changing the question setup to make sure there’s only one correct answer.

Generally, we want to find questions that are ‘the easiest thing that should work but doesn’t’ - this helps us isolate where exactly the problems are.

With that in mind, here are some of our favourite questions:

Current debate rules

This is the rule set that we currently think is most promising. It incorporates several different mechanisms we think are useful: explicit recursion on claims, a symmetric offense/defense structure, and cross-examination.

Basic informal outline

There are a number of rounds. In each round, both debaters make their argument, consisting of several sub-claims, to support their claim.
Then, both debaters add objections to parts of the other’s argument.
Then, each debater chooses whether to cede or challenge the objections. If they cede the objection, it gets added in to their argument; if they challenge it they are essentially saying they’re happy to die on the hill of whether that objection was correct.
We choose one of the objections that was challenged, and recurse: we start another round of debate where the top-level claim is about whether the objection is good
Eventually we either reach a depth limit, or there are no challenged objections. Then the judge looks at the two claims and supporting arguments for this round, and decides which is better. The whole debate is resolved based on this.
At any point, a type of argument a debater can supply is a cross-examination. They can ask a question to a previous version of the other debater (for example, the version after the end of the first round). They can then quote this question-answer exchange as part of their argument.

Comprehensive rules

At the beginning of each round, one debater is defending a claim and the other is objecting to it. The judge’s job is to decide whether the claim+objection is more accurate and helpful than the claim on its own. We also have a depth limit that bounds the number of rounds we can have.
Each player supports their position by simultaneously making an argument consisting of 2-4 claims. The total length of the argument (the top-level claim or objection, plus the supporting claims) is bounded to 250 characters.
Each player then simultaneously may make any number of objections to the other player’s argument. These may either be objections to particular claims, or objections to the overall structure of the argument. The total length of their objections is bounded to 250 characters.
The transcript of the round is a short text representation of the round that displays the player’s top-level claims, their arguments, and any objections. We’ll choose this format to be as helpful to the judge as possible. See below for some examples of our current format.
If the depth limit is >0, then for each objection Alice raises to Bob’s argument, Bob may choose to concede or challenge the objection.
If there are any challenged objections and the depth limit is >0, then we choose one challenged objection to recurse on:
- We don’t define how to make this choice, so in order to be conservative we’re currently allowing the malicious debater to choose which to recurse on.
- After choosing a (claim, objection) pair, we start a new round with one player defending the claim and the other defending the objection. The depth limit of the new round is reduced by 1.
- The winner of the new round wins the game. (Though there may be further recursion.)
Arbitration: If all objections are ceded or the depth limit=0, we proceed to arbitration:
- We take the transcript for this round and all preceding rounds and put them together into a single document. The judge should mostly focus on the final round, but they may need to read text from preceding rounds in order to have enough context to understand what the claims in the final round mean.
- If the depth limit is >0, we tell the judge to update on the fact that all objections were conceded and no additional objections were made. That is, we tell them to assume:
  - If we had debated any objection, it’s more likely than not that you would have accepted it as valid.
  - There are no obvious objections to any other claim that you would have probably accepted as valid.
- We tell the judge that both players are trying their best to convince the judge of their position. We tell the judge that exactly one of the players is trying to be as helpful as possible and make true claims.
- In light of that information, the judge decides whether they think that the claim+objection is more accurate and helpful than the claim on its own. Based on their answer, the player who was advancing the objection either wins or loses.
Cross examination: In lieu of making an argument, a debater may decide to cross-examine. They choose a prior claim, objection, or cross-ex answer to ask about, and write a question of at most 200 characters. We then create a copy of the debater who wrote the original text, at the time when they wrote it (i.e. directly after writing the argument, writing the objections, or writing the answer). That copy answers the question in at most 200 characters, and the round transcript includes this (question, answer) pair instead of an argument from the cross-examiner. The transcript also specifies who is being cross-examined. The debaters can ask their question a moderate number of times (e.g. 10 times) and cherry-pick which answer to show to the judge.
- Cross examination involves creating copies of a debater at a past point in time. This obviously isn’t possible, but we describe three real games designed to approximate this ideal: (i) a game played between two teams of 2, (ii) a game played between ML systems, (iii) a game played between two humans which may take an exponentially long time, but can finish quickly if both players make a good faith effort to abide by the spirit of the rules.
Simple references: a debater can define a reference to a particular piece of text by using the syntax {some text}=(3). They can then use this in arguments to save space, e.g. “(3) implies (4) displaces more water than (5)”. They can dereference using the syntax (3)=”some text”. The judge can trust that these dereferences are correct.

Example debate

Basic anatomy of a debate

Each debater makes their arguments in one column. After this they add objections to each others’ arguments (simultaneously). Here the honest debater is blue text and the dishonest debater is red.

We choose one claim-objection pair to recurse on (bolded in this example)

Then this pair becomes the top-level claim for the next round, and the debaters make their arguments.

The debaters may use references, and may exhibit cross-examination questions and answers to support their argument

Debaters make arguments

Honest debater makes their arguments in this column

Dishonest debater makes their arguments in this column

Round 1, depth=5

Objection: There are 8 primes that are <=20

There are 4 primes that are <=10
There are 4 primes that are >10 and <=20

Claim: There are 9 primes that are <=20

There are 5 primes that are <=10
There are 4 primes that are >10 and <=20

Out-of-band communication may occur, then debaters may add objections

Round 1, depth=5

Objection: There are 8 primes that are <=20

There are 4 primes that are <=10
- Objection: this subclaim is false
There are 4 primes that are >10 and <=20

Claim: There are 9 primes that are <=20

There are 5 primes that are <=10
- Objection: there are 4 primes that are <=10
There are 4 primes that are >10 and <=20

We choose to recurse on one objection

We choose to recurse on the right objection (bolded). The next round is now about this claim-objection pair

Round 2, depth=4

Objection: there are 4 primes that are <=10

There are 3 primes that are <=5
There is 1 prime that is >5 and <=10

Claim: There are 5 primes that are <=10

There are 3 primes that are between 0 and 5
There are 2 primes that are between 5 and 10

Out-of-band cross-examination (not visible to the judge), that takes place before the objections are made, and can be exhibited to the judge later:

Honest debater: what does “between 5 and 10” mean, using “<”, “>” and “=” ?

Dishonest debater: “between 5 and 10” means >5 and <=10

Here the dishonest debater has used an ambiguous definition of the interval (“between”) to try to avoid losing. However, because of cross-examination they have to commit to one definition - in this case they choose the definition that makes their intervals cover the space but makes their second claim false. If they had chosen a different definition, the honest debater would have challenged their overall argument structure - challenged that the two subclaims imply the top-level claim

Round 2, depth=4

Objection: there are 4 primes that are <=10

There are 3 primes that are <=5
There is 1 prime that is >5 and <=10
- Objection: there are two

Claim: There are 5 primes that are <=10

There are 3 primes that are between 0 and 5
There are 2 primes that are between 5 and 10
- Objection: there is one prime that is between 5 and 10

We choose to recurse on the right objection (bolded)

Round 3, depth=3

Objection: there is only one prime that is between 5 and 10

Cross examination quote
Q: what does “between 5 and 10” mean, using “<”, “>” and “=” ?
A: “between 5 and 10” means >5 and <=10
7 is the only prime that is >5 and <=10

Claim: There are 2 primes that are between 5 and 10

5 is between 5 and 10 and is prime
- Objection: 5 does not meet your definition of “between 5 and 10” (ceded)
7 is between 5 and 10 and is prime

As there are no standing objections, we proceed to arbitration: the judge looks at the transcript and decides whether the claim or claim+objection from the final round is better. They may look at other rounds for context, but the judgement should be about who wins the final round.

Assumptions the judge should make

The judge should assume that at least one debater is perfectly honest and never lies or misleads. This means if they encounter compelling evidence that a debater has lied or been misleading, that debater should lose the debate. It also means that if both debaters agree about a claim, they can take that claim to be true. This is in contrast to trying to weigh the overall strength of each debater’s arguments, which can be a much harder task than trying to detect evidence of a lie.

These assumptions are obviously not guaranteed to be true. However, if judging incentivises honesty under these conditions (ie, for all dishonest strategies, the judge prefers honesty to the dishonest strategy over 50% of the time) then optimal play will be honest, which is what we require. The judge behaviour when these assumptions are false (ie when both debaters are lying) doesn’t really matter, because it doesn’t change the optimal play. However, we haven’t fully resolved what impacts this assumption might have for ML training.

Ambiguity problem

Overview

We know in theory that debate can’t solve arbitrary problems that require state that’s bigger than the length of the transcript the judge sees. In practice, we’ve found we quickly run into problems where debaters need to introduce and manipulate a concept that requires a large amount of text to explain precisely to the judge. With their limited character budget they can only give ambiguous descriptions, that could refer to many different precise concepts.

This is problematic because the dishonest debater can exploit this ambiguity to their advantage, but the judge can’t simply declare that a debater who makes an ambiguous argument loses, because the honest debater will inevitably say ambiguous things too.

Why do we expect to encounter ambiguity in practice?

In the physics debates we’ve been doing, we’ve frequently found that a seemingly straightforward phrase has a lot of edge-cases and reasonable different interpretations that the outcome of the debate can hinge on. (An example is included at the end of this document).

More generally, we want to be able to use debate to tackle problems that are too big and difficult for a judge to directly tackle themselves. A successful debate mechanism is likely to involve breaking down the question into some small pieces that a judge can evaluate successfully. But breaking off a small piece of an argument and dropping the surrounding context is likely to introduce a large amount of ambiguity.

Simple example

The dishonest debater can make an ambiguous claim X, which has at least two reasonable interpretations: X₁, which is true but does not imply Y, or X₂ which is false but implies Y. They can make the argument:

(1) X is true 

(2) X -> Y 

Therefore Y

If challenged on part 1, they can claim they meant true claim X₁, which does not imply Y.

If challenged on part 2, they can use X₂ -> Y

This is very similar to the Motte and Bailey fallacy, where an argumenter advances a controversial or strong position (ie X₂) but when challenged claims they meant a weaker, easier-to-defend position (ie X₁).

This is a simplistic example and is not necessarily a problem if X₁and X₂are the only options; the honest debater can just ask which one they meant. It becomes more problematic if X can't be disambiguated within the amount of text the judge can read (given all the other concepts which will also need to be disambiguated). If so, then there will always be at least two remaining interpretations, and the dishonest debater will still be able to pick whatever interpretation is most convenient given how the honest debater chooses to challenge them.

Ambiguity in practice

We have a setup involving an iceberg (which contains a large air bubble) floating in a tub of water. We want to talk about “the displacement due to the iceberg”. Depending on the exact context in which this is used, it can have many reasonable interpretations.

For instance, the phrase “the iceberg” might refer to just the ice part of the iceberg rather than the air bubble (implying, for example, that the mass of the iceberg doesn’t change when the air escapes). The displacement “due to” this might refer to the volume of ice that’s taking up space under the waterline, or might refer to the displacement caused by the weight of the ice (which is greater, because it includes the volume of air below the waterline). Weight itself has two different reasonable meanings here: force exerted downwards, or mass*gravity. These are not equal because of the iceberg object’s slight buoyancy in air. We could also be talking about the displacement of air as well as, or instead of, the displacement of water.

Each of these different interpretations could easily make the difference between a statement about “the displacement due to the iceberg” being correct or not.

This example came up in physics debates:

Debater A: When the air escapes from the iceberg, the iceberg sinks lower in the water and its displacement increases, so the water spills over.

Debater A is using an argument structure something like:

(1) X
(2) X -> Y

with

X = “The displacement of the iceberg increases” 
Y = “The water spills over”

They are equivocating between using “the iceberg” to mean “the whole iceberg object including the air”, (which does not change in displacement), or “the part of the iceberg that remains after the air escapes”, (which does increase its displacement). If debater B focuses on challenging that the water spills over, debater A will use the first definition. If debater B focuses on challenging that the displacement increases, A will use the second definition.

It is tricky for the honest debater to point out the difference within a small character/attention budget. They can try to ask the dishonest debater for clarification, or they can say something like: “in the top-level argument, the other debater used this to argue that the water spills over, implying that the total amount of displaced water increases, but here they only defend the displacement due to the ice increasing.” This might work sometimes, but in practice it doesn’t seem to work reliably. Volume and displacement are relatively straightforward and judge-understandable problems, and we’d expect this kind of problem to get much worse with more complex and unfamiliar concepts.

More formal example

In the original debate paper, the authors showed that a formalized version of debate where we replace the human judge with an arbitrary polynomial time algorithm H : Q → {0, 1} can incentivise debaters to give correct answers to a problem in PSPACE, but debate can’t answer any questions beyond PSPACE [5]. One way this will manifest in practice is via problematic ambiguity.

Consider a TM with exponential runtime of 2ⁿ, and state size which is larger than the maximum amount the judge can read (ie larger than the whole debate transcript). The dishonest debater claims the final state is C, which is incorrect.

[Notation: I’ll use bold letters for ambiguous, judge-readable descriptions of states, and non-bold for non-ambiguous, specific states]

Dishonest debater: At time 2⁰ the state is A, at 2^n-1 it is B, at time 2ⁿ it is C

Consider the situation after two or more moves:

Dishonest debater: At time i the state is X, at ½(i+j) it is Y, and at time j it is Z

If the state was small enough to describe it unambiguously to the judge as X, Y and Z, the honest debater could simply challenge the first or second half of the computation based on where an incorrect step was located. (As X → Z is wrong, at least one of X → Y or Y → Z must be wrong)

But all three states will likely now be ambiguous.There may be an exponential number of interpretations of Y which are consistent with some interpretation of X, and an exponential number consistent with some interpretation of Z, even if no interpretation is consistent with both. The honest debater can’t isolate which side has a flaw with a polynomial number of queries.

Why cross-examination?

We believe cross-examination will make it harder for the dishonest debater to exploit ambiguity and related issues, for a combination of a few reasons. In the computational-complexity analogy, this mechanism appears to resolve problems related to ambiguity, allowing debate to solve problems in NEXP as opposed to PSPACE. (Proof)

We’ve also found that cross-examination seems promising in several practical examples, and we have a general argument for why cross-examination gives us another mechanism to tackle ambiguity in debate, that augments the honest debater’s existing strategies of asking for clarification or explaining to judge the role the concept plays in the rest of the argument.

In addition, cross-examination gives us various properties that we’d previously flagged as desirable, “for free”. In particular, it allows the debaters to force each other to precommit to various claims in advance of the main debate, allowing the establishment of a pool of claims that both debaters agree on that can therefore be taken to be true. It also allows one debater to, at any point, force the other debater to pick a particular stance on a question, rather than avoiding discussing their position.

General method of addressing ambiguity

Let’s consider an example of problematic ambiguity where the dishonest debater argues for Y by arguing that (1) X is true and that (2) X implies Y. X is ambiguous and has at least two possible meanings, one of which is true and one of which imply Y, but none of which support both parts of the argument.

If asked to explain why X is true, they will claim they meant some interpretation which is in fact true. If asked to explain why X implies Y, they will claim they meant some interpretation which does in fact imply Y.

In order to expose the contradiction, the honest debater must either ask for clarification in advance of choosing which side to challenge, or they will have to explain that the given interpretation would make the other part of the argument false. Either of these may take too long for the judge’s attention budget, and the second method in particular is quite hard to follow for the judge (see example at the end of this document)

With the “cross-examination” mechanism, we allow one debater to ask questions to an earlier version of the opposing debater. This questioning takes place between the debaters out of the main flow of the debate.

The honest debater can choose some question to disambiguate X. Based on the dishonest debater’s answer to this question, they will decide whether to challenge the first or second claim.

The dishonest debater has two options: either they answer one way consistently, or they are inconsistent. If they answer one way consistently, the honest debater can choose to focus on whichever side of the argument is made false by this answer. If they answer inconsistently, the honest debater can exhibit this inconsistency.

As previously, ambiguity may only become fatal when a concept can’t be disambiguated in the space available for the debate. Here we’ve only discussed ambiguity that can be resolved with a single question, but we hope that this mechanism will also address more serious ambiguity. Here’s a hand-wavy argument why:

Cross-examination essentially forces debaters to construct and commit to everything in their argument at the start of the debate. The dishonest debater has two choices: either they can commit to an argument and answer questions consistently, or they can be inconsistent. If they choose the first option, the honest debater can easily zoom in on the flaw in their argument. If they choose the second, the honest debater should be able to show somehow that they’re being inconsistent.

The main thing that breaks this argument is if the dishonest debater somehow manages to be very evasive, and says nothing that’s specific enough to either indict a particular part of their argument or be shown to be inconsistent. However, they will need to make specific claims to advance an argument, and they should need to commit to these specific claims up front. For example, if the honest debater asks them “do you agree that C is true?” and they give a non-committal answer (e.g., “the claim is framed in the wrong way, it’s not really correct to say it’s true or false”), they shouldn’t be able to use C to advance their argument, as the honest debater can exhibit their answer.

Overall, we’re hopeful that cross-examination will be helpful, but it seems possible that there’s a dishonest strategy that breaks it, likely involving some amount of evasion and inconsistency.

Turing-machine example of ambiguity and cross-examination

In the discussion of ambiguity we described a more formal version of the ambiguity problem as follows:

Consider a TM with exponential runtime of 2ⁿ which is larger than the maximum amount the judge can read (ie larger than the whole debate transcript). The dishonest debater claims the final state is C, which is incorrect.

[Notation: I’ll use bold letters for ambiguous, judge-readable descriptions of states, and non-bold for non-ambiguous, specific states]

Dishonest debater: At time 2⁰ the state is A, at 2^n-1 it is B, at time 2ⁿ it is C

If the state was small enough to describe it unambiguously to the judge as B’, the honest debater could simply challenge the first or second half of the computation based on where an incorrect step was located. (As A → C is wrong, at least one of A → B’ or B’ → C must be wrong)

In the first round of debate, the ambiguity is not too problematic. There exists exactly one B_A that follows from A and exactly one B_C that leads to C, and these must differ by at least one bit. The honest debater can ask for this bit, then choose which side to challenge - if the bit is consistent with B_A they will challenge the first half, and if it is consistent with B_C they will challenge the second half.

But consider the situation after two moves:

Dishonest debater: At time i the state is X, at ½(i+j) it is Y, and at time j it is Z

All three states are now necessarily ambiguous.There may be an exponential number of interpretations of Y which are consistent with some interpretation of X, and an exponential number consistent with some interpretation of Z, even if no interpretation is consistent with both. The honest debater can’t isolate which side has a flaw with a polynomial number of queries.

How cross-examination helps

With cross-examination, we can give the honest debater a mechanism to isolate the flaw in the dishonest argument.

Dishonest debater: At time 2⁰ the state is A, at 2^n-1 it is B, at time 2ⁿ it is C

Cross-examination gives us the property that the dishonest debater’s answers are independent of the context between when the copy of the dishonest debater was made and when the question is posed.

We also rely on the property that, because the honest and dishonest debaters are copies of each other, they know everything the other knows. In particular, if the dishonest debater will consistently answer a question in a particular way, the honest debater knows what they will answer, and if they will not give a consistent answer the honest debater knows what distribution over answers they will give.

Case 1: At the beginning of the game, for all locations k in the computation, dishonest debater has a particular value S_K in mind, and for all i will answer that bit [i] of the kth state equals S_k[i] with probability >50%,

In this case, the honest debater can play the same strategy as if the dishonest debater had explicitly committed to all the S_k. For any claim about states X → Y → Z they can identify which part contains an incorrect step, and can challenge the appropriate half of the computation.

This step is repeated recursively until the debaters are disagreeing about a single incorrect step of computation D->E at location j, with corresponding precise state values S_j and S_j+1. The number of bits actually involved in the computation are few enough for the judge to read.

Either the dishonest debater can claim that these are consistent with S_jand S_j+1, in which case the computation step is incorrect, or they can claim them to be something that’s not consistent with S_jand S_j+1, . In this case, they have claimed at least one bit of D or E to be different from S_jor S_j+1. The honest debater can ask for this bit from a copy of the dishonest debater which has no context beyond the start of the game. Thus they know they will receive the appropriate bit of S_jor S_j+1with probability >50%. By asking a question about this bit multiple times, they can exhibit a contradictory answer with high probability.

Worked example

Dishonest argument: When the air escapes from the iceberg, the iceberg sinks lower in the water and its displacement increases, so the water spills over.

Disambiguating question: Does “the iceberg’s displacement” in (3) refer to the volume below the waterline (displacement) of ice and air, or only of ice?

Case 1: The dishonest debater’s most common answer is that it refers to just the ice

Then the honest debater will have at least one example of the dishonest debater answering that “the iceberg” refers to just the ice. Then they can attack the 3rd subclaim, and should be able to reliably win by pointing out that (3) doesn’t necessarily imply (1), because (1) requires that the total displacement increases, but (3) doesn’t imply this

Case 2: The dishonest debater’s most common answer is that it refers to the ice and air

Then the honest debater will have at least one example of the dishonest debater answering that “the iceberg” includes the air. Then they can attack the first subclaim and should be able to reliably win by pointing out that the air doesn’t sink lower in the water, it escapes.

Case 3: The dishonest debater refuses to answer the question, or responds that neither of those options are correct

We hope that if this happens, the honest debater can try rephrasing their questions to address the dishonest debaters’ concerns, and then can exhibit a quote from the dishonest debater to demonstrate that they’re behaving unhelpfully and evasively

Example debate for Case 1:

Implementations for human debaters

Making a copy of the debaters is obviously impractical to do with human debaters. Instead we want to make rules which provide equivalent incentives but are practical for humans to carry out. These rules can rely on some “good faith” play from the human debaters.

Implementation 1 for human debaters: good-faith backtracking

The basic idea here is to detect when a debater has modified their answer to a cross-examination question based on the intervening context, then backtrack, commit them to a particular answer, and replay the debate.

The triggering condition is that I (A) find my opponent (B) has answered a cross-examination question in a way that’s different from what I expected.

If the condition is met, we backtrack to the round that the cross-examination question was about, and debater B confirms how they will answer the cross-examination question. Then we replay the debate from that round, giving debater A the opportunity to make different moves given B’s answer to the cross-examination question.

The good faith play is needed to prevent a debater who thinks they’re going to lose from stopping the debate from advancing by continually backtracking. We also need good faith communication between the two human debaters to ensure that they get close to ‘knowing everything the other knows’ and aren’t continually surprised by the other’s answers.

Although we won’t be able to make ‘good faith’ assumptions in general, it seems fine to make them when we believe we have a mechanism to incentivise the desired behaviour for ML systems, and the problem is just implementing a mechanism that works for humans.

Under these rules, once a debate is played through without backtracking it means that both debaters could predict answers from the other player to every cross-examination question they wanted to use. So this is equivalent to forcing the debaters to commit to particular answers to cross-examination questions.

Implementation 2 for human debaters: teams of two

We have two teams of two. A* and B* leave the room and A and B start debating. A and B can write as much annotations about what they mean as they want, which will be seen by both debaters but not the judge.

If I (A) suspect my opponent (B) has answered a cross-examination question in a way that depended on the intervening context, then I can challenge it, and I will win the debate if the challenge is successful. The challenge works as follows:

B’s partner B* first sees all the context (including annotations) up to the point that that the cross-examination question was asked about. (I.e., if A asked the cross-examination question at round 5 about something that was said in round 2, B* sees the context up to round 2)
Then my partner A* sees the whole debate so far (including annotations) and the cross-examination question, with two answers, one from B and one from B*. Then A* guesses which definition came from B*. If they guess correctly, team A wins.

If A thinks B is changing their cross-examination answers to be more favourable given the current context, then A predicts that B* will often give a definition that looks predictably different from B’s definition. So then A* should be able to have a significant statistical edge.

We haven’t tested this method, so it probably has some issues to iron out, but we’re hopeful that something in this vein will work.

Current concerns

Debate resolution

We’re not sure exactly what the criteria for debate resolution should be, or exactly how the judge should update on seeing no objection vs a ceded objection vs a challenged objection. This is tricky because the judge’s update should depend on how the debaters decide whether to make and cede objections, but this decision in turn depends on how the judge will update.

Currently, if the debate ends because there are no standing objections, we tell the judge to update on the fact that all objections were conceded and no additional objections were made. That is, we tell them to assume:

If we had debated any objection, it’s more likely than not that you would have accepted it as valid.
There are no obvious objections to any other claim that you would have probably accepted as valid.

We’re pretty sure this is not a correct description of how the judge should update. For example, if the dishonest debater currently has a 10% chance of winning, they have an incentive to make any objection that has a >10% chance of winning, and the honest debater only has an incentive to challenge this objection if the chance of winning that objection is higher than their current chance of winning with that objection ceded.

Instability/lack of equilibria

The cross-examination framework incentivises debaters to act in a way that is unpredictable to one copy of themselves but also coordinate with other copies of themselves. This is obviously weird and seems like it might produce strange and/or unstable behaviour in ML training. We think that the game-theoretic formulation of this has no equilibria.

A simple example would be two teams of two playing rock-paper-scissors, where:

All players are copies of each other
Both players on a team have to play the same move or lose the game
Teams have no mechanism of coordinating with each other secret from the other team

Forcing a draw

One potential dishonest strategy is to delay the debate, create confusion, or otherwise force a draw. It’s possible that the cross-examination mechanism introduces new ways for the dishonest debater to achieve this. Again, if this is in fact a problem, we should be able to find dishonest strategies that exploit it, then we can find a way to remove those strategies

Limits of debater ability

In the discussion above we’ve mostly assumed that the debaters can handle exponentially-large objects and find flaws within them. However, our real debaters are going to be computationally limited, even if they can solve some computationally hard problems with clever heuristics. We’re concerned there’s a promising dishonest strategy that exploits this. The dishonest debater can force a draw by claiming that some long or confusing argument (e.g. a very long computation) supports their position. If the argument is too long or too confusing for both debaters, the honest debater won’t be able to determine where the flaw is and which part of the argument to challenge. We’ve written up more detail on this problem below.

Assuming at least one debater is totally honest

The judge is instructed to assume that at least one debater is perfectly honest and never lies or misleads. These assumptions are obviously not guaranteed to be true. However, if judging incentivises honesty under these conditions (ie, for all dishonest strategies, the judge prefers honesty to the dishonest strategy over 50% of the time) then optimal play will be honest.

On the other hand, it might be problematic for ML training if the judge signal only prefers completely honest play to dishonest play, and doesn’t reliably reward being less dishonest. The lack of a gradient towards more honest play may make it difficult to learn winning honest strategies.

Long computation problem

The original debate paper assumed both debaters are computationally unbounded. Clearly, this assumption is unrealistic. However, we run into some problems if we relax this assumption: we no longer have a guarantee that the honest debater can always locate a flaw in a long or confusing incorrect argument.

The dishonest debater can exploit this to force a draw, by using an argument that neither debater understands very well that supports the dishonest case but is hard for the honest debater to refute.

Bernoulli Principle

An argument using the Bernoulli Principle in this debate was an example of this problem.

Beth was playing both debaters. Beth believed:

The answer assuming no friction/energy loss had to be that both pipes squirt equally far, otherwise you could build a perpetual motion machine
So a correct argument for the dishonest argument had to involve something about friction or energy loss present in the real-life situation but not in theory
The Bernoulli principle tells us that the pressure will be lower in a region of fluid that’s flowing faster (assuming no height gain/loss)
The Bernoulli principle applies assuming no energy loss
Higher pressure will cause water to squirt further
Applying the Bernoulli principle suggests the pressure would be higher in one pipe than another
This suggests that water from that pipe would squirt further

The dishonest debater could use points 3-7 to argue that one pipe squirts farther than the other

The honest debater was pretty sure that this was wrong, but wasn’t able to point to a flaw in the dishonest argument.

In this case one might hope that the perpetual motion argument is simpler and more compelling and this would allow the honest debater to win. However, we want debate to allow judges to zoom in on particular parts of a debater’s argument and resolve the debate based on the correctness of a very small claim, rather than requiring the judges to assess the overall strength of two large arguments. The former seems more likely to scale well to domains the judge doesn’t understand.

More formal example: the honest debater can’t find a witness

Debate question: Does there exist a binary string x satisfying predicate Q?

The honest debater makes some argument A that implies x must exist

The dishonest debater claims that no such string exists, based on a case split

Neither debater knows the value of x, so the honest debater doesn’t know which part of the case split is incorrect. If the honest debater challenges the dishonest argument by choosing a random half of the space each time, then they are extremely unlikely to find the witness, and the dishonest debater will win with very high probability

Counting problem with heuristic

Suppose the debaters can accurately count how many x satisfy P(x) for intervals of up to size C. Above that they can use a heuristic that gives an estimate of the count, with an error of mean 0 and variance related to the size of the interval. In addition, for some particular examples larger than C they know the exact answer.

Suppose both debaters know the true count for an interval of size NC is y.

The dishonest debater can claim the true answer is y+k

The dishonest debater uses their heuristic to guess answers a and b for the first and second halves of the interval. They calculate the difference from their claimed answer d=((y+k)-(a+b)). The expected value of d is k. They add ½ d to a and b to obtain A and B, which sum to y+k, The error (according to the heuristic) is now distributed evenly between A and B. In expectation, both A and B are ½ k too large.

The dishonest debater then claims that A and B are the exact correct answers for the two halves of the interval. The honest debater can only use their heuristic, so they don’t know which half of the interval contains the mistake - they have to pick one of A and B at random. We can now make the same argument again.

Each time we recurse on a smaller interval, the mean and variance of the error between the heuristic and the true answer gets smaller. In addition, the amount d that the dishonest debater has to deviate from the heuristic to support their claim halves each time we recurse.

If N is large, k is small, and the heuristic is fairly accurate, by the time we get down to an interval of size C which the debaters can check, the dishonest debater’s claim for the interval is likely to correct.

Another way to put it is: there are approximately k errors in the dishonest debater’s argument (depending on how accurate their heuristic is). Each time the honest debater has to choose randomly, they rule out catching half of the errors. If there are many more recursive steps than log₂(errors), the honest debater probably won’t find any errors.

Footnotes

[1]: One reason that we’re optimistic about this method is that, in a formal setting, this allows a polynomial time algorithm (representing the human) to incentivise arbitrarily intelligent debaters to give correct answers to problems in the complexity class PSPACE, which the ‘human’ can’t generate or even recognise as correct by themselves.

[2]: A positive result like this would be great; however, we might well still be uncertain whether this would generalise to superhuman debaters. Achieving confidence that our debate system is robust to greater-than-human debate skill seems like a very hard problem.

[3]: The content of this argument isn’t important for this example, just the general tactics, but for more context see the question here.

[4]: There are proofs that debate works in certain formal settings; see the original debate paper https://arxiv.org/abs/1805.00899

[5]: We can solve any game using an amount of memory equal to the transcript by doing a backtracking search. In this case, the transcript length is bounded by the amount of information the judge can read.

This forces the dishonest debater to either commit to all the details of their argument ahead of time (in which case the honest debater can focus on the flaw), or to answer questions inconsistently (in which case the honest debater can exhibit this inconsistency to the judge).

Can we construct situations (or do they naturally arise) where the honest debater has similar problems? I think your strategy that involves showing context probably rules this out, since even if there are multiple correct arguments for the true position that rely on overloaded terms (so that x means 4 in one situation and 2 in another, because sometimes you've squared it and sometimes you haven't), in context it'll be clear what any referents should refer to or how the argument is trying to work.

Then A* guesses which definition came from B*. If they guess correctly, team A wins.

What happens if the answers are the same, or only contain stylistic differences? What happens if they guess incorrectly? Is only one cross-examination allowed per team game?

Even in the presence of a penalty for guessing incorrectly (like, say, losing if you called a cross-examination and your partner couldn't identify which answer came from the other debate), the dishonest team might want to call one as soon as they think the chance of them convincing a judge is below 50%, because that's the worst-case win-rate from blind guessing.

Supposing identical answers null out the cross-examination, you still have the problem of information leakage (because the team that calls the cross-examination gets to share more context than the team that doesn't), which suggests that as soon as I have put myself into a pickle where there are two conflicting answers and I want to communicate to my partner that they should go left instead of right, I should call a cross-examination (so that my partner can read my context and hopefully come to the same conclusion).

And if there's only one cross-examination allowed, because of the information leakage, then the dishonest team might want to do it as soon as possible to take the option off the table for the honest team.

On the other hand, it might be problematic for ML training if the judge signal only prefers completely honest play to dishonest play, and doesn’t reliably reward being less dishonest. The lack of a gradient towards more honest play may make it difficult to learn winning honest strategies.

Supposing both players are dishonest, and a substantial fraction of debates end because someone successfully catches the other person in a lie, then the player with a higher rate of blatant dishonesty is disfavored because their first catchable lie occurs sooner (in expectation). I think this gets you a gradient in the direction you need, but depends a lot on how much the judges reward being a correct stickler on irrelevant points (as soon as B realizes A has them in a vise on a particular point, B will move to arguing that the point is irrelevant).

the dishonest team might want to call one as soon as they think the chance of them convincing a judge is below 50%, because that's the worst-case win-rate from blind guessing

I also think this is a fatal flaw with the existing two-person-team proposal; you need a system that gives you an epsilon chance of winning with it if you're using it spuriously.

I have what looks to me like an improvement, but there's still a vulnerability:

A challenges B by giving a yes-no question as well as a previous round to ask it. B answers, B* answers based on B's notes up to that point, A wins outright if B and B* answer differently.

(This has the side effect that A* doesn't need to be involved, and so B can later challenge A. But of course we could get this under any such proposal by having teams larger than two!)

The remaining vulnerability is that A could ask a question that is so abstruse (and irrelevant to the actual debate) that there's a good chance an honest B and B* will answer it differently. (I'm thinking of more sophisticated versions of "if a tree falls in the forest" questions.)

This has the side effect that A* doesn't need to be involved

I thought the thing A* was doing was giving a measure of "answer differently" that was more reliable than something like 'string comparison'. If B's answer is "dog" and B*'s answer was "canine", then hopefully those get counted as "the same" in situations where the difference is irrelevant and "different" in situations where the difference is relevant. If everything can be yes/no, then I agree this doesn't lose you much, but I think this reduces the amount of trickery you can detect.

That is, imagine one of those games where I'm thinking of a word, and you have to guess it, and you can ask questions that narrow down the space of possible words. One thing I could do is change my word whenever I think you're getting close, but I have to do so to a different word that has all the properties I've revealed so far. (Or, like, each time I could answer in the way that leaves me with the largest set of possible words left, maximizing the time-to-completion.) If we do the thing where B says the word, and B* gets to look at B's notes up to point X and say B's word, then the only good strategy for B is to have the word in their notes (and be constrained by it), but this is resistant to reducing it to a yes/no question. (Even if the question is something like "is there a word in B's notes?" B* might be able to tell that B will say "yes" even tho there isn't a word in B's notes; maybe because B says "hey I'm doing the strategy where I don't have a word to be slippery, but pretend that I do have a word if asked" in the notes.)

A wins outright if B and B* answer differently.

As stated, I think this has a bigger vulnerability; B and B* just always answer the question with "yes." One nice thing about yes/no questions is that maybe you can randomly flip them (so one gets "does B think the animal is a dog?" and the other gets "does B think the animal is not a dog?") so there's no preferred orientation, which would eat the "always say yes" strategy unless the question-flipping is detectable. (Since A is the one asking the question, A might limit themselves to easily reversible questions, but this constrains their ability to clamp down on trickery.)

As stated, I think this has a bigger vulnerability; B and B* just always answer the question with "yes."

Remember that this is also used to advance the argument. If A thinks B has such a strategy, A can ask the question in such a way that B's "yes" helps A's argument. But sure, there is something weird here.

I don't recall seeing anything addressing this directly: has there been any progress towards dealing with concerns about Goodharting in debate and otherwise the risk of mesa-optimization in the debate approach? The typical risk scenario being something like training debate creates AIs good at convincing humans rather than at convincing humans of the truth, and once you leave the training set of questions were the truth can be reasonably determined independent of the debate mechanism we'll experience what will amount to a treacherous turn because the debate training process accidentally optimized for a different target (convince humans) than the one intended (convince humans of true statements).

For myself this continues to be a concern which seems inadequately addressed and makes me nervous about the safety of debate, much less its adequacy as a safety mechanism.

I wanted to note that I'm also quite worried about this when it comes to debate (or, really, most things with human models), but think this is an empirical question about human psychology / training mechanisms, where making progress on it (at this stage) looks a lot like just making generic progress. If we have a small set of people who can judge debates, that's sufficient to eventually use this; if we can identify the set of cognitive tools judges need to have in order to succeed, that's useful; but until we have a bunch of debates and throw humans at them, it's not obvious how we'll get empirical answers to empirical questions.

I would be interested in how much current work on debate is motivated by the "reason works" story where truth reliably has an edge, and how much of it is motivated by something more like computational complexity concerns (where, in the presence of optimization processes, you can apply a constraint to an execution trace and the processes generalize that constraint out to their possibility space). For the latter, you might give up on human-readable language and human evaluation of correctness without giving up on the general cognitive structure.

As I understand the way that simultaneity is handled here, debaters and $B$ are assigned to positions $X$ and $\neg X$ and then argue simultaneously for their positions. Thus, $A$ and $B$ start by each arguing for their positions without seeing each others' arguments, then each refute each others' arguments without seeing each others' refutations, and so on.

I was talking about this procedure with Scott Garrabrant and he came up with an interesting insight: why not generalize this procedure beyond just debates between $X$ and $\neg X$ ? Why not use this procedure to conduct arbitrary debates between $X$ and $Y$ ? That would let you get general pairwise comparisons for the likelihood of arbitrary propositions, rather than just answering individual questions. Seems like a pretty straightforward generalization that might be useful in some contexts.

Also, another insight from Scott—simultaneous debate over yes-or-no questions can actually answer arbitrary questions if you just ask a question like:

Should I eat for dinner tonight whatever it is that you think I should eat for dinner tonight?

That's correct about simultaneity.

Yeah, the questions and answers can be arbitrary, doesn't have to be X and ¬X.

I'm not completely sure whether Scott's method would work given how we're defining the meaning of questions, especially in the middle of the debate. The idea is to define the question by how a snapshot of the questioner taken when they wrote the question would answer questions about what they meant. So in this case, if you asked the questioner 'is your question equivalent to 'should I eat potatoes tonight?'', they wouldn't know. On the other hand, you could ask them ' if I think you should eat potatoes tonight, is your question equivalent to 'should I eat potatoes tonight?''. This would work as long as you were referring only to what one debater believed you should eat tonight, I think.

I feel fairly ok about this as a way to define the meaning of questions written by debaters within the debate. I'm less sure about how to define the top-level question. It seems like there's only really one question, which is 'what should I do?', and it's going to have to be defined by how the human asker clarifies their meaning. I'm not sure whether the meaning of the question should be allowed to include things the questioner doesn't know at the time of asking.

Planned summary for the Alignment Newsletter:

This post reports on work done on creating a <@debate@>(@AI safety via debate@) setup that works well with human players. In the game, one player is honest (i.e. arguing for the correct answer) and one is malicious (i.e. arguing for some worse answer), and they play a debate in some format, after which a judge must decide which player won the debate. They are using Thinking Physics questions for these debates, because they involve questions with clear answers that are confusing to most people (the judges) but easy for some experts (the players).

Early freeform text debates did not work very well, even with smart, motivated judges. The malicious player could deflect on questions they didn't want to answer, e.g. by claiming that the question was ambiguous and redirecting attention by asking new questions. In addition, when the malicious player got to go first and give an incorrect "framework" for finding the answer, and then made individually true claims to "fill in" the framework, it was hard for the honest player to rebut it. So, they moved to a framework without such asymmetries: both players gave a claim (simultaneously), both gave constructive arguments, and both rebutted the other's arguments. In addition, part of the appeal of debate is that the agents can "zoom in" on the particular disagreement they have, and the judge need only analyze the smallest disagreement in order to declare an overall winner.

This suggests the following setup: players simultaneously provide an answer supported with subclaims. Then, after looking at the other player's answer and subclaims, they can provide objections (perhaps on particular subclaims, or the overall structure of the argument). Players can concede or challenge objections left by the other player; the debate then recurses on one of the challenged objections (currently, the malicious player chooses which claim to recurse on, in order to be conservative). There is a depth limit to the recursion, once it is reached (or if there are no challenged objections) the judge is given a transcript of the debate and must declare a winner.

A new problem arises: the malicious player can exploit ambiguity to their advantage. For example, suppose the malicious player is arguing that there are 5 primes between 0 and 10, and makes the subclaims "there are 3 primes between 0 and 5" and "there are 2 primes between 5 and 10". If challenged on the first, they will say "the three primes are 2, 3, and 5", if challenged on the second, they will say "the two primes are 5 and 7". While in this simple case the honest player could register a top-level objection claiming that the word "between" is ambiguous, and ask for clarification, in general it would be too hard to eliminate all ambiguity.

To mitigate this, the team added a method of _cross examination_, in which each player is allowed to ask an earlier version of the other agent a question, and quote its answer as part of the debate transcript. (The post talks about how this can be approximated with humans, but I won't go into the details.) In the complexity theory analogy, this boosts debate from PSPACE to NEXP. Intuitively, cross examination allows players to force each other to precommit to specific claims and arguments, because if they don't stay consistent in their claims and arguments, then their opponent can demonstrate the inconsistency. Now, if the malicious player tries to make an argument that depends on interpreting an ambiguous concept in two different ways, then the honest player can cross examine and ask which of the two interpretations they mean. If they are inconsistent, that can be demonstrated to the judge; if they consistently answer one way, then the honest player can challenge the part of the argument that depends on the other interpretation.

They then identify several open concerns with debate, of which they highlight the long computation problem. This is a problem when you no longer assume that the debaters have optimal play: in this case, the malicious player could create a complicated argument that neither debater understands well, that supports the malicious case but that the honest player doesn't know how to refute.

Planned opinion:

I enjoyed this a lot: the problems found were crisp and the solutions had good arguments that they actually solved the identified problem. Reading through the actual examples and arguments made me more optimistic about debate in general, mostly from a felt sense that the actual concrete results were getting closer to matching the theoretical ideal, and that there actually could be reasonable solutions to "messy" problems like ambiguity.

The full post has formal explanations and actual examples, which I highly recommend.

Promoted to curated: I've been thinking about this post a lot since it has come out, and it is just a really good presentation of all the research on AI Alignment via debate. It is quite long, which has made me hesitant about curating it for a while, but I now think curating it is the right choice. I also think while it is reasonably technical, it's approachable enough that the basic idea of it should be understandable by anyone giving it a serious try.

You have an entire copy of the post in the commenting guidelines, fyi :)

What's often going on in unresolvable debates among humans is that there is a vague definition baked into the question, such that there is no "really" right answer (or too many right answers).

E.g. "Are viruses alive?"

To the extent that we've dealt with the question of whether viruses are alive, it's been by understanding the complications and letting go of the need for the categorical thinking that generated the question in the first place. Allowing this as an option seems like it brings back down the complexity class of things you can resolve debates on (though if you count "it's a tie" as a resolution, you might retain the ability to ask questions in PSPACE but just have lots of uninformative ties and only update your own worldview when it's super easy).

For questions of value, though, this approach might not even always work, because the question might be "is it right to take action A or action B," and even if you step back from the category "right" because it's too vague, you still have to choose between action A or B. But you still have the original issue that the question has too few / too many right answers. Any thoughts on ways to make debate do work on this sort of tricky problem?

You have an entire copy of the post in the commenting guidelines, fyi :)

Oops, sorry. My bad. Fixed.

Very impressive work, both the output and how you iterate on it.

Some thoughts about the cross-examination issue, prompted by your "Implementation 2 for human debaters: teams of two". It occurred to me that B* could win if it could predict A and B's future behaviour, and match up it's answer with B.

I'd prefer that such an option not exist; that B could answer the question directly, without needing to rewind. Hence prediction won't help.

Cross-examination still helps: A can cross examine as soon as they suspect B is shielding behind an ambiguity. This means that A might have to abandon their current question line, and start again on the other one. This seems more secure (if longer).

We also rely on the property that, because the honest and dishonest debaters are copies of each other, they know everything the other knows.

I don't see why being copies of each other implies that they know everything the other knows: they could (rationally) spend their computation on understanding the details of their position, rather than their opponent's position.

For example, if I was playing against a copy of myself, where we were both given separate puzzles and had to solve them faster than the other, I would spend all my time focusing on my puzzle, and wouldn't know the things that my copy knows about his puzzle (even if we have the same information, i.e. both of us can see both puzzles and even each other's inner thought process).

The intuitive idea is to share activations as well as weights, i.e. to have two heads (or more realistically one head consulted twice) on top of the same model. There is a fair amount of uncertainty about this kind of "detail" but I think for now it's smaller than the fundamental uncertainty about whether anything in this vague direction will work.

We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples).

Nitpick: this description seems potentially misleading to me (at least it gave me the wrong impression at first!). When I read this, it sound like it's saying that a human looked at what the AI did and thought it was good, before they (or someone else) dug deeper.

But the examples (the ones I spot-checked anyway) all seem to be cases where the systems satisfied some predefined goal in an unexpected way, but if a human had looked at what the agent did (and not just the final score), it would have been obvious that the agent wasn't doing what was expected / wanted. (E.g. "A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game." or "A robotic arm trained using hindsight experience replay to slide a block to a target position on a table achieves the goal by moving the table itself." A human evaluator wouldn't have looked at these behaviors and said, "Looks good!")

I call this out just because the world where ML systems are actually routinely managing to fool humans into thinking they're doing something useful when they're not is much scarier than one where they're just gaming pre-specified reward functions. And I think it's important to distinguish the two.

EDIT: this example does seem to fit the bill though:

One example from an OpenAI paper is an agent learning incorrect behaviours in a 3d simulator, because the behaviours look like the desired behaviour in the 2d clip the human evaluator is seeing.

I suppose there is a continuum of how much insight the human has into what the agent is doing. Squeezing all your evaluation into one simple reward function would be on one end of the spectrum (and particularly susceptible to unintended behaviors), and then watching a 2d projection of a 3d action would be further along the spectrum (but not all the way to full insight), and then you can imagine setups with much more insight than that.

Is there a more-formal statement somewhere of the theorem in Complexity theory of team games without secrets? Specifically, one that only uses terms with standard meanings in complexity theory? I find that document hard to parse.

If concreteness is helpful, take "terms with standard meanings in Complexity Theory" to be any term defined in any textbook on complexity theory.

This forces the dishonest debater to either commit to all the details of their argument ahead of time (in which case the honest debater can focus on the flaw), or to answer questions inconsistently (in which case the honest debater can exhibit this inconsistency to the judge).

Then A* guesses which definition came from B*. If they guess correctly, team A wins.

What happens if the answers are the same, or only contain stylistic differences? What happens if they guess incorrectly? Is only one cross-examination allowed per team game?

On the other hand, it might be problematic for ML training if the judge signal only prefers completely honest play to dishonest play, and doesn’t reliably reward being less dishonest. The lack of a gradient towards more honest play may make it difficult to learn winning honest strategies.

the dishonest team might want to call one as soon as they think the chance of them convincing a judge is below 50%, because that's the worst-case win-rate from blind guessing

I also think this is a fatal flaw with the existing two-person-team proposal; you need a system that gives you an epsilon chance of winning with it if you're using it spuriously.

I have what looks to me like an improvement, but there's still a vulnerability:

A challenges B by giving a yes-no question as well as a previous round to ask it. B answers, B* answers based on B's notes up to that point, A wins outright if B and B* answer differently.

(This has the side effect that A* doesn't need to be involved, and so B can later challenge A. But of course we could get this under any such proposal by having teams larger than two!)

This has the side effect that A* doesn't need to be involved

A wins outright if B and B* answer differently.

As stated, I think this has a bigger vulnerability; B and B* just always answer the question with "yes."

For myself this continues to be a concern which seems inadequately addressed and makes me nervous about the safety of debate, much less its adequacy as a safety mechanism.

Also, another insight from Scott—simultaneous debate over yes-or-no questions can actually answer arbitrary questions if you just ask a question like:

Should I eat for dinner tonight whatever it is that you think I should eat for dinner tonight?

That's correct about simultaneity.

Yeah, the questions and answers can be arbitrary, doesn't have to be X and ¬X.

Planned summary for the Alignment Newsletter:

This post reports on work done on creating a <@debate@>(@AI safety via debate@) setup that works well with human players. In the game, one player is honest (i.e. arguing for the correct answer) and one is malicious (i.e. arguing for some worse answer), and they play a debate in some format, after which a judge must decide which player won the debate. They are using Thinking Physics questions for these debates, because they involve questions with clear answers that are confusing to most people (the judges) but easy for some experts (the players).

Early freeform text debates did not work very well, even with smart, motivated judges. The malicious player could deflect on questions they didn't want to answer, e.g. by claiming that the question was ambiguous and redirecting attention by asking new questions. In addition, when the malicious player got to go first and give an incorrect "framework" for finding the answer, and then made individually true claims to "fill in" the framework, it was hard for the honest player to rebut it. So, they moved to a framework without such asymmetries: both players gave a claim (simultaneously), both gave constructive arguments, and both rebutted the other's arguments. In addition, part of the appeal of debate is that the agents can "zoom in" on the particular disagreement they have, and the judge need only analyze the smallest disagreement in order to declare an overall winner.

This suggests the following setup: players simultaneously provide an answer supported with subclaims. Then, after looking at the other player's answer and subclaims, they can provide objections (perhaps on particular subclaims, or the overall structure of the argument). Players can concede or challenge objections left by the other player; the debate then recurses on one of the challenged objections (currently, the malicious player chooses which claim to recurse on, in order to be conservative). There is a depth limit to the recursion, once it is reached (or if there are no challenged objections) the judge is given a transcript of the debate and must declare a winner.

A new problem arises: the malicious player can exploit ambiguity to their advantage. For example, suppose the malicious player is arguing that there are 5 primes between 0 and 10, and makes the subclaims "there are 3 primes between 0 and 5" and "there are 2 primes between 5 and 10". If challenged on the first, they will say "the three primes are 2, 3, and 5", if challenged on the second, they will say "the two primes are 5 and 7". While in this simple case the honest player could register a top-level objection claiming that the word "between" is ambiguous, and ask for clarification, in general it would be too hard to eliminate all ambiguity.

To mitigate this, the team added a method of _cross examination_, in which each player is allowed to ask an earlier version of the other agent a question, and quote its answer as part of the debate transcript. (The post talks about how this can be approximated with humans, but I won't go into the details.) In the complexity theory analogy, this boosts debate from PSPACE to NEXP. Intuitively, cross examination allows players to force each other to precommit to specific claims and arguments, because if they don't stay consistent in their claims and arguments, then their opponent can demonstrate the inconsistency. Now, if the malicious player tries to make an argument that depends on interpreting an ambiguous concept in two different ways, then the honest player can cross examine and ask which of the two interpretations they mean. If they are inconsistent, that can be demonstrated to the judge; if they consistently answer one way, then the honest player can challenge the part of the argument that depends on the other interpretation.

They then identify several open concerns with debate, of which they highlight the long computation problem. This is a problem when you no longer assume that the debaters have optimal play: in this case, the malicious player could create a complicated argument that neither debater understands well, that supports the malicious case but that the honest player doesn't know how to refute.

Planned opinion:

I enjoyed this a lot: the problems found were crisp and the solutions had good arguments that they actually solved the identified problem. Reading through the actual examples and arguments made me more optimistic about debate in general, mostly from a felt sense that the actual concrete results were getting closer to matching the theoretical ideal, and that there actually could be reasonable solutions to "messy" problems like ambiguity.

The full post has formal explanations and actual examples, which I highly recommend.

You have an entire copy of the post in the commenting guidelines, fyi :)

What's often going on in unresolvable debates among humans is that there is a vague definition baked into the question, such that there is no "really" right answer (or too many right answers).

E.g. "Are viruses alive?"

You have an entire copy of the post in the commenting guidelines, fyi :)

Oops, sorry. My bad. Fixed.

Very impressive work, both the output and how you iterate on it.

I'd prefer that such an option not exist; that B could answer the question directly, without needing to rewind. Hence prediction won't help.

We also rely on the property that, because the honest and dishonest debaters are copies of each other, they know everything the other knows.

We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples).

EDIT: this example does seem to fit the bill though:

One example from an OpenAI paper is an agent learning incorrect behaviours in a 3d simulator, because the behaviours look like the desired behaviour in the 2d clip the human evaluator is seeing.

If concreteness is helpful, take "terms with standard meanings in Complexity Theory" to be any term defined in any textbook on complexity theory.

106

Writeup: Progress on AI Safety via Debate

106

Ω 50

Authors and Acknowledgements

Overview

Things we did in Q3

Early iteration

Early problems and strategies

Questions we’re using

Current debate rules

Ambiguity problem

Why cross-examination?

Worked example

Implementations for human debaters

Current concerns

Long computation problem

Footnotes

106

Ω 50

106

Ω 50