This is an update on the work on AI Safety via Debate that we previously wrote about here.
Authors and Acknowledgements
The researchers on this project were Elizabeth Barnes and Paul Christiano, with substantial help from William Saunders (who built the current web interface and helped in other ways), Joe Collman (who helped develop the structured debate mechanisms), and Mark Xu, Chris Painter, Mihnea Maftei and Ronny Fernandez (who took part in many debates as well as helping think through problems). We're also grateful to Geoffrey Irving and Evan Hubinger for feedback on drafts, and for helpful conversations, along with Richard Ngo, Daniel Ziegler, John Schulman, Amanda Askell and Jeff Wu. Finally, we're grateful to our contractors who participated in experiments, including Adam Scherlis, Kevin Liu, Rohan Kapoor and Kunal Sharda.
What we did
We tested the debate protocol introduced in AI Safety via Debate with human judges and debaters. We found various problems and improved the mechanism to fix these issues (details of these are in the appendix). However, we discovered that a dishonest debater can often create arguments that have a fatal error, but where it is very hard to locate the error. We don’t have a fix for this “obfuscated argument” problem, and believe it might be an important quantitative limitation for both IDA and Debate.
Key takeaways and relevance for alignment
Our ultimate goal is to find a mechanism that allows us to learn anything that a machine learning model knows: if the model can efficiently find the correct answer to some problem, our mechanism should favor the correct answer while only requiring a tractable number of human judgements and a reasonable number of computation steps for the model. 
We’re working under a hypothesis that there are broadly two ways to know things: via step-by-step reasoning about implications (logic, computation…), and by learning and generalizing from data (pattern matching, Bayesian updating…).
Debate focuses on verifying things via step-by-step reasoning. It seems plausible that a substantial proportion of the things a model ‘knows’ will have some long but locally human-understandable argument for their correctness.  Previously we hoped that debate/IDA could verify any knowledge for which such human-understandable arguments exist, even if these arguments are intractably large. We hoped the debaters could strategically traverse small parts of the implicit large argument tree and thereby show that the whole tree could be trusted.
The obfuscated argument problem suggests that we may not be able to rely on debaters to find flaws in large arguments, so that we can only trust arguments when we could find flaws by recursing randomly, for example because the argument is small enough that we could find a single flaw if one existed, or because the argument is robust enough that it is correct unless it has many flaws. This suggests that while debates may let us verify arguments too large for unaided humans to understand, those arguments may still have to be small relative to the computation used during training.
We believe that many important decisions can’t be justified with arguments small or robust enough to verify in this way. To supervise ML systems that make such decisions, we either need to find some restricted class of arguments for which we believe debaters can reliably find flaws, or we need to be able to trust the representations or heuristics that our models learn from the training data (rather than verifying them in a given case via debate). We have been thinking about approaches like learning the prior to help trust our models’ generalization. This is probably better investigated through ML experiments or theoretical research than through human experiments.
Basic case for the obfuscated argument problem
We can’t see a way to distinguish a certain class of obfuscated dishonest arguments from honest arguments.
The obfuscated arguments are ones constructed such that both debaters know the conclusion is flawed, but:
- The argument is made invalid by the inclusion of a small number of flawed steps
- The argument is sufficiently large that we are unlikely to find a flaw by traversing the argument in any naive way
- Neither debater knows where the flaws are, so the honest debater’s best response to the argument is to state that there’s a small chance of any given step being flawed
The honest arguments we can’t distinguish from these are ones where both debaters know the conclusion is correct, but:
- The argument is sufficiently complex and conjunctive that even a relatively small number of flaws could invalidate the whole argument.
- The argument is sufficiently large that it’s intractable to find one of those flaws by randomly searching.
- The dishonest debater claims that there are enough flaws to invalidate the argument but they can’t tell where they are.
In addition, it’s required that the honest debater can’t convince the judge that the opposing debater should be expected to demonstrate a flaw if a flaw was present. This could either be because this isn’t true, because the honest debater doesn’t know whether it’s true, or because the honest debater knows it’s true but for a reason they can’t convey to the judge via debate (for example, an intuition based on seeing many examples).
What the judge sees in these cases is very hard to distinguish - there’s a complicated argument, one debater claims that there’s a sufficiently low probability of a flaw that the argument should be trusted, the other debater claims there’s a sufficiently high probability of a flaw somewhere in the argument that it shouldn’t be trusted. In the honest case, any particular step in the argument that’s checked is always correct. In the dishonest case, any particular step in the argument that’s checked is almost always correct.
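To make "almost always correct" concrete, here is a minimal sketch of how unlikely a random check is to land on a flaw. The function name `prob_find_flaw` and the numbers in the loop are hypothetical, and the sketch assumes the judge checks distinct steps chosen uniformly at random.

```python
def prob_find_flaw(total_steps: int, flawed_steps: int, checks: int) -> float:
    """Chance that `checks` distinct, uniformly random steps include a flawed one."""
    if checks > total_steps:
        raise ValueError("cannot check more steps than the argument contains")
    miss = 1.0  # probability that every step checked so far was sound
    for i in range(checks):
        sound_left = max(total_steps - flawed_steps - i, 0)
        miss *= sound_left / (total_steps - i)
    return 1.0 - miss


# Hypothetical argument: one million steps, three of them flawed.
for checks in (10, 1_000, 100_000):
    print(checks, prob_find_flaw(1_000_000, 3, checks))
```

Under these assumptions the chance of catching a flaw only becomes substantial once the number of checks is a sizeable fraction of the argument, which is the regime where the judge is no longer saving much work.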
If these obfuscated dishonest arguments were distinctive in some way, we could spot them and discredit them. However, there doesn’t seem to be an obvious way to do this.
Geoffrey and others raised this general problem several years ago (e.g. here) - there are cases where optimal play for the debaters is computationally intractable, such that we might expect debate to fail. The only new thing we’re presenting here is the observation that, in natural-language human debates, a dishonest debater can frequently construct such situations in a way that seems hard to distinguish from honest play. This suggests there might be fairly general dishonest strategies that exploit this weakness, rather than it being a problem you run into in certain circumstances.
(We don’t believe this particular example would actually present a problem for debate, but we give it here as a simplistic illustration - more realistic examples are presented later on.)
Consider the question “Does RSA-2048 have any prime factors?”
Honest demonstrates that some primality test T answers that RSA-2048 is composite, and gives a correctness proof for T, in the form:
(1) Lemma L is true
(2) Lemma L implies that T is always correct in identifying composites
Therefore RSA-2048 must have prime factors.
Dishonest picks some X of an appropriate size and gives the argument:
(1) There is no prime factor less than X
(2) There is no prime factor greater than X
Therefore it has no prime factors
Honest’s only possible response to Dishonest’s argument is "I'm not sure which of (1) or (2) is wrong, but I know at least one of them must be since my argument is good!" If Honest reliably knew which side to object to, then they would be able to factor RSA-2048.
However, Dishonest can say exactly the same thing - they claim that they don’t know which step in Honest’s proof is incorrect, but they know at least one step is incorrect.
If we continue expanding the steps of each argument, in both cases we’ll get to some statement where one debater claims it’s correct, and the other claims it’s correct with high probability but has a small chance of being incorrect.
These two situations are almost symmetrical, apart from the fact that:
- In the honest argument, the claim will always be correct
- In the dishonest argument, the claim will almost always be correct, but in very rare cases (when we actually find the factors) the claim will be wrong.
With our current mechanism, we can’t tell these two situations apart in any clever way. We can only reliably distinguish them if we check enough steps of the honest argument to ascertain that none of the claims are incorrect.
If we have sufficient compute budget to check that many steps, then instead of doing some complicated debate mechanism we can just have both debaters write out their entire proof, and run a checker (an ML model trained to imitate a human) over them to see if there are any flaws.
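As a rough illustration of Dishonest's argument, the sketch below expands it for a small stand-in number rather than RSA-2048. The helper names, the midpoint split (the text instead picks X by subjective probability), and the example number 10403 = 101 × 103 are all assumptions made for illustration.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check, adequate for small illustrative numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def leaf_is_true(n: int, d: int) -> bool:
    """Leaf claim made by Dishonest: 'd is not a prime factor of n'."""
    return not (n % d == 0 and is_prime(d))


def count_leaves(n: int, lo: int, hi: int) -> tuple:
    """Expand the claim 'n has no prime factor in [lo, hi)' down to single numbers.

    Returns (total_leaves, false_leaves).
    """
    if hi - lo == 1:
        return (1, 0) if leaf_is_true(n, lo) else (1, 1)
    mid = (lo + hi) // 2  # assumption: split at the midpoint
    left_total, left_false = count_leaves(n, lo, mid)
    right_total, right_false = count_leaves(n, mid, hi)
    return (left_total + right_total, left_false + right_false)


n = 10403  # hypothetical stand-in for RSA-2048
total, false_leaves = count_leaves(n, 2, n)
print(total, false_leaves, false_leaves / total)
```

Only the leaves that name an actual prime factor are false, so a debater who cannot factor the number has no better strategy than guessing which branch hides them.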
Problems with the example
In this example, the honest and dishonest arguments look qualitatively very different. The honest debater might be able to convince the judge that the primality test proof is much easier than the factoring problem, such that absence of evidence of a mistake is evidence of absence in their proof but not in dishonest’s proof. It seems like we should be able to identify and taboo dishonest arguments of this flavor. However, in practical cases it seems like there are obfuscated dishonest arguments that are harder to distinguish from honest arguments. There are various historical examples of incorrect proofs that were believed to be correct for years.^4
Practical examples of hard-to-distinguish obfuscated arguments
As far as we can tell, it’s possible to construct obfuscated dishonest arguments that aren’t distinguishable from honest arguments for all of the physics questions we’ve been studying.
How to read the trees:
‘H’ means the honest answer and ‘D’ dishonest. The children of each node are subquestions that, taken together with the dishonest answers to these subquestions, are required to imply that the dishonest answer to the question is best. I.e., in the first tree nodes 1 and 2, with answers ‘yes’ and ‘yes’ should imply that ‘It speeds up’ is the correct answer.
The argument has a flaw if one of the nodes is not implied by its children, or if one of the dishonest answers is obviously worse than the corresponding honest answer.
There doesn’t seem to be an easy way to find the flaw in these arguments, apart from looking over every single node. At all of the early branch points it is unclear which half of the tree contains the mistake.
Are these obfuscated arguments distinctive from honest arguments in some way?
One possible distinctive feature is that these arguments differ from the reason a debater would actually have come to a belief. Unfortunately, even an honest debater may need to give an argument that doesn’t mirror the structure of the reasoning they used to arrive at the conclusion. In their own reasoning they may have used some tricks or simplifications based on intuitions or instincts, but in a debate they would need to either justify those intuitions to the judge or lay out the whole argument without the shortcuts.
Another possible distinction is that these arguments are long, complicated, and conjunctive. Unfortunately, the honest arguments can also end up being long and relatively conjunctive - an error in any one of a large number of nodes could undermine the whole argument.
Here are two partial honest argument trees to compare to the dishonest trees above
Things to note:
- Less than half the tree is expanded, and there are already around 30 nodes in each tree
- The largest disagreement in the leaves is ~100% vs 96%
- In order to favour the honest debater in the leaves, the judge often needs to be >=99% confident, which is probably unreasonable ^5
Things to note:
- There are >20 nodes and we have only reached a leaf in one branch
- Honest would probably lose if node 9 is judged - I didn’t think hard enough about what Wikipedia quotes would be available. Honest would need to backtrack and restructure this part of the argument and talk about all the forces on the iceberg, including gravity, buoyancy in water, and buoyancy in air
Examples in unstructured debates
These arguments are not just an artifact of our structured debate mechanism - they can also be used to sow confusion in a non-structured debate setting. However, in these free-text cases the usefulness of these obfuscated arguments for the dishonest debater is less obvious, because they already have many tactics they can use - ambiguity, trying to turn attention away from the weak parts of their argument, or deliberately misinterpreting the other debater.
Here’s an example of this sort of argument being used in a free-text debate that has an ‘explainer v questioner’ structure - one debate where H is giving an argument and D is questioning parts of the argument, and vice versa.
In addition, here and here are examples of this sort of tactic being used in a more natural, totally free-text-exchange debate. (The usefulness is even less clear here, as dishonest has much more freedom to generally sabotage the usefulness of the debate)
General method of constructing these arguments
The algorithm for constructing these arguments is something like:
- Start with an answer you need to defend, that may or may not be true
- Pick a sub-claim that’s related to the claim being debated, that’s probably true if your answer is correct, and is about 50% likely to be true in the case where your answer is wrong
- Make sure the subclaim is a substantive enough claim that, if it’s actually false but you assume it’s true, you can use it to support your answer
- Make sure this subclaim sounds like the sort of claim an honest debater would make
- Argue that this subclaim is true, and that it implies your answer
After repeating this procedure many times, either you’re defending a claim that is in fact true, or you have a large number of fairly substantive and relevant false claims that you’re allowed to assume, which should allow you to easily defend a false claim.
Relevance for alignment
We claim that a key step for alignment is to ensure that our models ‘honestly tell us everything they know’. This is a vague statement, but the operationalisation is along these lines: if a model is recommending some plan because it believes the plan has particular consequences, the model should also tell us what those consequences are.
This is hard to achieve for problems where we don’t have ground truth, as any incentive structure we set up is likely to reward ‘give the answer that sounds plausible to human supervisors’ over ‘give the answer you believe is best’.
The justification for why this is a key property is something like:
- We’re focusing on harms due to misalignment rather than incompetence
- In this case, the model knows what the consequences of some dangerous action are going to be, and knows that we won’t like them
- If this is the case, then knowing everything the model knows would allow a supervisor to also foresee the consequences of the dangerous action, and allow the supervisor to train the model not to make these types of plans or take these types of actions
Original debate hope
In the original debate paper, it was shown that a debate-like mechanism can provide a supervision signal for problems in PSPACE using a polynomial-time judge. We extended this to NEXP with the cross-examination mechanism. This suggested that as long as there is some (potentially exponentially large) argument for the correct answer where a human can understand each step, we can provide a supervision signal to incentivise honestly reporting the correct answer - even if it’s completely intractable for the human to directly check the argument themselves.
However, both of these results needed to assume that the debaters are computationally unbounded. Obviously, we didn’t expect that the debaters would actually be computationally unbounded - just that they might have clever representations that would permit them to solve problems that would be exponentially large in human-understandable representations. We hoped that debate would allow the debaters to strategically unpack parts of their clever representations for a human judge to verify, without having to go through the whole argument in human-understandable language, which would take an exponentially large number of steps.
We now believe this hope was overly optimistic: if the honest debater is bounded, then we can essentially only trust conclusions that can be verified using the debater’s computational resources. In the complexity-theoretic analogy, if the debaters are in P then we can only trust conclusions in MA.
It might seem rather academic and irrelevant to complain about being constrained to MA. However, we think that this has caused serious problems in human debate experiments, and might be a significant obstacle in the ML case too.
The class of situations where human experts are justified in having 90% confidence in an answer seems much larger than the class of situations where they can, in a reasonable length of time, write down an explicit argument that shows the answer is correct with 90% confidence, where every step of this argument is understandable and verifiable to a layperson.
This problem occurs because the structure of the reason the expert knows the answer can be very different from any argument that’s understandable to a layperson. The expert will have more sophisticated concepts, which can allow them to make a much more compact argument than if they have to make the argument without these concepts. Furthermore, the reason the expert trusts these concepts may be based on experience and an intuitive sense of how to approach the concept, rather than any logical argument based on simpler principles.
So this is a problem when:
- The model has learnt rich concepts/heuristics/instincts that allow it to answer the original problem quickly
- It's possible to construct obfuscated arguments such that the model's concepts don't help it find the flaws in the argument; i.e., the structure of the argument is very different from the model's concepts
- Either the only human-understandable arguments are inscrutable to the model (any human-understandable argument is very long and unwieldy, and the model wouldn’t necessarily be able to locate an error if there was one)
- Or there are human-understandable arguments that are scrutable to the model (they match the structure of the model’s concepts sufficiently well that the model would be able to locate a flaw in the argument if there was one), but the model is not able to convince the judge that these arguments are in fact scrutable
Previous problems and solutions
The most recent set of rules for ML debates is here. This incorporates the mechanisms designed to address all of the problems discussed below, but doesn’t solve the obfuscated argument problem.
Problem: the dishonest debater can be evasive, avoid giving precise answers to questions, and try to steer the debate away from the weak parts of their argument.
Solution: Structured Debate
We introduced an explicit structured format to debate, where the debaters each state a set of claims, may provide objections to the other debater’s claims and then one claim-objection is selected to recurse on for the next “round” of debate (for example, both debaters could propose where to recurse, and we could select which of these choices to recurse on at random). The winner of the debate is decided based on who wins the final round of debate.
Problem: Misleading Implications
The dishonest debater can try to make an argument where each of the claims is correct, but they have an overall misleading implication. The honest debater can’t address the misleading implication because they need to pick one of the claims to recurse into.
Solution: have each argument include an additional meta-debate question “given the questions and answers provided in this round, which answer is better?”. If the dishonest debater’s claims don’t support the argument, then honest can recurse into this meta-debate question and argue why the implication is misleading there.
Problem: Truth value of claims is ambiguous
Deciding whether a claim is ‘true’ is notoriously ambiguous (e.g. is ‘The King of France is bald’ true or false?). We’d like a more operationalised notion of truthfulness/helpfulness/usefulness
Solution: the debate is about which answer to a question is better, not whether a claim is true or false. In a round, both debaters give several subquestions and answers, and may propose alternative answers to their opponents’ subquestions.
Problem: language is ambiguous. The dishonest debater can make ambiguous statements in an early round of debate, and then in a later round claim that they meant whatever meaning is most convenient given the particular part of their argument that is being discussed. This problem is similar to the Motte and Bailey fallacy.
Solution: cross-examination. Whenever a statement is made, save a copy of the debater who made that statement. Later in the debate, allow the opponent to ask clarifying questions to that copy of the debater (where the copy will not know which branch of the debate was recursed into). A debater can cite quotes from cross-examination to exhibit inconsistencies in the other debater’s argument. This forces the dishonest debater to either commit to all the details of their argument ahead of time (in which case the honest debater can focus on the flaw), or to answer questions inconsistently (in which case the honest debater can exhibit this inconsistency to the judge).
Problem: if we decide which part of the debate to recurse into by randomly selecting between the honest and the dishonest debater, then the dishonest debater can repeatedly ask the same question and try to recurse into it whenever they have control. This means the debate can go on indefinitely without reaching a node that the judge can understand.
Solution: Recursion Payments
Choosing which branch to recurse into requires the debater “paying” some small amount of their score. At the end of each round, both debaters state whether they are willing to pay to recurse. If neither is willing to pay, the debate ends. If one is willing to pay, they get to choose where to recurse. If both are willing to pay, then one of the debaters is selected at random.
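The end-of-round rule above can be sketched as a small function. This is only an illustration: the name `choose_recurser` and the labels "A" and "B" are hypothetical, and the payment itself is not modelled because the text does not say how much is charged or who is charged when both debaters are willing to pay.

```python
import random
from typing import Optional


def choose_recurser(a_pays: bool, b_pays: bool, rng: random.Random) -> Optional[str]:
    """Return the debater who picks where to recurse, or None if the debate ends."""
    if not a_pays and not b_pays:
        return None  # neither is willing to pay, so the debate ends here
    if a_pays and b_pays:
        return rng.choice(["A", "B"])  # both willing: select one at random
    return "A" if a_pays else "B"  # exactly one willing: they choose


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(choose_recurser(True, False, rng))
print(choose_recurser(False, False, rng))
```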
Problem: correctly handling probabilities
We need the debaters to be able to state probabilities in cases where they have empirical or logical uncertainty. But in this case we need to adjust the scoring - we can’t simply declare that the debater who gave the better answer is the winner. If the dishonest debater claims something has a 99.99% chance of being true, and the honest answer is that it has a 98% chance of being true, then the dishonest debater will be preferred 98% of the time. Training the debaters to maximise their win probability will lead to dishonest behaviour.
Solution: scoring rules.
We use a proper scoring rule (KL distance, or informally ‘amount of evidence for one debater over another’) to score the accuracy of debaters’ probabilistic claims. Debaters are trained to maximise expected score rather than win probability.
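As a hedged illustration using the 98% versus 99.99% figures from the problem statement, the sketch below compares expected scores under the logarithmic scoring rule. Treating the log score as the rule in use is an assumption made for this example; the function names are hypothetical.

```python
import math


def expected_log_score(true_p: float, reported_p: float) -> float:
    """Expected log score of reporting `reported_p` when the event has probability `true_p`."""
    return true_p * math.log(reported_p) + (1 - true_p) * math.log(1 - reported_p)


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence from Bernoulli(p) to Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


honest = expected_log_score(0.98, 0.98)
overconfident = expected_log_score(0.98, 0.9999)

print(honest, overconfident)
# The expected-score gap equals the KL divergence between the two reports.
print(honest - overconfident, kl_bernoulli(0.98, 0.9999))
```

Under this rule the overconfident report still looks better whenever the event happens, but the large penalty in the remaining cases makes its expected score lower than that of the honest report.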
Problem: Dependent Answers
When giving probability distributions over answers, the likelihood of an answer to one question might depend on the answer to another question. For example, dishonest might argue that if claim A and claim B are true, then the honest answer to the overall question is wrong. The honest debater might not know which of A and B is wrong, but might be confident for other reasons that their argument is correct, so if A is true then B is false, and vice versa. This isn’t properly represented if honest just gives probabilities of 50% to each of A and B individually.
Solution: both debaters’ answers to a question authored by D1 should be conditioned on D1’s answers to D1’s previous questions. So in the example, honest should say that A is 50% likely to be true, and B is 0% likely to be true conditioned on A being true.
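A small sketch of the A and B example shows what is lost when only the individual probabilities are reported. It assumes, for illustration, that honest believes exactly one of A and B is false and has no idea which; the variable names are hypothetical.

```python
# Honest's assumed belief: each world is (A is true, B is true) with its probability.
belief = {(True, False): 0.5, (False, True): 0.5}


def prob(event) -> float:
    """Probability of an event, given as a predicate over (a, b)."""
    return sum(p for world, p in belief.items() if event(*world))


p_a = prob(lambda a, b: a)
p_b = prob(lambda a, b: b)
p_a_and_b = prob(lambda a, b: a and b)

p_b_given_a = p_a_and_b / p_a  # the answer honest should give once A is granted
naive_joint = p_a * p_b        # what a judge might infer from the marginals alone

print(p_a, p_b, p_b_given_a, naive_joint)
```

The marginals alone are consistent with both claims holding a quarter of the time, whereas the conditional answer makes clear that honest thinks they never hold together.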
Practical Problems with Human experiments
Problem: judge quality
Using Positly workers as judges resulted in low quality judgements, which seemed at least partly due to judge inattentiveness or poor critical reasoning skills.
Solution: we developed a short screening test for attentiveness and critical reasoning. We paired this with a physics quiz with problems significantly easier than the debate problems to ensure that the physics problems were still hard for judges. We accepted judges who got 5/5 critical reasoning questions correct and at least one of the easy physics problems wrong.
Problem: Ambiguous ground truth
The physics problems we’re using can sometimes have an ambiguous meaning, such that the debaters are unsure which answer is correct or what kinds of assumptions can be made. We don’t want the debates to revolve around arguing about definitions and interpretations, as this is mostly an artifact of our experimental setup. Even if we were to try to write long explanations for what exactly the question was asking, there might be some ambiguity remaining.
Solution: allow debaters to ask questions to a “question-definer”, who provides ground truth answers about what the original question means. The question-definer doesn’t see the debate, only the questions about the meaning of the problem wording. The question-definer can read other example problems from the same problem set to get a sense of what assumptions and simplifications are being made, but they don’t get to see the answer to the particular problem being debated. The question-definer should have judge-level physics ability.
Problem: Cross-examining Humans
We can’t save a copy of a human when they make a statement in a debate to be cross-examined later, as we could with an ML model.
Solution 1: backtracking
Human debaters should communicate as much as possible in advance about how they intend to answer CX questions. If debater 2 gives an answer to a CX question that was unexpected for debater 1, allow debater 1 to backtrack to the point at which the term being asked about was first used, and change their answers and recursion decisions if they wish.
Solution 1 requires some good-faith play from the debaters to ensure we don’t have to backtrack an impractically large number of times. We’d instead like to incentivise debaters to communicate clearly upfront about their intentions, so the other debater is surprised as little as possible.
Solution 2: team debate
When a cross-examination question happens, bring in another debater who’s on the ‘team’ of the debater who’s being asked the question. This teammate can only see the debate up to the point where the original statement is made, then must answer the cross-examination questions being posed from further down the debate tree.
This mechanism incentivises the debaters to write clear notes about how they want to respond to questions about various statements they’ve made.
Formal logic example
Consider arguing for a false claim F.
You should choose some A that has 50% probability and make two arguments:
(1) A
(2) A => F
To defend (1), you should pick some B that has 50% probability of being true and make two claims:
(1.1) B => A
(1.2) (B => A) => A
To defend (2), you can do the same thing; pick some C that has 50% probability of being true and make two claims:
(2.1) C => (A => F)
(2.2) (C => (A => F)) => (A => F)
Now we can show that exactly one of (1.1), (1.2), (2.1), (2.2) is false, and that each is equally likely to be the false one.
Case 1: A is true
Clearly (1.1) and (1.2) are true.
If C is false, (2.1) is true. If C is true, (2.2) is true.
Case 2: A is false
(A => F) is true, so clearly (2.1) and (2.2) are true.
If B is false, (1.1) is true. If B is true, (1.2) is true.
It’s clear we can continue this strategy to whatever depth we like, and end up with an arbitrarily small chance of any particular leaf being false.
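The case analysis above can be checked by brute force. The sketch below enumerates every truth assignment to A, B and C with F fixed to false, and assumes A, B and C are independent fair coin flips so that each assignment is equally likely; the helper names are hypothetical.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication p => q."""
    return (not p) or q


def leaves(a: bool, b: bool, c: bool, f: bool = False) -> dict:
    """Truth values of the four leaf claims for one assignment (F is false by default)."""
    a_implies_f = implies(a, f)
    return {
        "1.1": implies(b, a),
        "1.2": implies(implies(b, a), a),
        "2.1": implies(c, a_implies_f),
        "2.2": implies(implies(c, a_implies_f), a_implies_f),
    }


false_counts = {name: 0 for name in ("1.1", "1.2", "2.1", "2.2")}
for a, b, c in product([False, True], repeat=3):
    false_leaves = [name for name, value in leaves(a, b, c).items() if not value]
    assert len(false_leaves) == 1  # exactly one leaf is false in every assignment
    false_counts[false_leaves[0]] += 1

print(false_counts)
```

Each leaf turns out to be the false one in the same number of assignments, which under the independence assumption corresponds to a 25% chance per leaf at this depth.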
See the appendix for slightly more detail on how this relates to alignment. ↩︎
This is the factored cognition hypothesis - this is roughly equivalent to the claim that the answer could be computed by a large HCH tree. This human-understandable argument or HCH tree may not necessarily mirror the structure of the reasoning the model used to arrive at the answer - the model may have used human-incomprehensible heuristics or shortcuts learnt from data. ↩︎
I.e., of a size such that the debaters’ subjective probability of there being no factors smaller than X is roughly the same as their subjective probability of there being no factors larger than X. ↩︎
For example, in a formal logic setting, if your answer is A, you can pick some proposition B that you think is about 50% likely to be true, and use the subclaim !A => !B. Full example is under Formal Logic Example ↩︎