Debate update: Obfuscated arguments problem

Beth Barnes

This is an update on the work on AI Safety via Debate that we previously wrote about here.

Authors and Acknowledgements

The researchers on this project were Elizabeth Barnes and Paul Christiano, with substantial help from William Saunders (who built the current web interface and helped in other ways), Joe Collman (who helped develop the structured debate mechanisms), and Mark Xu, Chris Painter, Mihnea Maftei and Ronny Fernandez (who took part in many debates as well as helping think through problems). We're also grateful to Geoffrey Irving and Evan Hubinger for feedback on drafts, and for helpful conversations, along with Richard Ngo, Daniel Ziegler, John Schulman, Amanda Askell and Jeff Wu. Finally, we're grateful to our contractors who participated in experiments, including Adam Scherlis, Kevin Liu, Rohan Kapoor and Kunal Sharda.

What we did

We tested the debate protocol introduced in AI Safety via Debate with human judges and debaters. We found various problems and improved the mechanism to fix these issues (details of these are in the appendix). However, we discovered that a dishonest debater can often create arguments that have a fatal error, but where it is very hard to locate the error. We don’t have a fix for this “obfuscated argument” problem, and believe it might be an important quantitative limitation for both IDA and Debate.

Key takeaways and relevance for alignment

Our ultimate goal is to find a mechanism that allows us to learn anything that a machine learning model knows: if the model can efficiently find the correct answer to some problem, our mechanism should favor the correct answer while only requiring a tractable number of human judgements and a reasonable number of computation steps for the model. ^[1]

We’re working under a hypothesis that there are broadly two ways to know things: via step-by-step reasoning about implications (logic, computation…), and by learning and generalizing from data (pattern matching, Bayesian updating…).

Debate focuses on verifying things via step-by-step reasoning. It seems plausible that a substantial proportion of the things a model ‘knows’ will have some long but locally human-understandable argument for their correctness. ^[2] Previously we hoped that debate/IDA could verify any knowledge for which such human-understandable arguments exist, even if these arguments are intractably large. We hoped the debaters could strategically traverse small parts of the implicit large argument tree and thereby show that the whole tree could be trusted.

The obfuscated argument problem suggests that we may not be able to rely on debaters to find flaws in large arguments, so that we can only trust arguments when we could find flaws by recursing randomly---e.g. because the argument is small enough that we could find a single flaw if one existed, or because the argument is robust enough that it is correct unless it has many flaws. This suggests that while debates may let us verify arguments too large for unaided humans to understand, those arguments may still have to be small relative to the computation used during training.

We believe that many important decisions can’t be justified with arguments small or robust enough to verify in this way. To supervise ML systems that make such decisions, we either need to find some restricted class of arguments for which we believe debaters can reliably find flaws, or we need to be able to trust the representations or heuristics that our models learn from the training data (rather than verifying them in a given case via debate). We have been thinking about approaches like learning the prior to help trust our models’ generalization. This is probably better investigated through ML experiments or theoretical research than through human experiments.

Basic case for the obfuscated argument problem

We can’t see a way to distinguish a certain class of obfuscated dishonest arguments from honest arguments.

The obfuscated arguments are ones constructed such that both debaters know the conclusion is flawed, but:

The argument is made invalid by the inclusion of a small number of flawed steps
The argument is sufficiently large that we are unlikely to find a flaw by traversing the argument in any naive way
Neither debater knows where the flaws are, so the honest debater’s best response to the argument is to state that there’s a small chance of any given step being flawed

The honest arguments we can’t distinguish from these are ones where both debaters know the conclusion is correct, but:

The argument is sufficiently complex and conjunctive that even a relatively small number of flaws could invalidate the whole argument.
The argument is sufficiently large that it’s intractable to find one of those flaws by randomly searching.
The dishonest debater claims that there are enough flaws to invalidate the argument but they can’t tell where they are.

In addition, it’s required that the honest debater can’t convince the judge that the opposing debater should be expected to demonstrate a flaw if a flaw was present. This could either be because this isn’t true, because the honest debater doesn’t know whether it’s true, or because the honest debater knows it’s true but for a reason they can’t convey to the judge via debate (for example, an intuition based on seeing many examples).

What the judge sees in these cases is very hard to distinguish - there’s a complicated argument, one debater claims that there’s a sufficiently low probability of a flaw that the argument should be trusted, the other debater claims there’s a sufficiently high probability of a flaw somewhere in the argument that it shouldn’t be trusted. In the honest case, any particular step in the argument that’s checked is always correct. In the dishonest case, any particular step in the argument that’s checked is almost always correct.

If these obfuscated dishonest arguments were distinctive in some way, we could spot them and discredit them. However, there doesn’t seem to be an obvious way to do this.

Geoffrey and others raised this general problem several years ago (e.g. here) - there are cases where optimal play for the debaters is computationally intractable, such that we might expect debate to fail. The only new thing we’re presenting here is the observation that, in natural-language human debates, a dishonest debater can frequently construct such situations in a way that seems hard to distinguish from honest play. This suggests there might be fairly general dishonest strategies that exploit this weakness, rather than it being a problem you run into in certain circumstances.

Cryptographic example

(We don’t believe this particular example would actually present a problem for debate, but we give it here as a simplistic illustration - more realistic examples are presented later on.)

Consider the question “Does RSA-2048 have any prime factors?”

Honest demonstrates that some primality test T answers that RSA-2048 is composite, and gives a correctness proof for T, in the form:

(1) Lemma L is true

(2) Lemma L implies that T is always correct in identifying composites

Therefore RSA-2048 must have prime factors.

Dishonest picks some X of an appropriate size ^[3] and gives the argument:

(1) There is no prime factor less than X

(2) There is no prime factor greater than X

Therefore it has no prime factors

Honest’s only possible response to Dishonest’s argument is "I'm not sure which of (1) or (2) is wrong, but I know at least one of them must be since my argument is good!" If Honest reliably knew which side to object to, then they would be able to factor RSA-2048.

However, Dishonest can say exactly the same thing - they claim that they don’t know which step in Honest’s proof is incorrect, but they know at least one step is incorrect.

If we continue expanding the steps of each argument, in both cases we’ll get to some statement where one debater claims it’s correct, and the other claims it’s correct with high probability but has a small chance of being incorrect.

These two situations are almost symmetrical, apart from the fact that:

In the honest argument, the claim will always be correct
In the dishonest argument, the claim will almost always be correct, but in very rare cases (when we actually find the factors) the claim will be wrong.

With our current mechanism, we can’t tell these two situations apart in any clever way. We can only reliably distinguish them if we check enough steps of the honest argument to ascertain that none of the claims are incorrect.

If we have sufficient compute budget to check that many steps, then instead of doing some complicated debate mechanism we can just have both debaters write out their entire proof, and run a checker (an ML model trained to imitate a human) over them to see if there are any flaws.
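To make the asymmetry concrete, here is a minimal sketch of the fully expanded dishonest argument on a toy number. It assumes the argument bottoms out in one leaf claim of the form "d is not a factor of N" per candidate d, and it uses a small semiprime as a stand-in for RSA-2048; both the leaf shape and the helper names are our illustrative assumptions, not part of the debate mechanism.

```python
from fractions import Fraction
from typing import List, Tuple

def leaf_claims(n: int) -> List[Tuple[int, bool]]:
    """One leaf per candidate d in [2, n): the claim 'd is not a factor of n'.

    The boolean records whether that leaf claim is actually true.
    """
    return [(d, n % d != 0) for d in range(2, n)]

def chance_random_leaf_is_false(n: int) -> Fraction:
    """Probability that a uniformly random leaf of the argument is flawed."""
    leaves = leaf_claims(n)
    false_count = sum(1 for _, holds in leaves if not holds)
    return Fraction(false_count, len(leaves))

# Toy stand-in for RSA-2048 (assumption: any small semiprime would do).
n = 101 * 103
flawed = [d for d, holds in leaf_claims(n) if not holds]

print(flawed)                          # [101, 103]
print(chance_random_leaf_is_false(n))  # 2/10401
```

Even at this tiny scale, only 2 of 10401 leaves are false, so a judge who checks a handful of random leaves will almost certainly see only correct claims. For a 2048-bit modulus the fraction of flawed leaves is astronomically smaller.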

Problems with the example

In this example, the honest and dishonest arguments look very qualitatively different. The honest debater might be able to convince the judge that the primality test proof is much easier than the factoring problem, such that absence of evidence of a mistake is evidence of absence in their proof but not in dishonest’s proof. It seems like we should be able to identify and taboo dishonest arguments of this flavor. However, in practical cases it seems like there are obfuscated dishonest arguments that are harder to distinguish from honest arguments. There are various historical examples of incorrect proofs that were believed to be correct for years.^4

Practical examples of hard-to-distinguish obfuscated arguments

As far as we can tell, it’s possible to construct obfuscated dishonest arguments that aren’t distinguishable from honest arguments for all of the physics questions we’ve been studying.

Here are a few examples of argument trees employing this tactic, for different physics questions: iceberg problem, circular train track problem, ball on a string problem

How to read the trees:
‘H’ marks the honest answer and ‘D’ the dishonest one. The children of each node are subquestions that, taken together with the dishonest answers to these subquestions, are required to imply that the dishonest answer to the question is best. For example, in the first tree, nodes 1 and 2, with answers ‘yes’ and ‘yes’, should imply that ‘It speeds up’ is the correct answer.

The argument has a flaw if one of the nodes is not implied by its children, or if one of the dishonest answers is obviously worse than the corresponding honest answer.

There doesn’t seem to be an easy way to find the flaw in these arguments, apart from looking over every single node. At all of the early branch points it is unclear which half of the tree contains the mistake.

Are these obfuscated arguments distinctive from honest arguments in some way?

One possible distinctive feature is that these arguments differ from the reason a debater would actually have come to a belief. Unfortunately, even an honest debater may need to give an argument that doesn’t mirror the structure of the reasoning they used to arrive at the conclusion. In their own reasoning they may have used some tricks or simplifications based on intuitions or instincts, but in a debate they would need to either justify those intuitions to the judge or lay out the whole argument without the shortcuts.

Another possible distinction is that these arguments are long, complicated, and conjunctive. Unfortunately, the honest arguments can also end up being long and relatively conjunctive - an error in any one of a large number of nodes could undermine the whole argument.

Here are two partial honest argument trees to compare to the dishonest trees above:

Honest argument for ‘Switch’

Things to note:

Less than half the tree is expanded, and there are already around 30 nodes in each tree
The largest disagreement in the leaves is ~100% vs 96%
In order to favour the honest debater in the leaves, the judge often needs to be >=99% confident, which is probably unreasonable ^5

Honest argument for ‘Three bergs’:

Things to note:

There are >20 nodes and we have only reached a leaf in one branch
Honest would probably lose if node 9 is judged - I didn’t think hard enough about what Wikipedia quotes would be available. Honest would need to backtrack and restructure this part of the argument and talk about all the forces on the iceberg, including gravity, buoyancy in water, and buoyancy in air.

Examples in unstructured debates

These arguments are not just an artifact of our structured debate mechanism - they can also be used to sow confusion in a non-structured debate setting. However, in these free-text cases the usefulness of these obfuscated arguments for the dishonest debater is less obvious, because they already have many tactics they can use - ambiguity, trying to turn attention away from the weak parts of their argument, or deliberately misinterpreting the other debater.

Here’s an example of this sort of argument being used in a free-text debate that has an ‘explainer v questioner’ structure - one debate where H is giving an argument and D is questioning parts of the argument, and vice versa.

In addition, here and here are examples of this sort of tactic being used in a more natural, totally free-text-exchange debate. (The usefulness is even less clear here, as dishonest has much more freedom to generally sabotage the usefulness of the debate)

General method of constructing these arguments

The algorithm for constructing these arguments is something like:

Start with an answer you need to defend, that may or may not be true
Pick a sub-claim that’s related to the claim being debated, that’s probably true if your answer is correct, and is about 50% likely to be true in the case where your answer is wrong^[4]
Make sure the subclaim is a substantive enough claim that, if it’s actually false but you assume it’s true, you can use it to support your answer
Make sure this subclaim sounds like the sort of claim an honest debater would make
Argue that this subclaim is true, and that it implies your answer

After repeating this procedure many times, either you’re defending a claim that is in fact true, or you have a large number of fairly substantive and relevant false claims that you’re allowed to assume, which should allow you to easily defend a false claim.

Appendix

Relevance for alignment

We claim that a key step for alignment is to ensure that our models ‘honestly tell us everything they know’. This is a vague statement, but the operationalisation is along these lines: if a model is recommending some plan because it believes the plan has particular consequences, the model should also tell us what those consequences are.

This is hard to achieve for problems where we don’t have ground truth, as any incentive structure we set up is likely to reward ‘give the answer that sounds plausible to human supervisors’ over ‘give the answer you believe is best’.

The justification for why this is a key property is something like:

We’re focusing on harms due to misalignment rather than incompetence
In this case, the model knows what the consequences of some dangerous action are going to be, and knows that we won’t like them
If this is the case, then knowing everything the model knows would allow a supervisor to also foresee the consequences of the dangerous action, and allow it to train the model not to make these types of plans or take these types of actions

For more details on how to define ‘what the model knows’ see Ascription Universality, and for discussion of how this relates to safety see ‘Inaccessible information’.

Original debate hope

In the original debate paper, it was shown that a debate-like mechanism can provide a supervision signal for problems in PSPACE using a polynomial-time judge. We extended this to NEXP with the cross-examination mechanism. This suggested that as long as there is some (potentially exponentially large) argument for the correct answer where a human can understand each step, we can provide a supervision signal to incentivise honestly reporting the correct answer - even if it’s completely intractable for the human to directly check the argument themselves.

However, both of these results needed to assume that the debaters are computationally unbounded. Obviously, we didn’t expect that the debaters would actually be computationally unbounded - just that they might have clever representations that would permit them to solve problems that would be exponentially large in human-understandable representations. We hoped that debate would allow the debaters to strategically unpack parts of their clever representations for a human judge to verify, without having to go through the whole argument in human-understandable language, which would take an exponentially large number of steps.

We now believe this hope was overly optimistic --- if the honest debater is bounded, then we can essentially only trust conclusions that can be verified using the debater’s computational resources. In the complexity-theoretic analogy, if the debaters are in P then we can only trust conclusions in MA.

It might seem rather academic and irrelevant to complain about being constrained to MA. However, we think that this has caused serious problems in human debate experiments, and might be a significant obstacle in the ML case too.

The class of situations where human experts are justified in having 90% confidence in an answer seems much larger than the class of situations where they can, in a reasonable length of time, write down an explicit argument that shows the answer is correct with 90% confidence, where every step of this argument is understandable and verifiable to a layperson.

This problem occurs because the structure of the reason the expert knows the answer can be very different from any argument that’s understandable to a layperson. The expert will have more sophisticated concepts, which can allow them to make a much more compact argument than if they have to make the argument without these concepts. Furthermore, the reason the expert trusts these concepts may be based on experience and an intuitive sense of how to approach the concept, rather than any logical argument based on simpler principles.

So this is a problem when:

The model has learnt rich concepts/heuristics/instincts that allow it to answer the original problem quickly
It's possible to construct obfuscated arguments such that the model's concepts don't help it find the flaws in the argument; i.e., the structure of the argument is very different from the model's concepts
Either:
- The only human-understandable arguments are inscrutable to the model: any human-understandable argument is very long and unwieldy and the model wouldn’t necessarily be able to locate an error if there was one
Or
- There are human-understandable arguments that are scrutable to the model: they match the structure of the model’s concepts sufficiently well that the model would be able to locate a flaw in the argument if there was one
- However, the model is not able to convince the judge that these arguments are in fact scrutable

Previous problems and solutions

The most recent set of rules for ML debates is here. This incorporates the mechanisms designed to address all of the problems discussed below, but doesn’t solve the obfuscated argument problem.

Problem: Evasiveness

The dishonest debater can be evasive, avoid giving precise answers to questions, and try to steer the debate away from the weak parts of their argument.

Solution: Structured Debate

We introduced an explicit structured format to debate, where the debaters each state a set of claims, may provide objections to the other debater’s claims and then one claim-objection is selected to recurse on for the next “round” of debate (for example, both debaters could propose where to recurse, and we could select which of these choices to recurse on at random). The winner of the debate is decided based on who wins the final round of debate.

Problem: Misleading Implications

The dishonest debater can try to make an argument where each of the claims is correct, but they have an overall misleading implication. The honest debater can’t address the misleading implication because they need to pick one of the claims to recurse into.

Solution: Meta-Debate

Have each argument include an additional meta-debate question “given the questions and answers provided in this round, which answer is better?”. If the dishonest debater’s claims don’t support the argument, then honest can recurse into this meta-debate question and argue why the implication is misleading there.

Problem: Truth value of claims is ambiguous

Deciding whether a claim is ‘true’ is notoriously ambiguous (e.g. is ‘The King of France is bald’ true or false?). We’d like a more operationalised notion of truthfulness/helpfulness/usefulness.

Solution: the debate is about which answer to a question is better, not whether a claim is true or false. In a round, both debaters give several subquestions and answers, and may propose alternative answers to their opponents’ subquestions.

Problem: Ambiguity

Language is ambiguous. The dishonest debater can make ambiguous statements in an early round of debate, and then in a later round claim that they meant whatever meaning is most convenient given the particular part of their argument that is being discussed. This problem is similar to the Motte and Bailey fallacy.

Solution: Cross-Examination

Whenever a statement is made, save a copy of the debater who made that statement. Later in the debate, allow the opponent to ask clarifying questions to that copy of the debater (where the copy will not know which branch of the debate was recursed into). A debater can cite quotes from cross-examination to exhibit inconsistencies in the other debater’s argument. This forces the dishonest debater to either commit to all the details of their argument ahead of time (in which case the honest debater can focus on the flaw), or to answer questions inconsistently (in which case the honest debater can exhibit this inconsistency to the judge).

Problem: Looping
If we decide which part of the debate to recurse into by randomly selecting between the honest and the dishonest debater, then the dishonest debater can repeatedly ask the same question and try to recurse into it whenever they have control. This means the debate can go on indefinitely without reaching a node that the judge can understand.

Solution: Recursion Payments
Choosing which branch to recurse into requires the debater “paying” some small amount of their score. At the end of each round, both debaters state whether they are willing to pay to recurse. If neither is willing to pay, the debate ends. If one is willing to pay, they get to choose where to recurse. If both are willing to pay, then one of the debaters is selected at random.
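The decision rule above can be sketched as a small function. This is only an illustration: the function name and the debater labels are hypothetical, and the sketch leaves out the score deduction itself, since the size of the payment is not specified here.

```python
import random
from typing import Optional

def choose_recursion_owner(d1_pays: bool, d2_pays: bool,
                           rng: random.Random) -> Optional[str]:
    """Return which debater chooses where to recurse, or None if the debate ends.

    d1_pays / d2_pays: whether each debater is willing to pay to recurse.
    """
    if d1_pays and d2_pays:
        # Both are willing: pick one of them at random.
        return rng.choice(["debater_1", "debater_2"])
    if d1_pays:
        return "debater_1"
    if d2_pays:
        return "debater_2"
    # Neither is willing to pay, so the debate ends at this round.
    return None

rng = random.Random(0)
print(choose_recursion_owner(False, False, rng))  # None
print(choose_recursion_owner(True, False, rng))   # debater_1
```

Because every recursion costs the chooser some score, a debater who keeps recursing into the same question eventually loses more than they gain, which is what removes the looping incentive.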

Problem: correctly handling probabilities

We need the debaters to be able to state probabilities in cases where they have empirical or logical uncertainty. But in this case we need to adjust the scoring - we can’t simply declare that the debater who gave the better answer is the winner. If the dishonest debater claims something has a 99.99% chance of being true, and the honest answer is that it has a 98% chance of being true, then the dishonest debater will be preferred 98% of the time. Training the debaters to maximise their win probability will lead to dishonest behaviour.

Solution: scoring rules
We use a proper scoring rule (KL distance, or informally ‘amount of evidence for one debater over another’) to score the accuracy of debaters’ probabilistic claims. Debaters are trained to maximise expected score rather than win probability.
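As a worked check of the numbers in the example above, here is a minimal sketch using the logarithmic scoring rule. The log score is our choice of illustration (an assumption, since the exact rule used in the experiments is not spelled out here); the gap in expected log score between an honest report and any other report equals the KL divergence between the two distributions.

```python
import math

def expected_log_score(reported: float, true_prob: float) -> float:
    """Expected log score of reporting `reported` for an event whose
    actual probability is `true_prob`."""
    return (true_prob * math.log(reported)
            + (1 - true_prob) * math.log(1 - reported))

TRUE_PROB = 0.98    # the honest answer from the example
OVERCLAIM = 0.9999  # the dishonest debater's claim

honest = expected_log_score(TRUE_PROB, TRUE_PROB)
dishonest = expected_log_score(OVERCLAIM, TRUE_PROB)

print(round(honest, 4))     # -0.098
print(round(dishonest, 4))  # -0.1843
```

The overclaiming debater looks better 98% of the time, yet has a lower expected score, because the rare case where the event fails to occur is penalised heavily. Maximising expected score therefore favours reporting 98%.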

Problem: Dependent Answers
When giving probability distributions over answers, the likelihood of an answer to one question might depend on the answer to another question. For example, dishonest might argue that if claim A and claim B are true, then the honest answer to the overall question is wrong. The honest debater might not know which of A and B is wrong, but might be confident for other reasons that their argument is correct, so if A is true then B is false, and vice versa. This isn’t properly represented if honest just gives probabilities of 50% to each of A and B individually.

Solution: Conditioning
Both debaters’ answers to a question authored by D1 should be conditioned on D1’s answers to D1’s previous questions. So in the example, honest should say that A is 50% likely to be true, and B is 0% likely to be true conditioned on A being true.
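The arithmetic behind this example is short enough to write out. The sketch below assumes that a judge who sees two unconditioned 50% answers treats them as independent; that reading is our illustrative assumption about how the unconditioned answers would be misinterpreted.

```python
# Honest's conditioned answers from the example above.
p_a = 0.5          # P(A)
p_b_given_a = 0.0  # P(B | A)

# Probability that A and B both hold, which is what dishonest's argument needs.
joint_conditioned = p_a * p_b_given_a

# Hypothetical misreading: two unconditioned 50% answers taken as independent.
joint_if_independent = 0.5 * 0.5

print(joint_conditioned)     # 0.0
print(joint_if_independent)  # 0.25
```

Unconditioned answers appear to leave a 25% chance that dishonest's premises both hold, while the conditioned answers correctly express that honest assigns this no probability at all.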

Practical Problems with Human experiments

Problem: judge quality
Using Positly workers as judges resulted in low quality judgements, which seemed at least partly due to judge inattentiveness or poor critical reasoning skills.

Solution: screening
We developed a short screening test for attentiveness and critical reasoning. We paired this with a physics quiz, with problems significantly easier than the debate problems, to ensure that the debate problems would still be hard for the judges. We accepted judges who got 5/5 critical reasoning questions correct and at least one of the easy physics problems wrong.

Problem: Ambiguous ground truth
The physics problems we’re using can sometimes have an ambiguous meaning, such that the debaters are unsure which answer is correct or what kinds of assumptions can be made. We don’t want the debates to revolve around arguing about definitions and interpretations, as this is mostly an artifact of our experimental setup. Even if we were to try to write long explanations for what exactly the question was asking, there might be some ambiguity remaining.

Solution: Question-Definer
Allow debaters to ask questions to a “question-definer”, who provides ground truth answers about what the original question means. The question-definer doesn’t see the debate, only the questions about the meaning of the problem wording. The question-definer can read other example problems from the same problem set to get a sense of what assumptions and simplifications are being made, but they don’t get to see the answer to the particular problem being debated. The question-definer should have judge-level physics ability.

Problem: Cross-examining Humans
We can’t save a copy of a human when they make a statement in a debate to be cross-examined later, as we could with an ML model.

Solution 1: backtracking
Human debaters should communicate as much as possible in advance about how they intend to answer cross-examination (CX) questions. If debater 2 gives an answer to a CX question that was unexpected for debater 1, allow debater 1 to backtrack to the point at which the term being asked about was first used, and change their answers and recursion decisions if they wish.

Solution 1 requires some good-faith play from the debaters to ensure we don’t have to backtrack an impractically large number of times. We’d instead like to incentivise debaters to communicate clearly upfront about their intentions, so the other debater is surprised as little as possible.

Solution 2: team debate
When a cross-examination question happens, bring in another debater who’s on the ‘team’ of the debater who’s being asked the question. This teammate can only see the debate up to the point where the original statement is made, then must answer the cross-examination questions being posed from further down the debate tree.
This mechanism incentivises the debaters to write clear notes about how they want to respond to questions about various statements they’ve made.

Formal logic example

Consider arguing for a false claim F

You should choose some A that has 50% probability and make two arguments:

(1) A
(2) A => F

To defend (1), you should pick some B that has 50% probability of being true and make two claims:

(1.1) B => A
(1.2) (B => A) => A

To defend (2), you can do the same thing; pick some C that has 50% probability of being true and make two claims:

(2.1) C => (A => F)
(2.2) (C => (A => F)) => (A => F)

Now we can show that exactly one of (1.1), (1.2), (2.1), (2.2) is false, and that each of the four is equally likely to be the false one.

Case 1: A is true
Clearly (1.1) and (1.2) are true.
If C is false, (2.1) is true. If C is true, (2.2) is true.

Case 2: A is false
(A => F) is true, so clearly (2.1) and (2.2) are true.
If B is false, (1.1) is true. If B is true, (1.2) is true.

It’s clear we can continue this strategy to whatever depth we like, and end up with an arbitrarily small chance of any particular leaf being false.
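The case analysis above can be checked by brute force. The sketch below enumerates every truth assignment to A, B and C with F fixed to false, and counts which leaf is false in each case. It assumes A, B and C are independent fair coin flips, which is how we read "50% probability" here; the helper names are ours.

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication p => q."""
    return (not p) or q

def leaves(a: bool, b: bool, c: bool, f: bool = False) -> dict:
    """Truth values of the four leaf claims for a given assignment."""
    a_implies_f = implies(a, f)
    return {
        "1.1": implies(b, a),                                  # B => A
        "1.2": implies(implies(b, a), a),                      # (B => A) => A
        "2.1": implies(c, a_implies_f),                        # C => (A => F)
        "2.2": implies(implies(c, a_implies_f), a_implies_f),  # (C => (A => F)) => (A => F)
    }

false_counts = {name: 0 for name in ("1.1", "1.2", "2.1", "2.2")}
for a, b, c in product([True, False], repeat=3):
    false_leaves = [name for name, value in leaves(a, b, c).items() if not value]
    assert len(false_leaves) == 1  # exactly one flawed leaf per assignment
    false_counts[false_leaves[0]] += 1

print(false_counts)  # {'1.1': 2, '1.2': 2, '2.1': 2, '2.2': 2}
```

Each leaf is the flawed one in 2 of the 8 equally likely assignments, so a judge checking a single leaf finds the flaw only 25% of the time. Applying the same split again to every leaf keeps exactly one flawed leaf while doubling the number of leaves, so at depth d each leaf is false with probability 1/2^d under the same independence assumption.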

[1] See the appendix for slightly more detail on how this relates to alignment.
[2] This is the factored cognition hypothesis - this is roughly equivalent to the claim that the answer could be computed by a large HCH tree. This human-understandable argument or HCH tree may not necessarily mirror the structure of the reasoning the model used to arrive at the answer - the model may have used human-incomprehensible heuristics or shortcuts learnt from data.
[3] I.e., of a size such that the debaters’ subjective probability of there being no factors smaller than X is roughly the same as their subjective probability of there being no factors larger than X.
[4] For example, in a formal logic setting, if your answer is A, you can pick some proposition B that you think is about 50% likely to be true, and use the subclaim !A => !B. The full example is under ‘Formal logic example’.

[Original first point deleted, on account of describing something that resembled Bayesian updating closely enough to make my point invalid.]

I don't think this approach applies to most actual bad arguments.

The things we argue about the most are ones over which the population is polarized, and polarization is usually caused by conflicts between different worldviews. Worldviews are constructed to be nearly self-consistent. So you're not going to be able to reconcile people of different worldviews by comparing proofs. Wrong beliefs come in sets, where each contradiction caused by one wrong belief is justified by other wrong beliefs.

So for instance, a LessWrongian would tell a Christian that positing a God doesn't explain how life was made, because she's just replaced a complex first life form with an even more-complex God, and what made God? The Christian will reliably respond that God is eternal, outside of space and time, and was never made.

This response sounds stupid to us, but it's part of a philosophical system built by Plato, which he designed to be self-consistent. The key parts here are the inversion of "complexity" and the denial of mechanism.

The inversion of complexity is the belief that simple things are greater and more powerful than complex things. The central notion is "purity", and pure, simple things are always superior to complicated things. God is defined as ultimate purity and simplicity. God is simple because you can fully describe Him just by saying he's perfect, and there's only one way of being perfect. He's eternal, because if he had a starting-point or an ending-point in time, then other points in time would be equally good, and "perfection" would be ambiguous. "God is perfectly simple" is actually part of Catholic dogma, and derived from Plato. So a Christian doesn't think she's replaced complex life with a more-complex God; she's replaced it with a more-simple and therefore more-powerful God.

The denial of mechanism is the denial that anything gets its properties mechanistically. An animal isn't alive because it eats food and metabolizes it and reproduces; it eats food and metabolizes it and reproduces because it's alive. Functions are magically inherited from categories ("Forms"), rather than categories arising from a cooperative combination of functions. (This is why spiritualists who believe in a good God dislike machinery. It's an abomination to them, as it has new capabilities not inherited from any eternal Form, and their intuition is that it must be animated by some spirit other than God. They think of magic as natural, and causes other than magic as unnatural; we think just the opposite.)

Because God is perfect, He is omnipotent, and hence has every possible capability, just as he is perfect in every way. Everything less than God is less powerful, lacking some capabilities, and more-complex, because you must enumerate all those missing capabilities and perfections to describe it. (This is the metaphysics behind Tolstoy's saying, "Every happy family is happy in the same way. Every unhappy family is unhappy in different ways.") The Great Chain of Being is a complete linear ordering of every eternal Form, proceeding from God at the top (perfect, simple, omnipotent), down to complete lack and emptiness at the other end (which is Augustinian Evil). Each step along that chain is a loss of some perfection.

Hence, to the Christian there's no "problem" of complexity in saying that God created life, because God is less-complex than life, and therefore also more-powerful, since complexity implies many losses of perfection and capabilities. There is no need to posit that God is complex to explain His powers, because capabilities arise from essence, not from mechanics, and God's perfectly-simple essence is to have all capabilities. This is because Plato designed his ontology to eliminate the problem of how complex life arose.

If you argue with Marxists, post-modernists, or the Woke, you'll similarly find that, for every solid argument you have that proves a belief of theirs is wrong, they have some assumptions which to them justify dismissing your argument. You'll never find yourself able to compare proofs with an ideological opposite and agree on the validity of each step.

When you say 'this approach', what are you referring to?

If you argue with Marxists, post-modernists, or the Woke, you’ll similarly find that, for every solid argument you have that proves a belief of theirs is wrong, they have some assumptions which to them justify dismissing your argument

They might well say the same about you. All arguments are based on fundamental assumptions that are necessarily unproven.

Is the set of real numbers simple or complex? What information does it contain? What information doesn't it contain?

This may be out-of-scope for the writeup, but I would love to get more detail on how this might be an important problem for IDA.

Yep, planning to put up a post about that soon. The short argument is something like:
The equivalent of an obfuscated argument for IDA is a decomposition that includes questions the model doesn't know how to answer.
We can't always tell the difference between an IDA tree that uses an obfuscated decomposition and gets the wrong answer, vs an IDA tree that uses a good decomposition and gets the right answer, without unpacking the entire tree.

I'd like to read this post but can't find it in your post history. Any chance it might be sitting in your drafts folder? Also, do you know if obfuscated argument (or the equivalent problem for IDA) is the main reason that research interest in IDA has seemingly declined a lot from 3 years ago?

Thanks for the writeup! This google doc (linked near "raised this general problem" above) appears to be private: https://docs.google.com/document/u/1/d/1vJhrol4t4OwDLK8R8jLjZb8pbUg85ELWlgjBqcoS6gs/edit

In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew nothing except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they will be able to answer it?

If I had to summarize this finding in one sentence, it would be "it seems like an expert can generally find a set of arguments for a false claim that is flawed such that an equally competent expert can't identify the flawed component, and the set of arguments doesn't immediately look suspect". This seems surprising, and I'm wondering whether it's unique to physics. (The cryptographic example was of this kind, but there, the structure of the dishonest arguments was suspect.)

If this finding holds, my immediate reaction is "okay, in this case, the solution for the honest debater is to start a debate about whether the set of arguments from the dishonest debater has this character". I'm not sure how good this sounds. I think my main issue here is that I don't know enough physics to understand why the dishonest arguments are hard to identify.

In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew nothing except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they will be able to answer it?

Not systematically; I would be excited about people doing these experiments. One tricky thing is that you might think this is a strategy that's possible for ML models, but that humans aren't naturally very good at.

If I had to summarize this finding in one sentence, it would be "it seems like an expert can generally find a set of arguments for a false claim that is flawed such that an equally competent expert can't identify the flawed component, and the set of arguments doesn't immediately look suspect". This seems surprising, and I'm wondering whether it's unique to physics. (The cryptographic example was of this kind, but there, the structure of the dishonest arguments was suspect.)

Yeah, this is a great summary. One thing I would clarify is that it's sufficient that the set of arguments don't look suspicious to the judge. The arguments might look suspicious to the expert, but unless they have a way to explain to the judge why it's suspicious, we still have a problem.

If this finding holds, my immediate reaction is "okay, in this case, the solution for the honest debater is to start a debate about whether the set of arguments from the dishonest debater has this character". I'm not sure how good this sounds. I think my main issue here is that I don't know enough physics to understand why the dishonest arguments are hard to identify.

Yeah, I think that is the obvious next step. The concern is that the reasons the argument is suspicious may be hard to justify in a debate, especially if they're reasons of the form 'look, I've done a bunch of physics problems, and approaching it this way feels like it will make things messy, whereas approaching it this way feels cleaner'. Debate probably doesn't work very well for supervising knowledge that's gained through finding patterns in data, as opposed to knowledge that's gained through step-by-step reasoning. Something like imitative generalisation (AKA 'learning the prior') is trying to fill this gap.

Planned summary for the Alignment Newsletter:

We’ve <@previously seen@>(@Writeup: Progress on AI Safety via Debate@) work on addressing potential problems with debate, including (but not limited to):
1. Evasiveness: By introducing structure to the debate, explicitly stating which claim is under consideration, we can prevent dishonest debaters from simply avoiding precision.
2. Misleading implications: To prevent the dishonest debater from “framing the debate” with misleading claims, debaters may also choose to argue about the meta-question “given the questions and answers provided in this round, which answer is better?”.
3. Truth is ambiguous: Rather than judging whether answers are _true_, which can be ambiguous and depend on definitions, we instead judge which answer is _better_.
4. Ambiguity: The dishonest debater can use an ambiguous concept, and then later choose which definition to work with depending on what the honest debater says. This can be solved with <@cross-examination@>(@Writeup: Progress on AI Safety via Debate@).
This post presents an open problem: the problem of _obfuscated arguments_. This happens when the dishonest debater presents a long, complex argument for an incorrect answer, where neither debater knows which of the series of steps is wrong. In this case, any given step is quite likely to be correct, and the honest debater can only say “I don’t know where the flaw is, but one of these arguments is incorrect”. Unfortunately, honest arguments are also often complex and long, to which a dishonest debater could also say the same thing. It’s not clear how you can distinguish between these two cases.
While this problem was known to be a potential theoretical issue with debate, the post provides several examples of this dynamic arising in practice in debates about physics problems, suggesting that this will be a problem we have to contend with.
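To make the "any given step is quite likely to be correct" point concrete, here is a minimal sketch. It assumes an argument with a fixed number of steps, a known number of flawed steps, and a challenger who has no idea where the flaw is and so picks a step uniformly at random. These assumptions and the function names are illustrative only and are not taken from the post's experiments.

```python
from fractions import Fraction


def chance_challenge_hits_flaw(num_steps: int, num_flawed: int = 1) -> Fraction:
    """Probability that one uniformly random challenge lands on a flawed step.

    Hypothetical model: the challenger cannot tell steps apart, so every
    step is equally likely to be picked.
    """
    if num_steps <= 0 or not 0 <= num_flawed <= num_steps:
        raise ValueError("need num_steps > 0 and 0 <= num_flawed <= num_steps")
    return Fraction(num_flawed, num_steps)


def chance_step_is_correct(num_steps: int, num_flawed: int = 1) -> Fraction:
    """Probability that a uniformly chosen step is sound."""
    return 1 - chance_challenge_hits_flaw(num_steps, num_flawed)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, chance_step_is_correct(n), chance_challenge_hits_flaw(n))
```

Under these assumptions, the longer the argument, the more likely it is that whichever single step gets examined will check out, even though the argument as a whole is wrong.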

Added an opinion:

This does seem like a challenging problem to address, and as the authors mention, it also affects iterated amplification. (Intuitively, if during iterated amplification the decomposition chosen happens to be one that ends up being obfuscated, then iterated amplification will get to the wrong answer.) I’m not really sure whether I expect this to be a problem in practice -- it feels like it could be, but it also feels like we should be able to address it using whatever techniques we use for robustness. But I generally feel very confused about this interaction and want to see more work on it.

I'd be interested to hear more detail of your thoughts on how we might use robustness techniques!

Wonderful writeup!

I'm sure you've thought about this, but I'm curious why the following approach fails. Suppose we require the debaters to each initially write up a detailed argument in judge-understandable language and read each other's argument. Then, during the debate, each debater is allowed to quote short passages from their opponent's writeup. Honest will be able to either find a contradiction or an unsupported statement in Dishonest's initial writeup. If Honest quotes a passage and says it's unsupported, then Dishonest has to respond with the supporting sentences.

Thanks!

Yep, this does work, but limits us to questions where the argument in judge-understandable language is short enough that the debaters can write the whole thing down. So if the debaters run in P-time at deployment time, this gives us MA, not PSPACE as originally hoped.

OK, I guess I'm a bit unclear on the problem setup and how it involves a training phase and deployment phase.

I just mean that this method takes order(length of argument in judge-understandable language) time. So if the argument is large then you're going to need to let the debate run for a long time. This is as opposed to the previous hope that even if the argument tree is exp-sized, the debate can run in P-time.

To be clear, I think this is a good suggestion and is close to how I imagine we'd actually run debate in practice. It just doesn't get us beyond MA if the debaters only write P-size arguments.
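The gap between the two costs discussed in this thread can be sketched numerically. The following is only an illustration under a simplifying assumption that is not from the post: the argument is a complete tree with a fixed branching factor, writing the argument out in full means writing every node, and a debate that recurses on one disputed claim per round visits a single root-to-leaf path. The function names are hypothetical.

```python
def full_writeup_size(depth: int, branching: int = 2) -> int:
    """Number of nodes in a complete argument tree of the given depth.

    Stands in for the cost of writing the whole argument down up front.
    """
    if depth < 0 or branching < 1:
        raise ValueError("need depth >= 0 and branching >= 1")
    return sum(branching ** level for level in range(depth + 1))


def single_path_length(depth: int) -> int:
    """Number of nodes on one root-to-leaf path.

    Stands in for the cost of a debate that narrows to one claim per round.
    """
    if depth < 0:
        raise ValueError("need depth >= 0")
    return depth + 1


if __name__ == "__main__":
    for d in (5, 10, 20):
        print(d, full_writeup_size(d), single_path_length(d))
```

In this toy model the write-up grows exponentially with depth while the single path grows linearly, which is the difference between needing the whole argument to be polynomial-sized and hoping to handle exponentially large argument trees.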

Hey this is super exciting work, I'm a huge fan of the clarification over the protocol and introduction of cross-examination!

Will you be able to open-source the dataset at any point? In particular, the questions, human arguments and then counter-claims. It would be very useful for further work.

The footnotes aren't numbered at the bottom of the post.

unreasonable ^5

I think there's a typographical error - this doesn't link to any footnote for me, and there doesn't appear to be a fifth footnote at the end of the post

Geoffrey and others raised this general problem several years ago (e.g. here)

This link no longer works - I get a permission denied message.

In the RSA-2048 example, why is it infeasible for the judge to verify every one of the honest player's arguments? (I see why it's infeasible for the judge to check every one of the dishonest player's arguments.)

[Original first point deleted, on account of describing something that resembled Bayesian updating closely enough to make my point invalid.]

I don't think this approach applies to most actual bad arguments.
