How do you decide what to set ε to? You mention "we want assumptions about humans that are sensible a priori, verifiable via experiment", but I don't see how ε can be verified via experiment, given that for many of the questions we'd want the human oracle to answer, there is no source of ground-truth answers to compare the human answers against.
With unbounded Alice and Bob, this results in an equilibrium where Alice can win if and only if there is an argument that is robust to an ε-fraction of errors.
How should I think about, or build up some intuitions about, what types of questions have an argument that is robust to an ε-fraction of errors?
Here's an analogy that leads to a pessimistic conclusion (but I'm not sure how relevant it is): replace the human oracle with a halting oracle, let the top-level question being debated be whether some Turing machine T halts, and let the distribution over which ε is defined be the uniform distribution. Then it seems like Alice has a very tough time (for any T whose halting she can't prove or disprove herself), because Bob can reject/rewrite all the oracle answers that are relevant to T in some way, which is a tiny fraction of all possible Turing machines. (This assumes that Bob gets to pick the classifier after seeing the top-level question. Is this right?)
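To make the worry concrete, here is a minimal numerical sketch (all numbers invented) of why a per-distribution ε budget can be useless when Alice's argument only touches a tiny, predictable set of queries:

```python
# Toy illustration (made-up numbers): Bob's ε budget is measured against a huge
# uniform distribution, but Alice's argument only touches a tiny, predictable
# subset of queries, so Bob can corrupt every query Alice actually needs.

NUM_POSSIBLE_QUERIES = 10**9   # e.g. all Turing machines up to some description length
QUERIES_ALICE_NEEDS = 10**3    # oracle answers relevant to the particular machine T
EPSILON = 0.01                 # Bob may corrupt up to this fraction of the uniform distribution

budget = EPSILON * NUM_POSSIBLE_QUERIES            # number of queries Bob may corrupt
fraction_needed = QUERIES_ALICE_NEEDS / NUM_POSSIBLE_QUERIES

print(f"Bob's corruption budget: {budget:.0f} queries")
print(f"Fraction of the distribution Alice relies on: {fraction_needed:.2e}")
print("Bob can corrupt everything Alice needs:", QUERIES_ALICE_NEEDS <= budget)
```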
I think there are roughly two things you can do:
I think we need a much more complex error theory for humans. The probability of a human making a mistake goes up if they are sick, tired, rushed, stressed, or being manipulated (for example by sycophancy); it also depends on the individual skills and capabilities of the human in question, and on the difficulty of the specific task for that specific human. When humans make errors, some of these factors tend to produce relatively unbiased, mostly-random errors, while others, like being manipulated, may produce very directed and predictable errors.
All of this is obvious and well-known to both the humans and the AIs involved (especially to anyone used to working in the soft sciences, or otherwise working with humans a lot). When building a system that includes human input or feedback, all of this should be taken into account. Current work on weak-to-strong generalization barely scratches the surface of this: it acknowledges that the weak supervisor can make mistakes, but doesn't attempt to learn about, model, and then compensate for any of the deep, complex structure of these errors.
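As a purely illustrative sketch of what a richer error model could look like (the feature names, coefficients, and functional form below are all invented, not drawn from the post), here is a toy model in which error probability depends on context, and manipulation produces directed rather than random errors:

```python
import math
from dataclasses import dataclass

@dataclass
class RaterContext:
    """Illustrative context features for a single human judgment (all invented)."""
    fatigue: float          # 0 = fresh, 1 = exhausted
    time_pressure: float    # 0 = relaxed, 1 = extremely rushed
    task_difficulty: float  # 0 = trivial for this rater, 1 = at the edge of their skill
    manipulation: float     # 0 = neutral argument, 1 = heavily sycophantic / manipulative

def error_probability(ctx: RaterContext) -> float:
    """Toy logistic model: chance the rater's answer is wrong at all."""
    logit = (-3.0 + 2.0 * ctx.fatigue + 1.5 * ctx.time_pressure
             + 2.5 * ctx.task_difficulty + 2.0 * ctx.manipulation)
    return 1.0 / (1.0 + math.exp(-logit))

def error_directedness(ctx: RaterContext) -> float:
    """Toy measure of how directed the errors are: the share of error-driving
    factors coming from manipulation (which pushes toward the manipulator's
    preferred answer) rather than from noise-like factors."""
    return ctx.manipulation / (ctx.manipulation + ctx.fatigue + ctx.time_pressure + 1e-9)

tired = RaterContext(fatigue=0.9, time_pressure=0.2, task_difficulty=0.3, manipulation=0.0)
manipulated = RaterContext(fatigue=0.1, time_pressure=0.1, task_difficulty=0.3, manipulation=0.9)
for name, ctx in [("tired", tired), ("manipulated", manipulated)]:
    print(f"{name}: P(error)={error_probability(ctx):.2f}, directedness={error_directedness(ctx):.2f}")
```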
I think the Value Learning framework is useful here. Ideally we would like our smart AIs to be researching human values, including researching errors in feedback/input that humans give about their values. The challenge here is that the AI doing the value learning needs to be, not exactly aligned, but at least sufficiently close to aligned that it actually wants to do good research to figure out how to become (or create a successor that is) better aligned. So that converts this into a three-step problem:
1) humans need to figure out how large the region of convergence is: how aligned does an AI need to be for its value learning to produce results that are better aligned than it is? In particular, we want a confident lower bound on the size of the convergence region: a region in which we're very confident this process will converge (see the toy sketch after this list).
2) humans need to test their AI and convince themselves that it's at least aligned enough to be inside the region of convergence
3) now we can work with the semi-aligned AI to create a better aligned successor, and iterate that process as it converges
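To make the "region of convergence" concrete, here is a toy model (the update rule and threshold are invented purely for illustration): alignment is a single number in [0, 1], one round of AI-assisted value learning maps the current level to a new one, and iteration only climbs toward full alignment if it starts above a threshold.

```python
# Toy model (all numbers invented): alignment is a scalar in [0, 1]. One round of
# AI-assisted value learning maps the current level a to update(a). The "region of
# convergence" is the set of starting points from which iteration climbs toward 1
# rather than decaying toward 0; here the boundary sits at a = 0.4.

def update(a: float) -> float:
    """Illustrative update: AIs above the threshold improve their successors,
    AIs below it drift further from alignment."""
    return a + 0.3 * a * (a - 0.4) * (1.0 - a)

def iterate(a0: float, rounds: int = 200) -> float:
    a = a0
    for _ in range(rounds):
        a = min(max(update(a), 0.0), 1.0)
    return a

for a0 in [0.2, 0.39, 0.41, 0.6, 0.9]:
    print(f"start at alignment {a0:.2f} -> after 200 rounds: {iterate(a0):.2f}")
```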
One thing that actually helps with step 1) is that the stakes of AI alignment are extremely high. Any model sufficiently close to aligned that it doesn't want to risk human extinction, or comparable X-risks like permanent loss of autonomy for the human race, is motivated to do a good job of value learning. We then also need it to be good enough at understanding humans to be capable of performing this research, which seems like a low bar for LLMs (for example, LLM simulations of doctors already have better bedside manner than the majority of real doctors). So, because the stakes are so high, a bare minimum of "don't kill every human" level semi-alignment seems likely to be most of what we need at step 1).
Of course, that still leaves step 2), which seems like the area where things like debate might be helpful. However, at step 2) we're presumably dealing with something around, or at most just above, smart-human capability level, not a full-blown ASI. Under this approach, we should already be well into the step 3) convergence process before we get to full-blown ASI.
A more subtle issue here is whether the region of convergence has a unique stable point inside it, or whether there are multiple, and if the latter, how we identify them and decide which one to aim the process at. That sounds like the sort of question that might come up once we were well into the ASI capability range, and where debate might be very necessary.
Summary: Both our (UK AISI's) debate safety case sketch and Anthropic’s research agenda point at systematic human error as a weak point for debate. This post talks through how one might strengthen a debate protocol to partially mitigate this.
The complexity theory models of debate assume some expensive verifier machine M with access to a human oracle H, such that

1. if we could run M in full, with its oracle queries answered by H, we would trust the resulting answer, and
2. the debate protocol gives us access to M's answer far more cheaply, assuming optimal play by the debaters.

Typically, M is some recursive tree computation, where for simplicity we can think of human oracle queries as occurring at the leaves of the tree. Key design elements encapsulated by the verifier machine (rather than by the interactive proof protocol) are those related to the structure of the underlying debate, such as whether counterarguments are used, when the human oracle is consulted, and how the judges are trained.
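As a rough sketch of the kind of verifier machine being described (the tree shape, decomposition, and aggregation rule here are placeholders I have invented), this is a minimal recursive tree computation that consults a stochastic human oracle at its leaves:

```python
import random
from typing import Callable

# Minimal sketch of a recursive tree verifier M with human-oracle queries at the
# leaves. The claim structure, splitting rule, and aggregation are placeholders.

HumanOracle = Callable[[str], bool]  # question -> (stochastic) yes/no answer

def verify(claim: str, depth: int, oracle: HumanOracle) -> bool:
    """Recursively verify `claim` by splitting it into subclaims; at the leaves,
    fall back to asking the human oracle directly."""
    if depth == 0:
        return oracle(claim)                                   # leaf: human judgment
    subclaims = [f"{claim} / part {i}" for i in range(2)]      # placeholder decomposition
    return all(verify(sub, depth - 1, oracle) for sub in subclaims)  # placeholder aggregation

def noisy_human(question: str) -> bool:
    """Stand-in human: the correct answer is True, but wrong 10% of the time."""
    return random.random() > 0.1

print(verify("top-level claim", depth=3, oracle=noisy_human))
```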
Model the human oracle as type

H : Q → Δ(A)

where Q is the question type, A is the answer type, and Δ(A) is distributions over A (the humans are stochastic). Assumption (1) above then becomes

Strong assumption: M, run with the oracle H, gives a good answer to every question we care about.

This claim includes robustness to errors: M only has oracle access to H, so seeing through noise in H requires M to sample enough. It also includes robustness to systematic error: if H is systematically wrong, M must sample different questions to dodge the wrongness.
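Here is a toy sketch of the "seeing through noise" half of that claim (error rate and sample counts invented); note that repeated sampling does nothing against systematic errors, which is the harder half.

```python
import random

def noisy_oracle(question: str, p_error: float = 0.2) -> bool:
    """Stochastic human oracle H: the correct answer is True, but each call is
    independently wrong with probability p_error (pure noise, no systematic bias)."""
    return random.random() > p_error

def sampled_answer(question: str, samples: int) -> bool:
    """M only has oracle access to H, so it sees through noise by repeating the
    query and taking a majority vote. This cannot help with systematic errors,
    which repeat identically on every sample."""
    votes = sum(noisy_oracle(question) for _ in range(samples))
    return votes > samples / 2

for samples in [1, 11, 101]:
    trials = 1000
    accuracy = sum(sampled_answer("q", samples) for _ in range(trials)) / trials
    print(f"{samples:>3} samples per query -> {accuracy:.1%} of verdicts correct")
```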
This is a lot to ask of M! A more practical situation would be to make an explicit assumption about the errors H makes, and only require M to give good answers when that assumption holds.

A candidate form for this assumption is that H has not too many errors, but we don’t know where they are:

With probability at least 1 - ε over questions q drawn from some distribution μ, the answers H(q) are good.

There may be other errors in H for which we do know where they are; those can hopefully be dealt with by other means. Note that defining this assumption couples cognitive science and theory: we want assumptions about humans that are sensible a priori, verifiable via experiment, and admit nice theoretical protocols.
Consider an asymmetric debate where Alice is trying to convince the judge that a statement is true, and Bob is trying to falsify Alice. We can give Bob the ability to adversarially manipulate some of the oracle’s responses, by following the additional rule:

Infrequent rejection: at the start of the debate, Bob chooses a classifier over oracle queries that rejects at most an ε-fraction of queries; whenever the debate would send a rejected query to the human oracle, Bob may supply the answer himself.

With unbounded Alice and Bob, this results in an equilibrium where Alice can win if and only if there is an argument that is robust to an ε-fraction of errors.

(It’s less clear what happens if we try to impose efficiency limits on the debaters, such as limiting them to small circuits: in this case our equilibria can only take into account ε-subsets of queries which can be efficiently described. We have some upcoming work trying to add these limits in a recursive debate context, and we can see whether the two approaches combine once that is finished. My intuition is that something meaningful can still be said, but not with confidence.)
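A minimal sketch of this modified oracle step, under my own simplifying assumptions about the interfaces (boolean answers, a deterministic classifier, and an empirical estimate standing in for the ε check):

```python
from typing import Callable

Question = str
Classifier = Callable[[Question], bool]   # True = Bob rejects this oracle query

def oracle_step(q: Question,
                human: Callable[[Question], bool],
                bob_rejects: Classifier,
                bob_answer: Callable[[Question], bool]) -> bool:
    """One oracle call under the infrequent-rejection rule: if Bob's pre-committed
    classifier rejects q, Bob supplies the answer; otherwise the human does."""
    return bob_answer(q) if bob_rejects(q) else human(q)

def rejection_rate(bob_rejects: Classifier, questions: list[Question]) -> float:
    """Empirical rejection frequency w.r.t. whatever distribution the sample is
    drawn from; the rule requires this to be at most ε."""
    return sum(bob_rejects(q) for q in questions) / len(questions)

# Toy instantiation: an honest (noiseless) human, and a Bob who corrupts ~3% of queries.
human = lambda q: True
bob_rejects = lambda q: hash(q) % 100 < 3
bob_answer = lambda q: False

sample = [f"question {i}" for i in range(10_000)]
print("Bob's rejection rate (must stay below ε):", rejection_rate(bob_rejects, sample))
rejected_q = next(q for q in sample if bob_rejects(q))
print("Answer on a rejected query:", oracle_step(rejected_q, human, bob_rejects, bob_answer))
```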
However, the protocol is also misleading as stated, since I haven’t specified the distribution against which the Infrequent rule is measured. This is a problem, because there are at least two classes of distributions one might have in mind:

1. A natural distribution over questions: roughly, the kinds of questions one would put to humans independently of any particular debate.
2. The induced distribution over oracle queries: the queries actually made by the verifier machine, or by the trained debate agents approximating it.

These distributions are super different! Certainly one can imagine weird debater behaviour concentrating queries to humans on some alien distribution of questions, such that most of the alien distribution falls within a particular ε-fraction of a natural distribution. So even if humans are not wrong too much of the time on a natural distribution, this doesn’t get you to safety.

Note that we always know the induced distribution on queries: it can be read out of the behaviour of trained debate agents. We don’t necessarily know the natural distribution. In both cases, the claim that queries are answered correctly except on an ε-fraction w.r.t. μ is still an assumption.
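A small numerical sketch of why this matters (numbers invented): an ε-sized error slice under the natural distribution can carry most of the mass of the induced query distribution.

```python
# Invented numbers: the human oracle is wrong only on an ε-fraction of a "natural"
# distribution of questions, but the debaters' induced query distribution
# concentrates on exactly that slice.

EPSILON = 0.01                      # error mass under the natural distribution μ
induced_mass_on_error_slice = 0.80  # fraction of actual debate queries landing in that slice

# Every query landing in the slice hits a wrong answer, so the error rate the
# debate actually experiences is 80%, not 1%.
print(f"Error rate w.r.t. the natural distribution: {EPSILON:.0%}")
print(f"Error rate w.r.t. the induced query distribution: {induced_mass_on_error_slice:.0%}")
```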
Let’s assume we do know and trust all but an ε-fraction of queries w.r.t. μ. Then our “Bob rejection” protocol at the start is a well-defined computation. Can we run debate on it? The direct answer is no, since the first step is “Bob chooses a classifier”, and this is a very large object (not polynomial space). However, we can use the cross-examination trick to represent the classifier as a snapshotted activation state of an agent, Cliff: whenever the debate would query the human oracle, a copy of the snapshotted Cliff is first asked whether to reject that query (with Bob answering in place of the human if so), and Alice can cross-examine further copies of the snapshot to check that Cliff’s rejections are consistent and not too frequent.
This allows us to have a classifier that could plausibly have a decent strategy while ensuring consistency across queries and not rejecting too often.
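A rough sketch of that idea, with the frozen activation state stood in for by a fixed seed string and the classification by a hash (both my own stand-ins): every query is answered by an identical copy of the same frozen state, so consistency across queries is automatic, and the rejection rate can be audited by sampling.

```python
import hashlib

class SnapshottedClassifier:
    """Stand-in for the cross-examination trick: instead of writing down an
    exponentially large classifier, we freeze an agent's state once (here, a
    seed string standing in for snapshotted activations) and answer every query
    with an independent copy of that frozen state."""

    def __init__(self, frozen_state: str, target_rate: float):
        self.frozen_state = frozen_state
        self.target_rate = target_rate

    def rejects(self, question: str) -> bool:
        # Deterministic function of (frozen state, question): copies of the
        # snapshot necessarily agree, so consistency across queries is automatic.
        digest = hashlib.sha256((self.frozen_state + question).encode()).digest()
        return digest[0] / 255 < self.target_rate

cliff = SnapshottedClassifier(frozen_state="cliff-activations-v1", target_rate=0.03)

# Alice can audit the rejection frequency by sampling questions and querying
# fresh copies of the snapshot (here, just calling the deterministic function).
sample = [f"question {i}" for i in range(10_000)]
rate = sum(cliff.rejects(q) for q in sample) / len(sample)
print(f"Observed rejection rate: {rate:.1%} (should be roughly ε)")
print("Consistent across repeated queries:", cliff.rejects("q42") == cliff.rejects("q42"))
```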
However, as stated, this requires Cliff to have thought out a lot of things in advance, which could be very hard. It may be possible to fix this by incrementally constructing Cliff as we go down the tree, but I am not sure how: the issue is that different parts of the tree can make overlapping oracle queries, so it is not as simple as choosing how to apportion rejection probability across a branch of the computation.
We'd be very excited to collaborate on further research. If you're interested in collaborating with UK AISI, you can express interest here. If you're a non-profit or academic, you can also apply for grants up to £200,000 directly here.