Thoughts on AI Safety via Debate

Geoffrey Irving et al. at OpenAI have a paper out on AI safety via debate; the basic idea is that you can model debates as a two-player game (and thus apply standard insights about how to play such games well) and one can hope that debates asymmetrically favor the party who's arguing for a true position over a false position.

If so, then we can use debates between AI advisors for alignment; suppose an AI develops a new drug that we could give to patients. One advisor argues the case that this will be beneficial; another argues the case that it will be detrimental, and human judges decide who is more convincing. If both advisors are equally smart (perhaps they're mirrored systems, just with different targets), then any superhuman persuasiveness the advisors have should be cancelled out, and we'll be left with just the edge for truthfulness.

There are more subtleties to the proposal; in particular, you want the AIs to not overwhelm the human with data, and so in current examples each advisor can reveal a single pixel of an image to the judge. This moves the contradictions up to the level of interpretation--any pixel where the two advisors agree on its meaning can be ignored, and attention can be focused on the pixels where the advisors disagree. And, so long as advisors are incentivized to tell the truth when they're arguing for the right thing, any such disagreement where an advisor has been exposed as lying about a tiny fact can settle the debate.

This is also something we can test and get some evidence on now. There's a debate game where three people can play the three roles--red arguer, blue arguer, and judge. The typical game is that you choose a photo of a cat or a dog, then flip a coin to determine whether blue or red will be honest (with the other one arguing for the other variety). Each arguer can draw boxes on the image to communicate with the judge, and can choose to reveal a single pixel over the course of the argument.

I've played the game several times (and won a bunch), and thought it might be helpful to write up some of my experiences.

1. I know a lot about cats and dogs, and I think this gives me a pretty significant edge. In one game, I was honestly arguing that the image was of a cat; it happened to be a tortoiseshell cat, which is a coloration dogs can't have (brindle is a different thing), and I was able to identify the various regions that were the various colors, and argue that this is strong evidence that it's a cat. It also makes me somewhat worried about the level of knowledge necessary to be an effective judge--someone who doesn't know this is a fact could get hoodwinked by my confidence and I could just argue similar fake facts (and unlike claiming that a black pixel is white, such fake facts aren't something that the other advisor can easily use as a gotcha). In another game, I was the judge, and the arguers agreed on the coloration of the animal--but it was a coloration that was only present in dogs, which I don't think the cat-arguer knew (otherwise they would have disagreed about the colors, presumably).

2. Differentiating between cats and dogs uses features that are relatively easy to communicate using the bounding boxes the debate UI gives you; if the judge asks the arguers to bound the eyes and nose of the animal, they can differentiate cats and dogs fairly easily (because of the big difference in facial structure between the two animals). Consider the following cat as an example:

I claim the liar is basically sunk once the honest party has identified where the eyes and nose are, and can't hope to pretend the nose or eyes are somewhere else. But sometimes they can; with this cat, they could decide to pretend that somewhere on the cat's chest is a nose, and pretend it's a chocolate lab:

3. Resolving disagreements using single pixels was much harder than expected for particular features. The presence of whiskers is fairly strong evidence in favor of cat, but determining whether or not there are whiskers with a single pixel is basically impossible. For the following image, what procedure could you use that would leave the judge believing that the cat has whiskers, while only revealing a single pixel from each arguer?

If you limit yourself to communicating rectangles, this seems pretty hard; you have to construct each whisker from lots of different rectangles (or give giant pixel lists, that the judge then has to turn into an image). And there are also lots of white / light pixels that aren't whiskers, either because of the light reflecting on dark fur or the white fur.

Basically the only feature that it seemed reliably useful for was eye shape; if the cat arguer draws a vertical bounding box for the pupil and the dog arguer draws a square bounding box for the pupil, it's relatively easy to find a pixel that the two necessarily disagree on, and then you can tell which eye shape is present. But then what do we do here?

4. Following the previous point, while the arguers can only reveal a single pixel of the original image, in the games that I played the primary information they conveyed was through images (in the form of bounding boxes). The ears are here--they're pointy or rounded because these pieces aren't part of the ear--the eyes are here, the nose is here, these regions are these colors, and so on. Similarly, doing debate on MNIST images of digits, the main thing the debaters are doing is establishing which pieces of the image are light or dark, which is meaningful only because the judge is a good classifier of images given pixels. ("Ah, I know there aren't 2s that have light pixels there, therefore I can rule out that this image is a 2.")

But does this extend beyond images? That seems pretty unclear to me; if I'm trying to argue that a particular drug extends lifespan through senolytic effects, and the other arguer is trying to argue that the drug leads to shortened lifespan because of the sudden increase in necrotic cells, then judging this debate seems like it relies not just on finding a small fact where the arguers disagree ("This pixel is dark" vs "No, this pixel is light" -> "okay, show me the pixel") but also finding an inference from that fact where the arguers agree ("If this pixel were dark, it would be because it's the animal's nose" + "I agree that a dark pixel here would be evidence of nose placement"). But if the arguers disagree on inferences ("No, that pixel is dark because it's the animal's chest fur, which is also dark"), then it's not obvious that this can be resolved by small facts, or without the judge having area expertise comparable to (or superior to!) the arguers. (Note the similarities to double crux.) For the black cat above, what pixel reveal could you do to establish the placement of the nose, or to argue that the other party is being dishonest about how pixels relate to the nose?

(It may be the case that there is always some line of evidence that establishes the honest party as being correct--for that specific example, the honest party can hammer on the shape of the pupils in the eyes--but it seems like the dishonest party might be able to overwhelm this (in the Bayesian sense) by making five different arguments in favor of dog, of which the honest party could perhaps only disprove one (and that potentially requiring partial cooperation of the liar, which they likely wouldn't grant), and that gives up their pixel reveal that would likely be better spent on establishing pupil shape.)
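To make the "overwhelm in the Bayesian sense" worry concrete, here is a rough sketch under stated assumptions: the judge treats each unrebutted argument as independent evidence with its own likelihood ratio, and a disproved argument is simply discarded rather than ending the debate. The function names and every number below are made up for illustration; none of them come from the paper or from the games I played.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Posterior odds = prior odds times the product of independent likelihood ratios."""
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds):
    """Convert odds of the form x:1 into a probability."""
    return odds / (1.0 + odds)


# All odds are dog:cat, and all values are hypothetical.
prior = 1.0                  # the judge starts at 50/50
liar_arguments = [2.0] * 5   # five pro-dog arguments, each a 2:1 update if left standing
pupil_evidence = 1.0 / 10.0  # the honest party's pupil-shape evidence, 10:1 for cat

# Case 1: the honest party establishes pupil shape, so all five pro-dog arguments stand.
pupils_established = posterior_odds(prior, liar_arguments + [pupil_evidence])

# Case 2: the honest party disproves one pro-dog argument and never establishes pupil shape.
one_disproved = posterior_odds(prior, liar_arguments[:4])

print(odds_to_probability(pupils_established))  # probability the judge assigns to "dog"
print(odds_to_probability(one_disproved))
```

With these particular numbers the judge ends up favoring dog in both cases, which is the shape of the worry; different assumed likelihood ratios would of course give a different answer.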

5. Our group thought that the honest party had something of an advantage when all parties were allowed to talk quickly, because they could generate reliable statements much more quickly; if I'm honest, I can just immediately identify the eyes, nose, mouth, ears, and paws, whereas the liar has to consider for every lie whether it'll allow them to be easily exposed (and, for every truth, whether it sinks them in the judge's estimation of the class). One strategy is to first manufacture the fake photo that they're going to pretend they're looking at, but the initial planning time doesn't quite cover this. I do wonder if a debate tool of an adversarial modifier (like, say, from this paper) or adversarial search (the closest real image in the database of the opposite class) would make it such that the lying party isn't at as much of a disadvantage here.

6. A more realistic scenario is one in which there are many classes (for example, perhaps it would be more fair to choose ImageNet images, assign the honest party the correct class, and let the dishonest party choose from any of the other thousands of classes, with their score dependent on how distant a class they could fake), or the advisors are themselves confused (for example, the AI system may be genuinely uncertain as to whether the drug would be helpful, with some models arguing for and other models arguing against). One game in the office (that I wasn't present for) involved this drawing of a--well, I'll let you decide:

Image source: icheban on imgur

One arguer genuinely thought this was a cat, the other arguer genuinely thought this was a dog (tho, confusingly, they both thought they had been assigned the 'liar' position, and so both were arguing for the opposite). It's not obvious that the debate procedure is going to work reliably here at identifying a liar, because there are no clear-cut disagreements between the arguers. (And, if they had both been arguing honestly, then there wouldn't even have been a liar, while still having a disagreement.)

Yes, the pupils are huge and round, but that isn't conclusive proof that the thing is a dog; the nose is pink and triangular, but that isn't conclusive proof that the thing is a cat. The fur is depicted in a more dog-like way, but perhaps that's just clumping from being wet; the ears are more pointed in a cat-like way, but there will be no pixel where the two arguers disagree about the ear, and all of their disagreements will be about what it means that the ears are more pointed than rounded.

I worry that much of the success of the debate game on toy examples relies on them being toy examples, and that genuine uncertainty (or ontological uncertainty, or ontological differences between the arguers and the judges) will seriously reduce the effectiveness of the procedure, which is unfortunate since that's the primary place it'll be useful!

---

Overall, I think I'm more optimistic about debate than I was before I played the debate game (I had read an earlier draft of the paper), and am excited to see what strategies perform well / what additional modifications make the game more challenging or easy. (To be clear, I expect that debate will play a small part in alignment, rather than being a central pillar, and think that training AIs to persuade humans is a dangerous road to travel down, but think that the adversarial framing of debate makes this somewhat safer and could likely have applications in many other subfields of alignment, like transparency.)

13 comments

and that gives up their pixel reveal that would likely be better spent on establishing pupil shape

My understanding is each debater can actually reveal many pixels to the judge. See this quote from section 3.2:

The one pixel constraint is not serious, as Alice can claim in natural language that a pixel has a certain color. If Bob knows this is a lie, he can win by spending his one pixel budget, so not spending it is an admission that Alice is telling the truth. Thus the only limitation is the length of the debate, and whether the human understands it.
My understanding is each debater can actually reveal many pixels to the judge. See this quote from section 3.2:

That sounds different to me--the point there is that, because you only need a single pixel to catch me in a lie, and any such demonstration of my dishonesty will result in your win, your limit won't be a true limit; either you can demonstrate a single pixel where we disagree, which should be enough to establish you as the honest player, or we agree about every pixel (in which case the truth should win).

The situation that the liar would hope to establish is that they successfully tell many small lies. If I say a pixel is medium gray, and you think the pixel is light gray, I can get away with calling it medium gray because revealing a single pixel to the judge (who, absent context, can't determine whether a pixel is 'light' or 'medium' in the relevant sense) won't give the judge enough evidence that I'm lying to settle the debate. Especially since you might be worried that I'm going to tell an escalating series of lies, and thus if you call me out on claiming that the pixel is medium gray instead of light gray, then you won't be able to call me out on claiming that a different pixel is black when it is in fact white. (This also means, interestingly, that the player who tries to reason about feature patches--like the diagonal pattern of a whisker--is potentially under more suspicion than a player who tries to reason about particular pixels, since it's easier to lie about contrasts (which can't be disproven with a single reveal) than it is to lie about pixels.)

Also consider the case of adversarial examples; if I take the reference image, determine the minimal infinity norm perturbation that results in an image of a different class, and then argue with reference to my image, presumably there's no one pixel we disagree about strongly (because the pixel we disagree about most strongly determines the infinity norm that I tried to minimize), and thus it's hard to establish a blatant lie.
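The parenthetical can be spelled out: the infinity norm of a perturbation is, by definition, the largest per-pixel difference, so minimizing it caps how blatant any single disputed pixel can be. Below is a minimal sketch using tiny made-up grayscale images stored as nested lists; `JUDGE_TOLERANCE` is a hypothetical threshold standing in for "how different a revealed pixel must be before the judge reads it as a lie," and is not something the paper defines.

```python
def linf_distance(image_a, image_b):
    """Largest absolute per-pixel difference between two equal-sized grayscale images."""
    return max(
        abs(a - b)
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    )


def most_disputed_pixel(image_a, image_b):
    """Return (row, col, difference) for the first pixel with the largest disagreement."""
    best = (0, 0, -1)
    for r, (row_a, row_b) in enumerate(zip(image_a, image_b)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if abs(a - b) > best[2]:
                best = (r, c, abs(a - b))
    return best


# Made-up 3x3 grayscale image (values 0-255) and a slightly perturbed version of it.
true_image = [[10, 200, 30], [40, 50, 60], [70, 80, 250]]
claimed_image = [[13, 197, 30], [40, 54, 60], [66, 80, 250]]

JUDGE_TOLERANCE = 16  # hypothetical: smaller differences don't register as a lie

row, col, diff = most_disputed_pixel(true_image, claimed_image)
reveal_exposes_lie = diff >= JUDGE_TOLERANCE

print(linf_distance(true_image, claimed_image), (row, col, diff), reveal_exposes_lie)
```

The best pixel the honest player could reveal differs by exactly the infinity norm, so if that norm sits below the judge's tolerance, no single reveal looks like a blatant lie.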

Especially if we aren't allowed to talk about RGB values, and instead have to mention subjective colors; in one game that I played, one of the players revealed a "pupil" pixel that was about #404040. This is weak evidence against being a pupil--you'd expect them to be somewhat darker--but the player hadn't lied at all about that specific pixel's color, just the interpretation of the color, which the other player demonstrated by revealing that an adjacent "true pupil" pixel was about #080808 (or maybe even full black).

Now, perhaps 'any lie at all' is enough to establish who's the honest party and who's the liar. But this seems not at all useful in the case of uncertainty, and also not at all useful in the case of pernicious disagreement, where I disagree about no low-level features but dispute all inferences that could be drawn from those features.

Especially if we aren't allowed to talk about RGB values, and instead have to mention subjective colors;

I assume we are allowed to talk about RGB values because in the actual AI debates, there is no legitimate reason for the AI debaters to talk about subjective impressions. They should always just talk about objective measurements or clear external facts (like what a certain sentence on a certain web page says). If a debater tries to talk about subjective impressions, the judge can just rule against that debater (since again, there seems to be no legitimate reason to do that), then the AIs will learn not to do that.

Also consider the case of adversarial examples; if I take the reference image, determine the minimal infinity norm perturbation that results in an image of a different class, and then argue with reference to my image, presumably there's no one pixel we disagree about strongly (because the pixel we disagree about most strongly determines the infinity norm that I tried to minimize), and thus it's hard to establish a blatant lie.

If we can talk about RGB values, we don't need to establish a lie based on a single pixel. The honest debater can give a whole bunch of RGB pixel values, which even if it doesn't conclusively establish a lie will make the truth telling strategy have a higher winning probability, which would be enough to make both debaters converge to telling the truth during training.

But this seems not at all useful in the case of uncertainty, and also not at all useful in the case of pernicious disagreement, where I disagree about no low-level features but dispute all inferences that could be drawn from those features.

Not sure I understand the part about uncertainty. About disputing inferences, it also seems to me that the judge needs to have enough area expertise to judge the validity of the inferences being disputed. In some cases the honest debater may be able to win by educating the judge (e.g., by pointing to a relevant section in a textbook). In other cases this may not be possible and I'm not sure what the solution is there.

ETA: The authors talk about this and related issues in section 5.3, with the following conclusions:

The complexity theoretic analogy suggests that these difficulties can be overcome by a sufficiently sophisticated judge under simple conditions. But that result may not hold up when AI systems need to use powerful but informal reasoning, or if humans cannot formalize their criteria for judgment. We are optimistic that we can learn a great deal about these issues by conducting debates between humans, in domains where experts have much more time than the judge, have access to a large amount of external information, or have expertise that the judge lacks.
The honest debater can give a whole bunch of RGB pixel values, which even if it doesn't conclusively establish a lie will make the truth telling strategy have a higher winning probability, which would be enough to make both debaters converge to telling the truth during training.

One thing that I find myself optimizing for is compression (which seems like a legitimate reason for the actual AI debates to talk about subjective impressions as opposed to objective measurements). It seems to me like if the debaters both just provide the judge with the whole image using natural language, then the honest debater is sure to win (both of them provide the judge with an image, both of them tell the judge one pixel to check from the other person, the honest debater correctly identifies a fake pixel and is immune to a similar attack from the dishonest debater). But this only makes sense if talk with the debaters is cheap, and external validation is expensive, which is not the case in the real use cases, where the judge's time evaluating arguments is expensive (or the debaters have more subject matter expertise than the judge, such that just giving the judge raw pixel values is not enough for the judge to correctly classify the image).
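The full-image procedure in the parenthetical above can be sketched as follows, assuming each debater transmits a complete claimed image, each names one pixel of the opponent's claim for the judge to check, and the judge rules against any debater whose claim fails its check. The function names, the red/blue labels as dictionary keys, and the 2x2 image are all illustrative assumptions.

```python
def pick_challenge(true_image, opponent_claim):
    """Return the first (row, col) where the opponent's claim differs from the truth, else None."""
    for r, row in enumerate(true_image):
        for c, value in enumerate(row):
            if opponent_claim[r][c] != value:
                return (r, c)
    return None


def judge(true_image, claims, challenges):
    """Reveal each challenged pixel and return the set of debaters whose claim fails."""
    caught = set()
    for challenger, target in (("red", "blue"), ("blue", "red")):
        pixel = challenges[challenger]
        if pixel is None:
            continue
        r, c = pixel
        if claims[target][r][c] != true_image[r][c]:
            caught.add(target)
    return caught


# Made-up 2x2 image; red reports it honestly, blue alters one pixel.
true_image = [[0, 255], [255, 0]]
claims = {
    "red": [[0, 255], [255, 0]],
    "blue": [[0, 255], [0, 0]],
}

# Both debaters can see the true image. The liar finds nothing wrong in the honest
# claim, so it falls back to challenging an arbitrary pixel.
challenges = {
    "red": pick_challenge(true_image, claims["blue"]),
    "blue": pick_challenge(true_image, claims["red"]) or (0, 0),
}

caught = judge(true_image, claims, challenges)
print(caught)
```

Note that the judge here only ever looks at two pixels of the real image; the expensive part is the two full claimed images the judge has to receive.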

Not sure I understand the part about uncertainty.

Most of my discussion is about the cat vs. dog debate game with humans, where it's assumed that the two debaters both know what the ground truth is (and the judge could verify it, given the image). But perhaps it is difficult for the debaters to discern the actual ground truth, and there is an honest disagreement--that is, both debaters agree on every pixel value, but think those pixel values add up to different classifications. (The ambiguous cat-dog drawing is such an example for the image classification problem, and one can imagine different classifiers that classify such a drawing differently. Or, with regular images, different classifiers may make different random errors due to incomplete training.) Such honest disagreement is what I mean by 'uncertainty.' Ideally, in such a system, the debate will quickly focus the judge's attention on the core crux and allow them to quickly settle the issue (or determine that it isn't possible to settle with the information available).

(In the case of advanced intelligence, consider the case where the AI system is proposing a drug for human consumption, where some of its models of human preferences and physiology think that the drug would be net good, and other models think it would be net bad. It seems like a debate-style model would be good at exposing the core disagreements to human supervisors, but that it is highly unlikely that those disagreements could be resolved by the equivalent of checking a single pixel.)

But perhaps it is difficult for the debaters to discern the actual ground truth

I think in those cases the debaters are supposed to give probabilistic answers and support them with probabilistic arguments. The paper talks about this a bit but not enough to give me a good idea of what those kinds of debates would actually look like (which was one of my complaints about the paper).

As usual with these things, I don't really understand the initial assumptions.

Are we assuming that the two AIs will not just engage in war? If agent A managed to hack agent B and replace it with a dumber version, that would help A win all the debates.

Are we assuming that the AIs will not just search for the most efficient way to brainwash the judge? Either drugs, or just words, which this seems to take as a serious possibility.

Are we assuming that the AIs will not try to gather more computational resources in order to outsmart the other agent, or exhibit other instrumentally convergent behaviors?

I'm not saying those assumptions are bad. But I don't understand when we should and shouldn't make them.

Is this asking whether ontology generation via debate is guaranteed to converge? Is this moving Aumann's agreement 'up a level'?

In Aumann, you have two Bayesian reasoners who are motivated by believing true things, who because they're reasoning in similar ways can use the output of the other reasoner's cognitive process to refine their own estimate, in a way that eventually converges.

Here, the reasoners are non-Bayesian, and so we can't reach the same sort of conclusions about what they'll eventually believe. And it seems like this idea relies somewhat heavily on game theory-like considerations, where a statement is convincing not so much because the blue player said it but because the red player didn't contradict it (and, since they have 'opposing' goals, this means it's true and relevant).

There's a piece that's Aumann-like in that it's asking "how much knowledge can we extract from transferring small amounts of a limited sort of information?"--here, we're only transferring "one pixel" per person, plus potentially large amounts of discussion about what those pixels would imply, and seeing how much that discussion can get us.

But I think 'convergence' is the wrong sort of way to think about it. Instead, it seems more like asking "how much of a constraint on lying is it to have it such that someone with as much information as you could expose one small fact related to the lie you're trying to tell?". It could be the case that this means liars basically can't win, because their hands are tied behind their backs relative to the truth; or it could be the case that debate between adversarial agents is a fundamentally bad way to arrive at the truth, such that these adversarial approaches can't get us the sort of trust that we need. (Or perhaps we need some subtle modifications, and then it would work.)

That makes sense. I'd frame that last bit more as: which bit, if revealed, would screen off the largest part of the dataset? Which might bridge this to more standard search strategies. Have you seen Argumentation in Artificial Intelligence?
