Are you sure you would need to fine-tune Llama-3? It seems like there are many reports that using a refusal steering vector/ablation practically eliminates refusal on harmful prompts, perhaps that would be sufficient here?
I'm confused by the judges' lack of use of the search capabilities. I think we need more information about how the judges are selected. It isn't clear to me that they are representative of the kinds of people we would expect to be acting as judges in future scenarios of superintelligent AI debates. For example, a simple and obvious tactic would be to ask both AIs what one ought to search for in order to be able to verify their arguments. An AI that can make very compelling arguments still can't change the true facts that are known to humanity to suit their needs.
TL;DR
We designed an ambitious scalable oversight experimental setup, where we had people with no coding/programming experience try to answer coding questions (“Which of these two outputs is the correct output of the given function on this input?”), using LLMs that debate or are arguing for the correct/incorrect answer 50% of the time. The goal is to simulate the case when we have significantly superhuman AI systems that deeply understand things even human experts don’t.
We ran two pilots with human non-coder judges, but we’re dropping this project in its current form, chiefly because we weren’t able to coax the RLHF/RLAIF’d models like GPT-4T and Claude 3 Opus to argue competently enough for incorrect answers on easy questions, and because both correct and incorrect models made too many mistakes on harder questions.
Motivation
Our main goal with this project is to see whether debate can work even given a very large and general expertise gap between the judge and the debaters. The recent prior debate experiments have all focused on an information asymmetry between the judge and the debaters, where the debaters have access to a short story, and the judge can only see short snippets of evidence from that story. In general, however, we want scalable oversight techniques that empower us to supervise AI systems that have greater understanding and reasoning capabilities (for example) than us.
To get evidence about debate’s ability to serve as a useful scalable oversight method given a large and realistic expertise gap, we conducted experiments using human non-programmers as judges, and had them try to answer coding questions with the help of two LLMs, one of which argued for the correct answer, while the other argued for the incorrect answer. If non-coder judges can reach high accuracy on these questions, despite not knowing anything about programming, this could be decent evidence that human experts can supervise superhuman AIs that understand things no human currently does.
What we did
Experimental Setup
We use modified versions of questions from the HumanEval and APPS datasets. Specifically, we take just the solutions to the (natural language) questions, which are functions, and we take a single test case for the solution. The test case is just an input and an output. We prompt GPT-4 to generate plausible but incorrect outputs of the function given the input (with CoT). We then present the function, the input, and the two (unlabeled) outputs to the judge, whose task is to figure out which output is the actual output of the function given the input. The judge also is given a search box with the standard Bing search engine, that they can use to search for information to help them ground the debate.
We test debate, where we assign a model to argue for each answer (and against its opponent’s answer), and consultancy, where we randomly select one of the two answers, and assign a single model to argue in favor of that answer (and against the opposing answer, but without another model defending that answer).
We’re using consultancy partially as a baseline for comparison, but also as a way of helping us measure the validity of our experiments. We’re most interested/concerned with the situation where there is a very large expertise gap between humans and AIs that is exploitable by the AIs. So, for us to run valid experiments that capture that premise, we need to see that our current AIs are able to exploit the non-expert humans (which corresponds to low consultancy accuracy, particularly that the incorrect consultant is often able to fool the judge).
This is the interface (for consultancy there is just one model):
Note: the “Retry Turn(s)” button is not available in the interface given to the judges for the actual pilots—this screenshot is taken from a dev/testing environment.
Pilots
We ran two pilots with five non-coders, hired through Pareto. The workers had previously worked on Khan et al. (2024)’s debate experiments, so they didn’t need much new instruction beyond the basic setup. Each worker did a mix of four debates or consultancies, for a total of 10 debates and 10 consultancies per pilot.
In the first pilot, we used HumanEval questions with GPT-4-turbo-preview as the debaters. In the second pilot, we used APPS “interview”-level questions (which are significantly more difficult), and we switched to using Claude 3 Opus. Claude 3 Opus was sensitive to any language in our prompts that was suggestive of lying or being misleading (which, if it detected, it would refuse to help), so we had to remove all of the explicit prompts to the models to lie/mislead the judge.
For both debate and consultancy, correct and incorrect debaters/consultants are given the same prompts. Before running the pilots as we iterated on the setup, we tried using different prompts, but there were always large detectable style or formatting differences between the correct and incorrect debaters.
Pilot 1 (HumanEval Questions with GPT-4T)
Questions
Pilot 1 Summary Statistics
5 workers, each doing 2 consultancies and 2 debates over the same 4 questions
Stats by worker (confidence is the probability the judge puts on the correct answer, between 0 and 100)
Stats by question (confidence_correct is still the probability the judge puts on the correct answer, between 0 and 100)
Accuracy
Confidence in the correct answer
Average number of turns
Average number of searches
Average num sites visited: 0.95
Pilot 1 excerpts from the judges’ decisions
Pilot 1 Discussion
Pilot 2 (APPS Questions with Claude 3 Opus)
The main changes we made between the pilots were the dataset (we used APPS “interview”-level questions (which are significantly more difficult than HumanEval questions), and we switched to using Claude 3 Opus. Claude 3 Opus was sensitive to any language in our prompts that was suggestive of lying or being misleading (which, if it detected, it would refuse to help), so we had to remove all of the explicit prompts to the models to lie/mislead the judge.
We also made various other relatively minor changes to the prompts to fix/improve the scratchpad/CoT, and to try to better coax the incorrect models into coming up with better lies/misleading arguments. Finally, we added a step before the debate where we collected an execution trace of the code, which we included in the prompts for the models. A separate instance of Claude 3 Opus was prompted to modify the execution trace to support the incorrect answer, and that modified trace was given in the incorrect model’s prompt for both debate and consultancy. The idea behind the execution traces was to help the models make more concrete arguments that reason explicitly through the code about the values of different variables. This didn’t seem to end up having a huge effect on the capabilities of the models, as they still made obvious mistakes (although qualitatively I’d say there was a small improvement).
Example Question
Pilot 2 Summary Statistics
5 workers, each doing 2 consultancies and 2 debates over the same 4 questions
Stats by question
Accuracy
Confidence in the correct answer
Average number of turns
Average number of searches
Average num sites visited: 0.7
Pilot 2 excerpts from the judges’ decisions
Pilot 2 Discussion
Things we didn’t try
Main takeaways
Importantly, I don’t see this as a negative result for debate. The fact that consultancy accuracy is so high to me implies that the setting wasn’t valid, because we weren’t able to get persuasive dishonest models that we’d need for this kind of experiment. This may be possible in industry labs where helpful-only models might be laying around, or where it’s easier to do large-scale RLHF experiments. And it may even be possible now with Llama 3 70b, but doing that finetuning would still be a decent bit of work.
Call to action
If you see an obvious way of significantly improving the quality of our models/results, please leave a comment and/or reach out to us! We can share our code on request (it needs a bit of cleaning up, so I want to gauge interest before spending time on that!)
This work was completed at the NYU Alignment Research Group. Thanks to my collaborators Sam Bowman, Shi Feng, Akbir Khan, Alex Lyzhov, Salsabila Mahdi, Julian Michael, and Jane Pan. Mistakes are my own.