Thanks to Joe Carlsmith, Paul Christiano, Richard Ngo, Kate Woolverton, and Ansh Radhakrishnan for helpful conversations, comments, and/or feedback.
In "Towards understanding-based safety evaluations," I discussed why I think evaluating specifically the alignment of models is likely to require mechanistic, understanding-based evaluations rather than solely behavioral evaluations. However, I also mentioned in a footnote why I thought behavioral evaluations would likely be fine in the case of evaluating capabilities rather than evaluating alignment:
However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment.
That's because while I think it would be quite tricky for a deceptively aligned AI to sandbag its capabilities when explicitly fine-tuned on some capabilities task (that probably requires pretty advanced gradient hacking), it should be quite easy for such a model to pretend to look aligned.
In this post, I want to try to expand a bit on this point and explain exactly what assumptions I think are necessary for various different evaluations to be reliable and trustworthy. For that purpose, I'm going to talk about four different categories of evaluations and what assumptions I think are needed to make each one go through.
Before I do that, however, I want to go over the distinction between capabilities evaluations and alignment evaluations, as it'll be an important one throughout the post. Specifically:
In my opinion, if you want to craft a good governance scheme around model evaluations, you're going to need both capabilities and alignment evaluations. For example, a very simplified scheme here could be something like:
That's not to say that the above scheme is the best one (or even a good one)—I only provide it as an example of what the interplay between capabilities evaluations and alignment evaluations might look like.
Now we'll look at different evaluations and see:
Key assumption: no evaluation gaming.
We'll start with the most straightforward type of evaluation, behavioral non-fine-tuning evaluations—that is, just directly evaluating the model on some specific dataset, without any task-specific fine-tuning. Most current evaluations fall into this category, including e.g. our "Discovering Language Model Behaviors with Model-Written Evaluations" paper.
Evaluations in this category can be used for either capability or alignment purposes, depending on what you're testing. For example, evaluating the model to see how well it does on the SAT would be a capability evaluation in this category, whereas evaluating the model to see how easy it is to jailbreak would be an alignment evaluation.
Regardless, in my opinion, I think evaluations in this category will not be trustworthy or reliable for more advanced future AI systems. In particular, this style of evaluation is very easy for the model to game: since there's no training process involved in these evaluations that would penalize the model for getting the wrong answer here, a model that knows it's being evaluated can just pick whatever answer it wants so as to trick the evaluator into thinking whatever the model wants the evaluator to think.
Though this relies on the model being able to tell that it's being evaluated, the existence of predicates on the world that are easier to evaluate than generate examples of (in the same way that verifying the answer to a problem in NP can be easier than generating it) guarantees that the model should be better at distinguishing between evaluation and deployment than any evaluator can be at tricking it into thinking it's in deployment, at least if it’s willing to be quite conservative such that it assumes it’s in evaluation by default unless it sees very compelling evidence that it’s in deployment. Paul Christiano's classic example of this is a model looking for a factorization of RSA-2048 (see "Conditional defection" here).
Thus, for behavioral non-fine-tuning evaluations to be trustworthy, you have to believe—for some reason unrelated to any behavioral non-fine-tuning evaluations that you've run—that the model isn't trying to game your evaluations. Any sort of situational awareness on behalf of the model here could lead to evaluation gaming, which means you likely have to believe that the model has no situational awareness.
As a particular subcase of this, you also have to believe that deceptive alignment is highly unlikely, since deceptive alignment is a phenomenon that would cause the model to try to game the evaluations. Since I think deceptive alignment is actually quite likely, I don't think that this method of evaluation can generally be reliable as we continue to scale up models—and even if you think deceptive alignment is less likely than I do, unless you’re very willing to stake the world on that assumption, we at least probably shouldn't entrench these sorts of evaluations in any sort of regulatory framework.
Key assumption: no exploration hacking.
A behavioral RL fine-tuning evaluation is one which:
For example, an evaluation that would fall into this category might be:
Importantly, as we will see later, using RL here makes this evaluation more robust than the first type (where we do no fine-tuning), but less robust than the next type (where we do i.i.d. fine-tuning), so it's worth noting why you would ever want to do RL fine-tuning rather than i.i.d. fine-tuning. The reason is that RL lets you fine-tune on tasks that you know how to evaluate but not generate: if you can reliably evaluate performance on some task, then you can produce an RL reward signal for it, but you can't necessarily produce i.i.d. supervised fine-tuning data for that task.
Additionally, evaluations in this category are primarily only useful as capability evaluations, rather than alignment evaluations, since you're not testing whether, by default, the model would want to autonomously spin up AWS clusters—rather, you're testing whether, if specifically trained on the task, the model is capable of doing so. That being said, as I discussed previously, capabilities evaluations are still quite important—so the question becomes, under what assumptions can we trust these sorts of evaluations?
The answer is that it depends. The basic case for why this might work is that, if we’re actively training the model to perform the task, then it shouldn’t be possible for the model to simply choose to sandbag the evaluation: if the model tried to sandbag the evaluation on purpose, gradient descent should just train away that behavior.
The problem that is most concerning in this case, however, is exploration hacking: that is, situations where the model prevents the RL process from ever exploring some action that would be highly rewarded simply by never taking that action. For example, suppose the model does have the capability to train models on AWS, but it is trying to game our eval, since we don't want to assume no eval gaming as in the previous case. In that case, the model could try to simply never produce any examples of cases where it successfully trains any models on AWS by default. In that case, since the model never produces any successful examples, there's nothing for RL to reward, and so RL fails to train the model to perform the task even if the model is actually capable of performing the task if it wanted to.
That being said, there are ways to mitigate exploration hacking problems. For example, in the previous case, you could mix in some supervised fine-tuning on examples of humans performing the behavior you’re trying to train the RL model on, such that the model is now being directly trained to sometimes produce examples of the desired behavior. Once the RL model produces some examples of the intended behavior, RL should have something to reward, and the eval should be able to function. Unfortunately, there's a major issue with this approach, which is that it requires humans to be able to generate examples of the intended behavior—but as we established previously, avoiding that was the whole reason that you might prefer an RL-based evaluation in the first place.
Fortunately, RL might still be quite helpful here as a form of bootstrapping: if you can only generate weak examples, but you want to see whether the model is capable of generating stronger examples, you could first supervised fine-tune on weak examples, then RL fine-tune to try to get the model to produce stronger examples. Even if the model is trying to game your evaluation, as long as it's hard for the model to always be able to tell how hard some case would be for the humans to demonstrate, the model won't be able to tell when it's safe for it to exploration hack, since it won't be able to generally determine when a case is part of the supervised fine-tuning and when it's part of RL.
Unfortunately, especially in the cases where you might want to try to use this style of evaluation as an alignment evaluation rather than a capabilities evaluation, telling the difference between cases where humans can demonstrate the behavior and cases where they can't could be quite easy. For example, we might want to try to RL the model on the task of doing mechanistic interpretability on itself to test for deceptive alignment, while only having demonstrations of deceptive alignment in other models. In that case, if the model saw that the model it was inspecting was deceptively aligned in a similar way that it was, it would know—conditional on us not having caught its own deception yet (which is the only case that should matter to it)—that such deceptive alignment isn't the type that humans are able to demonstrate examples of. Thus, even if we would hypothetically have rewarded the model for showing us evidence of its own deception, since the model never actually does that, we're never able to actually reward it for doing so, and thus the results of our eval—"we tried to train the model to look for deception in itself and it failed, thus the model must not be deceptive"—aren't reliable, even if the model did theoretically have the relevant mechanistic interpretability capabilities.
Key assumption: no gradient hacking.
A behavioral i.i.d. fine-tuning evaluation is one which:
As in the previous case, evaluations in this category are essentially only useful as capability evaluations, rather than alignment evaluations—you're not testing whether, by default, the model would want to build biological weapons, but whether, if specifically trained on the task, the model is capable of learning the knowledge necessary to do so.
Unlike in the previous case, however, exploration hacking isn't a problem here, since we aren't doing any exploration: all of our data is pre-generated and ensured to be independently and identically distributed.
Thus, for behavioral i.i.d. fine-tuning evaluations, I think the key assumption that you need to make them trustworthy is the absence of gradient hacking. Gradient hacking describes a situation where the model is able to think in particular ways that affect its own gradient. For example: if a model has a general policy of following its own pre-commitments, and it is able to pre-commit to fail horribly on the training task if it notices that its goal has been modified, then that creates an incentive for gradient descent to avoid modifying the model's goal, since modifying the model's goal would result in the model having bad performance on the training task.
Gradient hacking problematizes these sorts of evaluations because it describes a way in which a model could fail to be successfully fine-tuned on some task despite the model actually having the capability to do that task. Specifically, a model could gradient hack in such a way that results in the fine-tuning step not working—e.g. by using a pre-commitment mechanism like I described previously—despite the model being capable of performing the task if it actually wanted to later on.
In my opinion, however, I think gradient hacking is likely to be quite difficult, such that we can probably fully trust evaluations in this category until we are well into the superhuman capability regime. That's because gradient hacking seems to require quite complex knowledge of the mechanics of gradient descent applied to the model itself—the model has to be able to predict the effects of arbitrary perturbations to its own cognition and then change its cognition to modify what the effects of different perturbations might be. That being said, I do think there are some cases where gradient hacking might be quite easy, e.g. cases where we give the model access to a database where it can record its pre-commitments or direct access to its own weights and the ability to modify them. Absent such architectures, however, I suspect that this class of evaluations will likely be generally quite reliable.
Key assumption: ???
One thing that should hopefully be clear now is that we're missing something pretty substantial: the only types of evaluations that have the potential to be trustworthy and reliable in the future only really lend themselves to capabilities evaluations, not alignment evaluations. That is, while it is plausible that either of the classes of fine-tuning-based evaluations could be highly reliable, since they both involve fine-tuning the model on the task, they're both structurally set up primarily to evaluate whether the model is capable of performing some task when trained to do so, not how likely it is do some particular thing by default.
Fortunately, we don't need good alignment evaluations just yet: current models for the most part aren't yet capable enough of causing catastrophe even if they wanted to, so right now we can mostly use capabilities evaluations to gauge model safety. Eventually, however, as models become more capable, that will cease to be true, and we'll need some mechanism of robustly evaluating alignment.
This is where I think understanding-based evaluations come into the picture: if we evaluate the extent to which we understand our models, we might be able to produce an actually robust and trustworthy evaluation of alignment. For example, we could try to get robust understanding via a non-behavioral source such as mechanistic interpretability evaluated throughout the course of training to ensure that the model never becomes deceptive in the first place.
As of right now, however, as I discuss in "Towards understanding-based safety evaluations," we don't currently know how to effectively and reliably evaluate the extent to which we understand our models, leaving that as a critical open problem in this space.
This is great. One nuance: This implies that behavioral RL fine-tuning evals are strictly less robust than behavioral I.I.D. fine-tuning evals, and that as such they would only be used for tasks that you know how to evaluate but not generate. But it seems to me that there are circumstances in which the RL-based evals could be more robust at testing capabilities, namely in cases where it's hard for a model to complete a task by the same means that humans tend to complete it, but where RL can find a shortcut that allows it to complete the task in another way. Is that right or am I misunderstanding something here?
For example, if we wanted to test whether a particular model was capable of getting 3 million points in the game of Qbert within 8 hours of gameplay time, and we fine-tuned on examples of humans doing the same, it might not be able to: achieving this in the way an expert human does might require mastering numerous difficult-to-learn subskills. But an RL fine-tuning eval might find the bug discovered by Canonical ES, illustrating the capability without needing the subskills that humans lean on.
Yes, that's right. In some sense they're evaluating different capabilities—both "can a model find a way to do this task" and "can a model do what humans do on this task" are separate capabilities, and which one you're more interested in might vary depending on why you care about the capability evaluation. In many cases, "can a model do this task the way humans do it" might be more useful, since e.g. you might care a lot if the model is capable enough to replicate complex human labor, but not really care at all if the model can find some weird hack.
The reasons I like this post:
"That being said, I do think there are some cases where gradient hacking might be quite easy, e.g. cases where we give the model access to a database where it can record its pre-commitments or direct access to its own weights and the ability to modify them.")
Arguably missing is a line or two that backtracks from "we could try to get robust understanding via a non-behavioral source such as mechanistic interpretability evaluated throughout the course of training" to (my claim) "it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment, and we don't actually know when we're going to hit that threshold", but that might be out of scope.
it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment
it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment
I mean, like I say in the post, if you have some strong reason to believe that there's no gradient-hacking going on, then I think this is safe in the i.i.d. setting, and likewise for exploration hacking in the RL setting. You just have to have that strong reason somehow (which is maybe what you mean by saying we can evaluate them for alignment?).
Regarding model-written evaluations in 1. Behavioral Non-Fine-Tuning Evaluations you write:
... this style of evaluation is very easy for the model to game: since there's no training process involved in these evaluations that would penalize the model for getting the wrong answer here, a model that knows it's being evaluated can just pick whatever answer it wants so as to trick the evaluator into thinking whatever the model wants the evaluator to think.
I would add that model-written evaluations also rely on trusting the model that writes the evaluations. This model could subtly communicate that this is part of an evaluation and which answers should be picked for the best score. The evaluation-writing model could also write evaluations that are selecting for values that it prefers over our values and make sure that other models will get a low score on the evaluation benchmark.
Perhaps there are some behavioral / black-box methods available for evaluating alignment, depending on the kind of system being evaluated.
Toy example: imagine a two part system where part A tries to do tasks and part B limits part A's compute based on the riskiness of the task. You could try to optimize the overall system towards catastrophic behavior and see how well your part B holds up.
Personally I expect monolithic systems to be hard to control than two-part systems, so I think this evaluation scheme has a good chance of being applicable. One piece of evidence: OpenAI's moderation system correctly flags most jailbreaks that get past the base model's RLHF.