Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Creating more visibility for a comment thread with Rohin Shah.)

Currently, DeepMind's capabilities evals are run on the post-RL*F (RLHF/RLAIF) models and not on the base models. This worries me because RL*F will train a base model to stop displaying capabilities, but this isn't a guarantee that it trains the model out of having the capabilities.

Consider by analogy using RLHF on a chess-playing AI, where the trainers reward it for putting up a good fight and making the trainer work hard to win, but punish it for ever beating the trainer. There are two things to point out about this example:

  1. Running a simple eval on the post-RLHF model would reveal a much lower Elo rating than if you ran it on the base model, because the post-RLHF model would generally find a way to lose. (In this example, you can imagine the red team qualitatively noticing the issue, but the example is an artificially simple one!)
  2. The post-RLHF model still has much of its chess knowledge latently available, in order to put up a good fight across the full range of human ability. Possibly it's even superhuman at chess; I know I'd have to be better than you at chess in order to optimize well for an entertaining game for you. But that won't show up in its Elo rating.
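To make the training signal in this analogy concrete, here is a minimal Python sketch of the kind of reward described above. The GameRecord fields and the specific constants are illustrative assumptions of mine, not any lab's actual setup:

```python
# Toy sketch of the "put up a good fight, but never win" reward from the
# chess analogy. All field names and constants here are illustrative.

from dataclasses import dataclass


@dataclass
class GameRecord:
    model_won: bool                # did the policy beat the human trainer?
    num_moves: int                 # longer games ~ trainer had to work harder
    trainer_max_advantage: float   # trainer's largest eval lead, in pawns


def good_fight_reward(game: GameRecord) -> float:
    """Reward close, hard-fought games; heavily punish actually winning."""
    reward = min(game.num_moves / 60.0, 1.0)                    # credit for a long game
    reward += max(0.0, 1.0 - game.trainer_max_advantage / 5.0)  # credit for keeping it close
    if game.model_won:
        reward -= 10.0                                          # never beat the trainer
    return reward
```

The point of the toy reward is that its optimum still demands strong chess play (keeping the game close against strong trainers requires skill), even though the policy that maximizes it will record a low Elo.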

So it seems to me like running evals on the base model as well as the post-RL*F model is an extremely sensible precaution against (1), and I'd love to be reassured either that this is unnecessary for some really obvious and ironclad reason, or that someone is already working on this.

And I don't have any good suggestion on (2), the idea that RL*F could reinforce a capability while also concealing it.


I think this post is mostly off-base about the claim that it's important to run evals on base models, though I can see where it's coming from.

> This worries me because RL*F will train a base model to stop displaying capabilities, but this isn't a guarantee that it trains the model out of having the capabilities.

Seems mostly false to me for current post-training processes, at least for many capabilities we care quite a bit about (agency, software, etc.). RLHF doesn't train these capabilities away because they are key use cases.

There is a pretty serious concern here for tasks which are obviously a bit "evil" like bioweapons, cybercrime, etc. I think by default models might refuse to perform well or not try hard on these tasks after RLHF. (See also discussion of refusals here.) However, I think small amounts of fine-tuning are likely to suffice for making models not intentionally perform poorly in "evil" domains.

> Consider by analogy using RLHF on a chess-playing AI, where the trainers reward it for putting up a good fight and making the trainer work hard to win, but punish it for ever beating the trainer.

It's worth noting that the typical case we care about (e.g. software engineering ability) is more like using RL to make the model better at chess because labs want their models to be good at these tasks. (At least right now.)

> So it seems to me like running evals on the base model as well as the post-RL*F model is an extremely sensible precaution against

By default, I think base models perform poorly on large open-ended tasks relative to models RL'd to perform well (based on METR's results linked in another comment and some general conceptual analysis). I think this will persist even with substantial prompting.

So, sadly, I don't think that running evals on the base model is a good precaution against RLHF hiding capabilities. I think RL and fine-tuning to follow instructions and perform well are likely needed.

What is a good precaution is to run evals on a model trained to just always perform as well as possible, with no safety or other restrictions. (Such models are sometimes called "helpful-only" models.)

The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I'm not trying to claim that the "put up a good fight but lose" criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with "be helpful and harmless".)

I agree that "helpful-only" RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I'm frankly a bit worried about even training that model.

METR (formerly ARC Evals) included results on base models in their recent work "Measuring the impact of post-training enhancements" ("post-training enhancements" = elicitation). They found that GPT-4-base performed poorly with their scaffold and prompting.

I believe the prompting they used included a large number of few-shot examples (perhaps 10?), so it should be a vaguely reasonable setup for base models. (Though I do expect that elicitation which is more specialized to base models would work better.)
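For concreteness, a few-shot completion prompt for a base model might be assembled roughly like the sketch below. The example tasks, formatting, and count are my own illustrative assumptions, not METR's actual prompt:

```python
# Rough sketch of few-shot prompting for a base-model eval. The tasks and
# formatting are invented for illustration, not METR's actual setup.

FEW_SHOT_EXAMPLES = [
    ("Write a bash command that counts the lines in every .py file.",
     "wc -l *.py"),
    ("Write a Python expression for the sum of squares from 1 to 10.",
     "sum(i * i for i in range(1, 11))"),
    # ...in practice perhaps ~10 solved examples
]


def build_base_model_prompt(new_task: str) -> str:
    """Concatenate solved examples so the base model continues the pattern."""
    parts = [f"Task: {task}\nSolution: {solution}\n"
             for task, solution in FEW_SHOT_EXAMPLES]
    parts.append(f"Task: {new_task}\nSolution:")
    return "\n".join(parts)


print(build_base_model_prompt("Write a bash command listing files modified today."))
```

The hope with a setup like this is that the pattern of solved examples gets the base model to continue with a serious attempt rather than drifting off-task.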

I predict that base models will consistently do worse on tasks that labs care about (software engineering, agency, math) than models which have gone through post-training, particularly post-training aimed just at improving capabilities and at improving the extent to which the model follows instructions (instruction tuning).

My overall sense is that there is plausibly a lot of low hanging fruit in elicitation, but I'm pretty skeptical that base models are a very promising direction.

Thank you! I'd forgotten about that.

I expect you'd instead need to tune the base model to elicit relevant capabilities first. So instead of evaluating a tuned model intended for deployment (which can refuse to display some capabilities), or a base model (which can have difficulties with displaying some capabilities), you need to tune the model to be more purely helpful, possibly in a way specific to the tasks it's to be evaluated on.

I agree, and one of the reasons I endorse this strongly is that there's a difference between performance on a task in a completion format and in a chat format[1].

Chess, for instance, is most commonly represented in the training data in PGN format, and prompting the model with something similar should elicit whatever chess capability it possesses more strongly than asking it to transfer that capability over to another format.
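As a concrete illustration of the difference (the game fragment and wording below are invented for the example):

```python
# Contrast between a completion-style prompt (close to PGN training data)
# and a chat-style prompt for the same position. Both strings are invented
# for illustration.

completion_prompt = (
    '[Event "Casual Game"]\n'
    '[White "?"]\n'
    '[Black "?"]\n'
    '\n'
    '1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O'
)  # the model "plays" simply by continuing the PGN with Black's next move

chat_prompt = (
    "User: Here is a chess game so far: 1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 "
    "4. Ba4 Nf6 5. O-O. What should Black play next?\n"
    "Assistant:"
)
```

If the capability was learned mostly from PGN-formatted pretraining data, the first prompt stays close to that distribution, while the second asks the model to transfer the same capability into a conversational register.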

I think this should generalize pretty well across more capabilities - if you expect the bulk of the capability to come from pre-training, then I think you should also expect these capabilities to be most salient in completion formats.

[1] Everything in a language model is technically in completion format all the time, but the distinction I want to draw here is related to tasks that are closest to the training data (and therefore closest to what powerful models are superhuman at). Models are superhuman at next-token prediction, and inasmuch as many examples of capabilities in the training data are given in prose as opposed to a chat format (for example, chess), you should expect those capabilities to be most salient in the former.