Fabien Roger

Sequences

AI Control

Wiki Contributions

Comments

In a few months, I will be leaving Redwood Research (where I am working as a researcher) and I will be joining one of Anthropic’s safety teams.

I think that, over the past year, Redwood has done some of the best AGI safety research and I expect it will continue doing so when I am gone.

At Anthropic, I will help Ethan Perez’s team pursue research directions that in part stemmed from research done at Redwood. I have already talked with Ethan on many occasions, and I’m excited about the safety research I’m going to be doing there. Note that I don’t endorse everything Anthropic does; the main reason I am joining is I might do better and/or higher impact research there.

I did almost all my research at Redwood and under the guidance of the brilliant people working there, so I don’t know yet how happy I will be about my impact working in another research environment, with other research styles, perspectives, and opportunities - that’s something I will learn while working there. I will reconsider whether to stay at Anthropic, return to Redwood, or go elsewhere in February/March next year (Manifold market here), and I will release an internal and an external write-up of my views.

Are board members working full-time on being board members at OpenAI? If so, I would expect that they could take actions to alleviate their lack of technical expertises by spending 15h/week to get up to speed on the technical side, reading papers and maybe learning to train LMs themselves. It naively seems like AI is sufficiently shallow that 15h/week is enough to get most of the expertise you need within a year.

Fabien RogerΩ17306

Cool work!

I wouldn't be that surprised if it required many more samples to fine-tune 3.5 to play chess with weird formatting, since it doesn't really "know" how to do it with other formats by default, but I'd be surprised if it took a huge amount of data, and I'm curious exactly how much fine-tuning is required.

I think the title oversells the results. Would you mind changing the title to "When fine-tuning fails to elicit GPT-3.5's chess abilities"? I think it's wrong, given the current evidence you provide, to state that 1. fine-tuning isn't enough (maybe you just haven't done enough) and 2. that this is a general phenomenon, as opposed to something very specific to GPT-3.5's chess abilities - which seems likely given that 3.5 was likely trained on a huge amount of chess with this exact notation, which might make it actually dumber if you use a different format, a little bit like GPT-4 being dumber if it has to write in base64.

A few questions about empirical details:

  • How big is N in best-of-N fine-tuning? Do you expect the same result with "gold samples" fine-tuning? Or is a large fraction of the gap mostly an exploration issue?
  • What are the SF L1 and SF L5 models?
  • How many samples did you use for fine-tuning? Did you do any hyperparameter search over learning rate and number of epochs? (In our password-locked model experiments, fine-tuning is severely underpowered if you don't fine-tune for enough epochs or with the wrong learning rate - though for the reasons described above I expect a much worse sample efficiency than in the password-locked models experiments).
  • Did you run a single fine-tuning run per setup? I've had some variability between fine-tuning runs in my steganography experiments, and I found that using multiple seeds was crucial to get meaningful results.

I would also be curious to know what happens if you use full-weight fine-tuning instead of the very light LoRA that OpenAI is probably using, but I guess you can't know that. Maybe if you trained a model from scratch on 50% chess 50% openwebtext you could do such experiments. It might be the case that a big reason why fine-tuning is so weak here is that OpenAI is using a very underpowered LoRA adapter.

nothing involving passwords

Agreed. We think the results probably also hold for models that were trained to imitate weak behavior, as long as the strong behavior is still "there in the latents" (you didn't just butcher the model). Passwords just make all experiments more convenient because it enables us to elicit this latent knowledge.

I'm not sure how your explanation works, let me try to say it in my own words: the LLM knows the truth. It is trained on a wide distribution to generate answers that follow a weird convention. It is now trained to memorize the weird convention on 2 samples. It concludes that the convention is rare and that it doesn't apply to the wider distribution of problems - because it forgets it saw the weird convention on the wider distribution?

If it's because it forgets, and you have no explanation of why it forgets, I'm not sure what this adds on top of the very high-level "fine-tuned model can catastrophically forget what they were fine-tuned to do (while remembering pretraining knowledge / "the truth" very well)" which I don't know a good explanation for.

Or is there something else where is remembers that it was trained to follow conventions on the whole distribution, but just reverts to the "true" answers just because? (I don't know how this works.)

(Note that weak sample unlocking is a phenomenon that we observed with very weak LLMs (7B models, GPT-2 trained from scratch, toy models trained from scratch) that don't have "internal thoughts" close to a human internal monologue, and training isn't well describe as "making an LLM think about how the world is different". But 1. I wouldn't be extremely surprised if there was some very shallow and simple bag of heuristics which approximate human internal monologue well for this sort of question 2. I'm curious how this works in cases where LLMs are doing this sort of explicit thinking.)

That is a clear limitation of our experiments (and we make it explicit both in this post and in the paper).

I think our experiments are still a good reason to update (a little) even for scheming threat models:

  • We see similar (though weaker) results when training models from scratch in a toy task;
  • I don't see a strong reason why the behavior and sample efficiency of a fine-tuned model should be radically different from behavior that arises naturally from pretraining (though I expect significant differences because fine-tuning makes password-locking way too salient). The fact that fine-tuning on 1 correct generation provides many bits of evidence against the "lock" applies just as much for our password-locked models as for scheming models.

My biggest uncertainty is if SGD will be able to find a part of a scheming LLM to modify that would weaken the lock (e.g. by making it less schemy, or giving it some urge to give good answers, or updating its world model on the general kind of information humans have access to). Sadly, I think this is the kind of uncertainty that is very hard to resolve, though some projects (such as the ones in "build better password-locked models" in the Future work section) may give some (small) evidence about this (up or down).

You're right, we don't run few-shot prompting experiment.

Few-shot prompting is a prompting strategy, so it has all the regular weaknesses of prompting schemers (a schemer can decide to not follow the pattern of the few-shot prompt).

When I wrote "Can fine-tuning elicit LLM abilities when prompting can't?" I meant "conditional on prompting failing (in our case because we trained them to), how well does fine-tuning work?". Password-locked models should, by definition, only obey prompting when the password is present, so they are not meant to be used to study prompting.

(We didn't actually bother making our models robust to all kinds of few-shot prompting and jailbreak attacks, so I expect that our models are weak against those, but I don't think such experiments would tell you much about what would happen with schemers.)

(One additional data point: our MMLU always has a 1-shot prompt that we use when training the locked model and when evaluating it, so the model is robust to 1-shot prompting. In general, it's pretty trivial to train models to ignore few-shot instructions if you bother trying.)

A few elements that support that few-shot prompting is "just prompting": few-shot prompting is sometimes randomly weak, and it works even if you shuffle labels

What would you hope to learn from few-shot prompting experiments?

[Edit: I was a bit rude in my original answer, I fixed it.]

What attack budget are you imagining defending against?

Rosati 2024 looks at fine-tuning for 1 epoch on 10k samples, which is a tiny attack budget relative to pretrain. If your threat model is the open source community unlocking HHH models, then the attack budget could be at least $1M, maybe much more. If the threat model is China or large terrorist groups, then you should probably be looking at a budget closer to 1%-10% of the cost of training a model from scratch. I have thought about defending against the latter threat, and I don't see a promising path towards making it hard for such well-funded attackers to fine-tune LLMs (including hard to fine-tune in general, not just domain-specifically).

Fabien RogerΩ231

These vectors are not "linear probes" (which are generally optimized via SGD on a logistic regression task for a supervised dataset of yes/no examples), they are difference-in-means of activation vectors

I think DIM and LR aren't spiritually different (e.g. LR with infinite L2 regularization gives you the same direction as DIM), even though in practice DIM is better for steering (and ablations). But I agree with you that "steering vectors" is the good expression to talk about directions used for steering (while I would use linear probes to talk about directions used to extract information or trained to extract information and used for another purpose).

Load More