Yeah great question! I'm planning to hash out this concept in a future post (hopefully soon). But here are my unfinished thoughts I had recently on this.
I think using different methods to elicit "bad behavior" i.e. to red team language models have different pros and cons as you suggested (see for instance this paper by ethan perez: https://arxiv.org/abs/2202.03286). If we assume that we have a way of measuring bad behavior (i.e. a reward model or classifier that tells you when your model is outputting toxic things, being deceptive, sycophantic etc., which is very reasonable) then we can basically just empirically compare a bunch of methods and how efficient they are at eliciting bad behavior, i.e. how much compute (FLOPs) they require to get a target LM to output something "bad". The useful thing about compute is that it "easily" allows us to compare different methods, e.g. prompting, RL or activation steering. Say for instance you run your prompt optimization algorithm (e.g. persona modulation or any other method for finding good red teaming prompts) it might be hard to compare this to say how many gradient steps you took when red teaming with RL. But the way to compare those methods could be via the amount of compute they required to make the target model output bad stuff.
Obviously, you can never be sure that the method you used is actually the best and most compute efficient, i.e. there might always be an undiscovered Red teaming method which makes your target model output "bad stuff". But at least for all known red teaming methods, we can compare their compute efficiency in eliciting bad outputs. Then we can pick the most efficient one and make claims such as, the new target model X is robust to Y FLOPs of Red teaming with method Z (which is the best method we currently have). Obviously, this would not guarantee us anything. But I think in the messy world we live in it would be a good way of quantifying how robust a model is to outputting bad things. It would also allow us to compare various models and make quantitative statements about which model is more robust to outputting bad things.I'll have to think about this more and will write up my thoughts soon. But yes, if we assume that this is a great way of quantifying how "HHH" your model is, or how unjailbreakable, then it makes sense to compare Red teaming methods on how compute efficient they are.Note there is a second axis which I have not higlighted yet, which is diversity of "bad outputs" produced by the target model. This is also measured in Ethan's paper referenced above. For instance they find that prompting yields bad output less frequently, but when it does the outputs are more diverse (compared to RL). While we do care mostly about, how much compute did it take to make the model output something bad, it is also relevant whether this optimized method now allows you to get diverse outputs or not (arguably one might care more or less about this depending on what statement one would like to make). I'm still thinking about how diversity fits in this picture.
Thanks a lot for this helpful comment! You are absolutely right; the citations refer to goal misgeneralization which is a problem of inner alignment, whereas goal misspecificatin is related to outer alignment. I have updated the post to reflect this.
Seems to me like this is easily resolved so long as you don't screw up your book keeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?
Yes totally agree. Here we are not claiming that this is a failure mode of CaSc, and it can "easily" be resolved by making your hypothesis more specific. We are merely pointing out that "In theory, this is a trivial point, but we found that in practice, it is easy to miss this distinction when there is an “obvious” algorithm to implement a given function."I don't know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.
You are right that this is a failure mode that is mostly due to reducing the behavior down into a single aggregate quantity like the average loss recovered. It can be remedied when looking at the loss on individual samples and not averaging the metric across the whole dataset. In the footnote, we point out that researchers at Redwood Research have actually also started looking at the per-sample loss instead of the aggregate loss.
CaSc was, however, introduced by looking at the average scrubbed loss (even though they say that this metric is not ideal). Also, in practice, when one iterates on generating hypotheses and testing them with CaSc, it's more convenient to look at aggregate metrics. We thus think it is useful to have concrete examples that show how this can lead to problems.Your suggestion of using DKL seems a useful improvement compared to most metrics. It's, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating over a metric (e.g., the mean) and less due to the specific metric used (although I could imagine that some metrics like DKL could allow for less ambiguity).
Thanks for your comments Akash. I think I have two main points I want to address.
Ultimately I'm not even sure whether there is a clear solution to this problem. The field is still very new and it's amazing what has already happened. It's probable that it just takes time for the field to mature and people getting more experience. I think I mostly wanted to point this out, even if it is maybe obvious.
My argument here is very related to what jacquesthibs mentions.Right now it seems like the biggest bottleneck for the AI Alignment field is senior researchers. There are tons of junior people joining the field and I think there are many opportunities for junior people to up-skill and do some programs for a few months (e.g. SERI MATS, MLAB, REMIX, AGI Safety Fundamentals, etc.). The big problem (in my view) is that there are not enough organizations to actually absorb all the rather "junior" people at the moment. My sense is that 80K and most programs encourage people to up-skill and then try to get a job at a big organization (like Deepmind, Anthropic, OpenAI, Conjecture, etc.). Realistically speaking though, these organizations can only absorb a few people in a year. In my experience, it's extremely competitive to get a job at these organizations even if you're a more experienced researcher (e.g. having done a couple of years of research, a Ph.D., or similar). This means that while there are many opportunities for junior people to get a stand in the field, there are actually very few paths that actually allow you to have a full-time career in this field (this is also for more experienced researchers who don't get a big lab). So the bottleneck in my view is not having enough organizations, which is a result of not having enough senior researchers. Funding an org is super hard, you want to have experienced people, with good research taste, and some kind of research agenda. So if you don't have many senior people in a field, it will be hard to find people that fund those additional orgs.Now, one career path that many people are currently taking, is being an "independent researcher" and being funded through a grant. I would claim that this is currently the default path for any researcher who do not get a full-time position and want to stay in the field. I believe that there are people out there who will do great as independent researchers and actually contribute to solving problems (e.g. Marius Hobbhahn and John Wenthworth talk bout being an independent researchers). I am however quite skeptical about most people doing independent research without any kind of supervision. I am not saying one can't make progress, but it's super hard to do this without a lot of research experience, a structured environment, good supervision, etc. I am especially skeptical about independent researchers becoming great senior researchers if they can't work with people who are already very experienced and learn from them. Intuitively I think that no other field has junior people independently working without clear structures and supervision, so I feel like my skepticism is warranted.
In terms of career capital, being an independent researcher is also very risky. If your research fails, i.e. you don't get a lot of good output (papers, code libraries, or whatever), "having done independent research for a couple of years" will not sound great in your CV. As a comparison, if you somehow do a very mediocre Ph.D. with no great insights, but you do manage to get the title, at least you have that in your CV (having a Ph.D. can be pretty useful in many cases).
So overall I believe that decision makers and AI field builders should put their main attention on how we can "groom" senior researchers in the field and get more full-time positions through organizations. I don't claim to have the answers on how to solve this. But it does seem the greatest bottleneck for field building in my opinion. It seems like the field was able to get a lot more people excited about AI safety and to change their careers (we still have by far not enough people though). However right I think that many people are kind of stuck as junior researchers, having done some programs, and not being able to get full-time positions. Note that I am aware that some programs such as SERI MATS do in some sense have the ambition of grooming senior researchers. However, in practice, it still feels like there is a big gap right now.My background (in case this is useful): I've been doing ML research throughout my Bachelor's and Masters. I've worked at FAR AI on "AI alignment" for the last 1.5 years, so I was lucky to get a full-time position. I don't consider myself a "senior" researcher as defined in this comment, but I definitely have a lot of research experience in the field. From my own experience, it's pretty hard to find a new full-time position in the field, especially if you are also geographically constrained.
This means that, at least in theory, the out of distribution behaviour of amortized agents can be precisely characterized even before deployment, and is likely to concentrate around previous behaviour. Moreover, the out of distribution generalization capabilities should scale in a predictable way with the capacity of the function approximator, of which we now have precise mathematical characterizations due to scaling laws.
Do you have pointers that explain this part better? I understand that scaling computing and data will improve misgeneralization to some degree (i.e. reduce it). But what is the reasoning why misgeneralization should be predictable, given the capacity and the knowledge of "in-distribution scaling laws"?Overall I hold the same opinion, that intuitively this should be possible. But empirically I'm not sure whether in-distribution scaling laws can tell us anything about out-of-distribution scaling laws. Surely we can predict that with increasing model & data scale the out-of-distribution misgeneralization will go down. But given that we can't really quantify all the possible out-of-distribution datasets, it's hard to make any claims about how precisely it will go down.
That's interesting!Yeah, I agree with that assessment. One important difference in RLHF vs fine-tuning is that the former basically generates the training distribution it then trains on. So, the LM will generate an output, and update its gradients based on the reward of that output. So intuitively I think it has a higher likelihood to be optimized towards certain unwanted attractors since the reward model will shape the future outputs it then learns from.
With fine-tuning you are just cloning a fixed distribution, and not influencing it (as you say). So I tend to agree that probably unwanted attractors could likely be due to the outputs of RLHF-trained models. I think that we need empirical evidence for this though (to be certain).
Given your statement, I also think that doing those experiments with GPT-3 models is gonna be hard because we basically have no way of telling what data it learned from, how it was generated, etc. So one would need to be more scientific and train various models with various optimization schemes, on known data distributions.
OpenAI has just released a description of how their models work here. text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO). "FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.I think your findings are still very interesting. Because they imply that even further finetuning, changes the distribution significantly! Given all this information one could now actually run a systematic comparison of davinci, text-davinci-002 (finetuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks. Let me know if you want help on this, I'm interested in this myself.
I looked at your code (very briefly though), and you mention this weird thing where even the normal model sometimes is completely unaligned (i.e. even in the observed case it takes the action "up" all the time). You say that this sometimes happens and that it depends on the random seed. Not sure (since I don't fully understand your code), but that might be something to look into since somehow the model could be biased in some way given the loss function.
Why am I mentioning this? Well, why does it happen that the mesa-optimized agent happens to go upward when it's not supervised anymore? I'm not trying to poke a hole in this, I'm generally just curious. The fact that it can behave out of distribution given all of its knowledge makes sense. But why will it specifically go up, and not down? I mean even if it goes down it still satisfies your criteria of a treacherous turn. But maybe the going up has something to do with this tendency of going up depending on the random seed. So probably this is a nitpick, but just something I've been wondering.
I'll link to the following post that came out a little bit earlier Mysteries of Mode Collapse due to RLHF, which is basically a critique of the whole RLHF approach and the Instruct Models (specifically text-davinci-002).