Thanks for your comments, Akash. I have two main points I want to address.
Ultimately I'm not even sure whether there is a clear solution to this problem. The field is still very new, and it's amazing what has already happened. It may simply take time for the field to mature and for people to gain more experience. I mostly wanted to point this out, even if it is perhaps obvious.
My argument here is very related to what jacquesthibs mentions.
Right now it seems like the biggest bottleneck for the AI Alignment field is senior researchers. There are tons of junior people joining the field, and there are many opportunities for them to up-skill and do programs for a few months (e.g. SERI MATS, MLAB, REMIX, AGI Safety Fundamentals, etc.). The big problem (in my view) is that there are not enough organizations to actually absorb all these rather "junior" people at the moment. My sense is that 80K and most programs encourage people to up-skill and then try to get a job at a big organization (like DeepMind, Anthropic, OpenAI, Conjecture, etc.). Realistically, though, these organizations can only absorb a few people per year. In my experience, it's extremely competitive to get a job at these organizations even if you're a more experienced researcher (e.g. having done a couple of years of research, a Ph.D., or similar).

This means that while there are many opportunities for junior people to get a start in the field, there are very few paths that actually allow you to have a full-time career in it (this also applies to more experienced researchers who don't get into a big lab). So the bottleneck, in my view, is not having enough organizations, which is itself a result of not having enough senior researchers. Founding an org is super hard: you want experienced people with good research taste and some kind of research agenda. So if you don't have many senior people in a field, it will be hard to find people who can found those additional orgs.
Now, one career path that many people are currently taking is being an "independent researcher" funded through a grant. I would claim that this is currently the default path for any researcher who does not get a full-time position and wants to stay in the field. I believe there are people out there who will do great as independent researchers and actually contribute to solving problems (e.g. Marius Hobbhahn and John Wentworth talk about being independent researchers). I am, however, quite skeptical about most people doing independent research without any kind of supervision. I am not saying one can't make progress, but it's super hard to do this without a lot of research experience, a structured environment, good supervision, etc. I am especially skeptical about independent researchers becoming great senior researchers if they can't work with, and learn from, people who are already very experienced. Intuitively, I don't think any other field has junior people working independently without clear structures and supervision, so I feel like my skepticism is warranted.
In terms of career capital, being an independent researcher is also very risky. If your research fails, i.e. you don't produce a lot of good output (papers, code libraries, or whatever), "having done independent research for a couple of years" will not look great on your CV. As a comparison, if you do a very mediocre Ph.D. with no great insights but manage to get the title, at least you have that on your CV (and having a Ph.D. can be pretty useful in many cases).
So overall, I believe that decision-makers and AI field builders should put their main attention on how we can "groom" senior researchers in the field and create more full-time positions through organizations. I don't claim to have the answers on how to solve this, but it does seem to be the greatest bottleneck for field building in my opinion. The field has been able to get a lot more people excited about AI safety and willing to change their careers (though we still have far from enough people). However, right now I think many people are kind of stuck as junior researchers, having done some programs but not being able to get full-time positions. Note that I am aware that some programs, such as SERI MATS, do in some sense have the ambition of grooming senior researchers. In practice, however, it still feels like there is a big gap right now.
My background (in case this is useful): I've been doing ML research throughout my Bachelor's and Master's. I've worked at FAR AI on "AI alignment" for the last 1.5 years, so I was lucky to get a full-time position. I don't consider myself a "senior" researcher as defined in this comment, but I definitely have a lot of research experience in the field. From my own experience, it's pretty hard to find a new full-time position in the field, especially if you are also geographically constrained.
> This means that, at least in theory, the out of distribution behaviour of amortized agents can be precisely characterized even before deployment, and is likely to concentrate around previous behaviour. Moreover, the out of distribution generalization capabilities should scale in a predictable way with the capacity of the function approximator, of which we now have precise mathematical characterizations due to scaling laws.
Do you have pointers that explain this part better? I understand that scaling compute and data will reduce misgeneralization to some degree. But what is the reasoning for why misgeneralization should be predictable, given the capacity and knowledge of "in-distribution scaling laws"?
Overall I share the intuition that this should be possible. But empirically I'm not sure whether in-distribution scaling laws can tell us anything about out-of-distribution scaling laws. Surely we can predict that with increasing model and data scale, out-of-distribution misgeneralization will go down. But given that we can't really quantify all the possible out-of-distribution datasets, it's hard to make any claims about how precisely it will go down.
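To make my worry concrete, here is a minimal sketch of what fitting an in-distribution scaling law looks like (the power-law form, the numbers, and the extrapolation point are all made up for illustration). Nothing in such a fit constrains how the loss behaves on a distribution the model was neither trained nor evaluated on:

```python
# Minimal sketch: fit an in-distribution scaling law L(N) = a * N^(-b) + c
# to hypothetical measured losses at different model sizes. The fitted curve
# lets us extrapolate the *in-distribution* loss, but it says nothing about
# how the loss on some out-of-distribution dataset will scale.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

# Hypothetical (model size, in-distribution validation loss) measurements.
model_sizes = np.array([1e6, 1e7, 1e8, 1e9])
id_losses = np.array([3.2, 2.6, 2.1, 1.8])

(a, b, c), _ = curve_fit(power_law, model_sizes, id_losses, p0=[10.0, 0.1, 1.0])

# Extrapolating to a larger model is only justified in-distribution;
# the OOD loss at that scale is not determined by this fit.
print("Predicted in-distribution loss at 1e10 params:", power_law(1e10, a, b, c))
```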
That's interesting!
Yeah, I agree with that assessment. One important difference between RLHF and fine-tuning is that the former basically generates the training distribution it then trains on: the LM generates an output and is then updated based on the reward of that output. So intuitively, I think it is more likely to be optimized towards certain unwanted attractors, since the reward model shapes the future outputs the LM then learns from.
With fine-tuning, you are just cloning a fixed distribution, not influencing it (as you say). So I tend to agree that unwanted attractors are more likely to show up in the outputs of RLHF-trained models. I think we need empirical evidence for this though (to be certain).
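To spell out the structural difference I have in mind, here is a heavily simplified sketch (stand-in callables, not anyone's actual training code): in supervised fine-tuning, the data the model is updated on is fixed upfront, whereas in an RLHF-style loop the model's own samples, scored by a reward model, become the data it is updated on.

```python
# Heavily simplified sketch of the structural difference (stand-in callables,
# not a real training loop).

def supervised_finetune(model, dataset, update):
    # The training distribution is fixed upfront: the model only ever
    # sees (prompt, target) pairs from a static dataset.
    for prompt, target in dataset:
        update(model, prompt, target)

def rlhf_step(model, prompts, sample, reward_model, update):
    # The model generates the data it is then updated on: it samples an
    # output, the reward model scores it, and the update pushes future
    # samples towards high-reward outputs. The "dataset" therefore drifts
    # with the model's own behaviour, which is where attractors could arise.
    for prompt in prompts:
        output = sample(model, prompt)
        reward = reward_model(prompt, output)
        update(model, prompt, output, reward)
```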
Given your statement, I also think that doing those experiments with the GPT-3 models is going to be hard, because we basically have no way of telling what data they learned from or how it was generated. So one would need to be more scientific and train various models with various optimization schemes on known data distributions.
OpenAI has just released a description of how their models work here.
text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO).
"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison of davinci, text-davinci-002 (fine-tuning), and text-davinci-003 (RLHF) and see how the output distribution changes on various tasks.
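A rough sketch of how such a comparison could start, using the openai Python package's Completion endpoint (the prompts, sample counts, and downstream analysis below are placeholders I made up, not a worked-out experimental design):

```python
# Rough sketch: sample completions from all three models on the same prompts
# via the openai Python package's (legacy) Completion endpoint, so their
# output distributions can be compared downstream.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

MODELS = ["davinci", "text-davinci-002", "text-davinci-003"]
PROMPTS = ["Q: What is the capital of France?\nA:"]  # replace with real task prompts

samples = {}
for model_name in MODELS:
    samples[model_name] = []
    for prompt in PROMPTS:
        response = openai.Completion.create(
            model=model_name,
            prompt=prompt,
            max_tokens=64,
            temperature=1.0,
            n=20,  # several samples per prompt to estimate the output distribution
        )
        samples[model_name].extend(choice["text"] for choice in response["choices"])

# From here one could compare e.g. diversity, mode collapse, or task accuracy
# of the sampled completions across the three models.
```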
Let me know if you want help on this, I'm interested in this myself.
I looked at your code (very briefly though), and you mention this weird thing where even the normal model is sometimes completely unaligned (i.e. even in the observed case it takes the action "up" all the time). You say that this sometimes happens and that it depends on the random seed. I'm not sure (since I don't fully understand your code), but that might be something to look into, since the model could be biased in some way by the loss function.
Why am I mentioning this? Well, why does the mesa-optimized agent go upward when it's not supervised anymore? I'm not trying to poke a hole in this, I'm just curious. The fact that it can behave like this out of distribution, given all of its knowledge, makes sense. But why will it specifically go up and not down? Even if it goes down, it still satisfies your criteria for a treacherous turn. Maybe the going up has something to do with the seed-dependent tendency to go up mentioned above. So this is probably a nitpick, but it's something I've been wondering.
I'll link to the following post that came out a little bit earlier: Mysteries of Mode Collapse due to RLHF, which is basically a critique of the whole RLHF approach and the Instruct models (specifically text-davinci-002).
Would also love to have a look.
I think the terminology you are looking for is "deliberate practice" (just two random links I found). Many books/podcasts/articles have been written about the topic. The big difference is that when you "just do your research", you are executing your skills and trying to achieve the main goal (e.g. answering a research question). Yes, you sometimes need to read textbooks or learn new skills to achieve that, but this learning is usually subordinate to your end goal. One could also argue that if you actually need to invest a few hours into learning, you will probably switch into "deliberate practice mode".
Deliberate practice is the very intentional activity of improving your skill, e.g. sitting down at a piano to improve your technique or learn a new piece, improving as a writer by doing intentional exercises, or solving specific math problems that train a certain skill.
The advantage of deliberate practice is that its main goal is improving your skill. You are also usually at the edge of your ability, pushing through difficulties, which makes the whole endeavor very intense and hard.
So yes, I agree that doing research is important. Especially if you have no experience, getting better at research is usually best done by doing research. However, you still need to do other things that specifically improve subskills. Here are a few examples:
I think by adding terminology I just wanted to make explicit what you mention in your post. It will also make it easier to find resources given the word "deliberate practice".
Yes, totally agree. We are not claiming that this is a failure mode of CaSc, and it can be "easily" resolved by making your hypothesis more specific. We are merely pointing out that "In theory, this is a trivial point, but we found that in practice, it is easy to miss this distinction when there is an “obvious” algorithm to implement a given function."
> I don't know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.
You are right that this failure mode is mostly due to reducing the behavior to a single aggregate quantity, like the average loss recovered. It can be remedied by looking at the loss on individual samples instead of averaging the metric across the whole dataset. In the footnote, we point out that researchers at Redwood Research have also started looking at the per-sample loss instead of the aggregate loss.
CaSc was, however, introduced using the average scrubbed loss (even though the authors say that this metric is not ideal). Also, in practice, when one iterates on generating hypotheses and testing them with CaSc, it's more convenient to look at aggregate metrics. We thus think it is useful to have concrete examples that show how this can lead to problems.
Your suggestion of using D_KL seems like a useful improvement over most metrics. It's still possible, however, that cancellation could occur. Cancellation is mostly due to aggregating a metric (e.g., taking the mean) and less due to the specific metric used (although I could imagine that some metrics, like D_KL, leave less room for ambiguity).
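As a toy illustration of the kind of cancellation I mean (all numbers invented): a scrubbed model can differ a lot from the original on individual samples while matching it exactly in aggregate.

```python
# Toy illustration of cancellation (all numbers invented): the scrubbed model's
# per-sample losses differ a lot from the original's, but the differences cancel,
# so the aggregate (mean) loss looks like 100% loss recovered.
import numpy as np

original_loss = np.array([0.5, 0.5, 0.5, 0.5])
scrubbed_loss = np.array([0.1, 0.9, 0.9, 0.1])  # very different per sample

print(original_loss.mean(), scrubbed_loss.mean())    # both 0.5 -> looks perfect in aggregate
print(np.abs(original_loss - scrubbed_loss).mean())  # 0.4 -> large per-sample discrepancy
```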