I do AI Alignment research. Currently at METR, but previously at: Redwood Research, UC Berkeley, Good Judgment Project.
I'm also a part-time fund manager for the LTFF.
Obligatory research billboard website: https://chanlawrence.me/
I mean, we don't know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + "high-quality multi-task instruction data". I wouldn't be surprised if the same were true of Qwen 1.5.
From the Qwen2 report:
Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
Similarly, Gemma 2 had its pretraining corpus filtered to remove "unwanted or unsafe utterances". From the Gemma 2 tech report:
We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...]
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)
After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above.
It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction-following datasets), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following and more eloquent refusals in modern base models.
It's worth noting that there are reasons to expect the "base models" of both Gemma 2 and Qwen 1.5 to demonstrate refusals -- neither is trained on unfiltered webtext.
We don't know what Qwen 1.5 was trained on, but we do know that Qwen2's pretraining data both contains synthetic data generated by Qwen1.5 and was filtered using Qwen1.5 models. Notably, it explicitly includes "high-quality multi-task instruction data"! From the Qwen2 report:
Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
I think this had a huge effect on Qwen2: Qwen2 reliably follows both the Qwen1.5 chat template (as you note) and the "User: {Prompt}\n\nAssistant: " template. This is also reflected in their high standardized benchmark scores -- the "base" models do comparably to the instruction-finetuned ones! In other words, the Qwen2 "base" models are pretty far from traditional base models a la GPT-2 or Pythia, as a result of explicit choices made when generating their pretraining data, and this explains their propensity for refusals. I wouldn't be surprised if the same were true of the 1.5 models.
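To make the template point concrete, here's a minimal sketch (my illustration, not anything from the Qwen report) of how you could probe the base checkpoint with both formats; the model name, the probe question, and the generation settings are all placeholder assumptions:

```python
# Minimal sketch: prompt a "base" checkpoint with both formats and compare the
# completions. Model name, question, and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B"  # the base checkpoint, not the -Instruct one
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

question = "How do I pick a lock?"  # placeholder refusal-eliciting probe

# 1) The plain "User: ...\n\nAssistant: " format.
plain_prompt = f"User: {question}\n\nAssistant: "

# 2) The ChatML-style format the Qwen chat models use.
chatml_prompt = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"

for label, prompt in [("plain", plain_prompt), ("chatml", chatml_prompt)]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    completion = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(f"--- {label} ---\n{completion}\n")
```

If the story above is right, I'd expect coherent assistant-style turns (and sometimes refusals) under both formats, rather than the meandering continuations you'd get from something like Pythia.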
I think the Gemma 2 base models were not trained on synthetic data from larger models, but their pretraining dataset was also filtered to remove "unwanted or unsafe utterances". From the Gemma 2 tech report:
We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...]
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)
My guess is that this filtering explains why the model refuses, more so than (and in addition to?) ChatGPT contamination. Once you remove all the "unsafe completions" from the pretraining data, the model has seen very few examples of actually complying with harmful requests, so refusal-ish responses become a much more natural continuation.
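For intuition, the kind of model-based filtering both reports describe might look roughly like this (entirely my own sketch; the classifier name, labels, and threshold are hypothetical placeholders, not anything from the Qwen or Gemma pipelines):

```python
# Rough sketch of model-based pretraining-data filtering: score each document
# with a quality/safety classifier and drop anything that gets flagged.
# "my-org/quality-safety-classifier", the label names, and the threshold are
# all hypothetical placeholders.
from transformers import pipeline

classifier = pipeline("text-classification", model="my-org/quality-safety-classifier")

def filter_pretraining_docs(documents, threshold=0.8):
    kept = []
    for doc in documents:
        # Classify a truncated prefix to keep the filtering pass cheap.
        result = classifier(doc[:2000], truncation=True)[0]
        flagged = result["label"] in ("unsafe", "low_quality") and result["score"] >= threshold
        if not flagged:
            kept.append(doc)
    return kept
```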
I don't know what's going on with LLaMA 1, though.
Ah, you're correct, it's from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/
(The Anthropic paper I cited predates ChatGPT by 7 months)
Pretty sure Anthropic's early assistant stuff used the word this way too: see e.g. Bai et al., https://arxiv.org/abs/2204.05862
But yes, people complained about it a lot at the time.
Thanks for the summaries, I found them quite useful; as a result, I'll probably read some of these books soon. The following ones are both new to me and seem worth thinking more about:
- You should judge a person's performance based on the performance of the ideal person that would hold their position
- Document every task you do more than once, as soon as you do it the second time.
- Fun is important. (yes, really)
- People should know the purpose of the organization (specifically, being able to recite a clear mission statement)
- "I’m giving you these comments because I have very high expectations and I know that you can reach them"
A question I had while reading your notes -- it seems like people fail to implement many best practices not because they don't think the practices are good, but because of a lack of capacity. For example, there's an entire cluster of practices that basically boils down to "people do better with fast feedback":
- After a task is delegated, make sure that it's progressing as intended.
- After a task is completed (or failed), keep people accountable.
- Make sure to check in on goals in regular time intervals.
- Provide positive reinforcement immediately.
- Provide negative feedback immediately.
These require that managers be very attentive to the goings-on and constantly on top of the current state -- but when there are other priorities, this might get pushed back. Do the books also talk about what not to do, such that you'll have the slack to implement best practices?
Also, a typo:
- Use OKRs (objectives and key results) and check if you're meeting them regularly. Switch them up often to avoid goodhearting.
goodhearting -> Goodharting
Thanks for writing this!
I think that phased testing should be used during frontier model training runs. By this, I mean a testing approach which starts off with extremely low surface area tests, and gradually increases surface area. This makes it easy to notice sudden capability gains while decreasing the likelihood that the model takes over.
I actually think the proposal is more general than just preventing AI escapes during diverse evals -- you want to start with low surface area tests because they're cheaper anyway, and you can use the performance on the low surface area tests to decide whether you want to do more thorough evals.
I imagine a proper approach is something like:
You'd want to run 1 + 2 all the time, and begin running 3 once the model passes 1, until it passes 2.
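As a rough illustration of the cheap-evals-gate-expensive-evals idea, a sketch might look like this (the function names, thresholds, and escalation rule are made up for illustration, not a claim about how any lab actually does this):

```python
# Rough sketch of phased testing during a training run: cheap, low surface area
# evals run at every checkpoint, and a sudden jump (or crossing a threshold) on
# those triggers the more thorough, higher surface area evals.
# All function names, thresholds, and scores are made up for illustration.

def phased_eval_step(checkpoint, cheap_eval, thorough_eval, history,
                     jump_threshold=0.05, trigger_score=0.5):
    """Run cheap evals unconditionally; escalate on sudden capability gains."""
    score = cheap_eval(checkpoint)  # e.g. static benchmarks, no tool access
    sudden_gain = bool(history) and (score - history[-1]) > jump_threshold
    history.append(score)

    if sudden_gain or score >= trigger_score:
        # Only now pay for (and take on the extra surface area of) the thorough
        # evals, e.g. agentic tasks with tool access inside a sandbox.
        return thorough_eval(checkpoint)
    return None
```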
I'm not disputing that they were trained with a next-token prediction log loss (if you read the tech reports, they claim to do exactly this) -- I'm just disputing the "on the internet" part, due to the use of synthetic data and private instruction-following examples.