Note: I'm pretty uncertain of my conclusions in this post, and want to hear other people's thoughts on this.

It seems possible to bootstrap language models to some degree. What are the safety implications of this?

For example, you could (a code sketch of this loop follows the list):

  1. Ask a language model to write prompts for short stories

  2. Give the language model these prompts and have it generate short stories

  3. Use these new short stories as training data, thus improving the model's ability to write short stories (optionally, you could select for good short stories and train only on those)

  4. Repeat until the model is really good at writing short stories.
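A minimal code sketch of this loop, assuming hypothetical `generate`, `finetune`, and `score` helpers that stand in for whatever sampling, training, and filtering machinery is actually available:

```python
def bootstrap(model, generate, finetune, score=None,
              n_rounds=5, samples_per_round=1000, quality_threshold=None):
    """Hypothetical sketch of the bootstrapping loop above."""
    for _ in range(n_rounds):
        # Step 1: ask the model for short-story prompts.
        prompts = [generate(model, "Write a prompt for a short story.")
                   for _ in range(samples_per_round)]
        # Step 2: have the model write a story for each prompt.
        stories = [generate(model, p) for p in prompts]
        # Step 3 (optional): keep only stories the filter judges good enough.
        if score is not None and quality_threshold is not None:
            stories = [s for s in stories if score(s) >= quality_threshold]
        # Train on the surviving stories, then repeat.
        model = finetune(model, stories)
    return model
```

Whether the optional `score` filter is used is what separates the two cases discussed below.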

Would this lead to arbitrarily large increases in capability?

In the case where we don't select for well-written short stories, I would guess no. I would expect that a technique like this would improve the model's ability to write good short stories to a limited extent at the cost of getting worse on unrelated tasks.

Conceptually, retraining a model on a subset of possible outputs seems like it would bias the model to produce similar output in the future; a form of fine-tuning. I would expect that this would increase the model's performance at related tasks (e.g. write long stories) and reduce its performance on unrelated tasks (making lists?).

At best, the language model produces short stories similar to the original training data. After it trains on the newly generated data, the new short stories should be similar in quality to the output it was producing before (slightly better, because it has been fine-tuned to produce short stories). It seems hard to produce training data with significantly higher quality overall without filtering the output somehow.

In the case where we filter the output to produce high quality training data, it seems one could increase capability. To obtain a large amount of high-quality data, the model needs to produce good short stories frequently enough, and the filter must do a good job of selecting these stories.

The rate at which a model can produce good short stories depends on how much overlap its output distribution has with the hypothetical output distribution of a model trained on high-quality data. As the model improves, it will produce high-quality data with higher frequency.

The quality of the filter determines the upper-bound of performance. If the filter cannot discriminate high-quality stories from extremely-high-quality stories, it seems that performance will increase until all the stories produced by the model are indistinguishable to the filter. If humans are doing the filtering, then the bootstrapping process will be limited by people's ability to differentiate between really good short stories.

Combining these two factors, I would expect a bootstrapped model to see initial accelerating returns that asymptote to the upper bound determined by the quality of the filter.
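As a toy numerical illustration of this claim (assuming, very unrealistically, that story quality is a single scalar and that each round of training shifts the model's mean quality toward the stories the filter accepted), quality climbs quickly at first and then stalls once most outputs exceed the filter's discrimination ceiling:

```python
import random

CEILING = 0.9  # above this quality, the filter cannot tell stories apart

def filter_accepts(quality, current_mean):
    # The filter only "sees" quality up to its discrimination ceiling.
    return min(quality, CEILING) > min(current_mean, CEILING)

mean_quality, spread = 0.2, 0.1
for _ in range(50):
    samples = [random.gauss(mean_quality, spread) for _ in range(200)]
    kept = [q for q in samples if filter_accepts(q, mean_quality)]
    if kept:
        # "Training" shifts the model toward the stories the filter kept.
        mean_quality = sum(kept) / len(kept)

print(round(mean_quality, 2))  # climbs fast at first, then stalls near the ceiling
```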

So overall, bootstrapping seems like it could be useful for fine-tuning a model, or for improving it further given a method to filter for quality. But relative to other techniques like scaling, it seems less likely to lead to dangerous increases in capability.


> The quality of the filter determines the upper-bound of performance. If the filter cannot discriminate high-quality stories from extremely-high-quality stories, it seems that performance will increase until all the stories produced by the model are indistinguishable to the filter. If humans are doing the filtering, then the bootstrapping process will be limited by people's ability to differentiate between really good short stories.

The quality of the filter certainly matters, but this is a narrow way to look at it, and the filter may not be all that important.

When I think about the tangled topics of expert iteration, self-play, self-distillation, inner-monologue, finetuning, meta-learning, sim2real, domain randomization, iterated amplification, etc, I tend to ask, "where is new information coming from?" The quality of the filter is far downstream of the big picture.

For example, in AlphaGo/MuZero, there is not really any 'filter', and also we know that the bootstrapping process doesn't asymptote at any "upper bound determined by the quality of the filter". (It doesn't really 'asymptote' at all, inasmuch as, as far as we know, you can just keep pouring on the compute to train a bigger model to distill even more accurate tree search, and available evidence like Jones shows similar scaling law behavior.) AG doesn't need any new information because all the complexity of Go is available by interrogating the simulator; MuZero learns that, but it only needs to learn a little bit, the rules of Go, and it can do that either from offline data like human games, if diverse enough to represent all relevant state transitions/rewards, or from carefully targeted environment interactions to play games triggering parts of the rules it's unsure of, and then it's off to the races. (Indeed, an agent hypothetically could solve Go to superhuman level even without adequate data to nail down the rules, if it has enough compute, because it can treat it as a meta-learning problem: you have a distribution over possible rules of Go, which you then attempt to master many versions of; then after you are done, you sit down with a human and perhaps use your text module to enquire, "by the way, what's komi these days?" "7.5." "Ah, thanks.", and, having figured out which MDP your POMDP plopped you down in this time, then proceed to crush them. If it's not as simple as a binary uncertainty between 6.5 and 7.5 komi where you just train half the games conditional on a komi feature set to 6.5 and the other half with it set to 7.5, you can do posterior sampling and sample possible environment-models/rule-sets to diversify training over.) We can imagine this for a lot of other situations: for example, a Codex/AlphaCode model could potentially self-train by learning to emulate a REPL and then using its GPT-3-based natural language to pose itself arbitrarily many problems. (AlphaCode in particular could probably use way less than "tens of thousands" of generated samples per problem if it did some finetuning on its self-ranked solutions.)
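A minimal sketch of the rule-uncertainty idea in the parenthetical above, with hypothetical `self_play` and `update` helpers; the only point is that training is conditioned on a rule-set sampled from the agent's uncertainty, and the true rules are supplied only at deployment:

```python
import random

POSSIBLE_KOMI = [6.5, 7.5]  # the agent's remaining uncertainty about the rules

def train_under_rule_uncertainty(policy, self_play, update, n_games=100_000):
    """Hypothetical sketch: master the game under every rule-set you can't yet rule out."""
    for _ in range(n_games):
        komi = random.choice(POSSIBLE_KOMI)      # posterior sampling over rule-sets
        game = self_play(policy, komi=komi)      # self-play under the sampled rules
        update(policy, game, komi_feature=komi)  # policy is conditioned on the komi feature
    return policy

# At deployment: ask a human which komi is in force, then act with
# komi_feature set to the answer; no further training is needed.
```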

On the other hand, you could look at OA's GPT-3 French self-distillation work where it prompts for French translations from a large model and finetunes a small model on it, getting the small model to translate much better than before with no need for few-shots, and note that the sample translations of straightforward sentences can contain only very little information about the French language; the small GPT-3 can't have learned all that much about French from it. French is not an algorithm which can generate new information by being executed, nor is either GPT-3 talking to a French person, nor are they looking up French text like WebGPT can, nor can French be nailed down to a few possible tasks which can be learned autonomously with no further information; it's a completely closed circle. So why did it work? Because the small GPT-3 is dumb and isn't sure which French task it is predicting: it ensembles over countless sub-models each encoding a possible 'French task', and the right one is towards the top of the rankings, but not the very top. The finetuning on GPT-3 samples, from a GPT-3 that has figured out which French task it is predicting, is enough to boost the good sub-models to victory. This boost might have taken a very long time if you had instead kept training the small GPT-3 on random Internet data (French being a small % of it, and texts providing French-English pairs a small % of that), but the very targeted boost lets the small model match the model orders of magnitude bigger. The capability was there, lurking under the surface, even if the waters seemed calm, waiting for a motive to strike.
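A rough sketch of that self-distillation setup, with hypothetical `sample` and `finetune` helpers standing in for the actual training stack; the large model's prompted outputs become finetuning data for the small model:

```python
def distill_translations(large_model, small_model, english_sentences, sample, finetune):
    """Hypothetical sketch of prompted self-distillation for translation."""
    pairs = []
    for en in english_sentences:
        prompt = f"English to French:\n{en}\n"
        # The large model, given the right prompt, has already located the French task.
        fr = sample(large_model, prompt)
        pairs.append((prompt, fr))
    # Finetuning on these targeted pairs boosts the French sub-model already lurking
    # inside the small model; it does not teach it French from scratch.
    return finetune(small_model, pairs)
```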

This is also true of approaches like doing best-of-20 sampling on a corpus of questions, taking the majority vote (because so many samples will go awry but the majority will be more right), and retraining with the majority vote as the 'groundtruth', to try to make it generate the majority completion every time. Simply babbling 20 times and convincing yourself that the thing you babbled 11/20 times is THE TRUTH™ cannot bring in any new information, but it can penalize bad sub-models and boost good ones to lock in correctness.
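As a sketch (again with hypothetical `sample` and `finetune` helpers), the majority-vote trick looks something like this:

```python
from collections import Counter

def majority_vote_selftrain(model, questions, sample, finetune, k=20):
    """Hypothetical sketch: babble k times, crown the majority answer, retrain on it."""
    pseudo_labels = []
    for q in questions:
        answers = [sample(model, q) for _ in range(k)]
        majority, _ = Counter(answers).most_common(1)[0]
        pseudo_labels.append((q, majority))  # no new information enters here
    # Retraining boosts the sub-models that already produced the majority answer.
    return finetune(model, pseudo_labels)
```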

Like a surprise shark attack, however, this is a trick that only works once. Once the right sub-model has been boosted, then you're done. If its best is not good enough, then it's not good enough. There might be some benefits from looping a few more times to tidy things up or lock in the gains, but you cannot amplify yourself arbitrarily. (You either know your French vocab or you don't, and all the thinking in the world will not tell you the arbitrarily-decreed grammatical gender of all French nouns if you didn't at least get a hint of it during your original training.)

From this perspective, prompts like "English to French:" or "let's think this out step by step" are something of a 'prompt overhang': we (still) don't take prompt engineering seriously enough, so we think you have to have things like InstructGPT or elaborate few-shot 'Chain-of-Thought' prompts to induce inner monologue reasoning; unsurprisingly, we always underestimate the true performance of models when we do things in this careless cackhanded sort of way. (Remember: "sampling can prove the presence of knowledge but not the absence.") Hopefully over the next few years we'll get better at this, now that we know to expect these things and can begin automating them or surfacing them, and there will be fewer sharks just below the surface.

(Note to what an extent any 'filter' involved in all of this is a complete afterthought. In some cases like self-distillation of an ensemble's predictions, there is nothing even remotely like a 'filter': all inputs go in, the average of all outputs comes out.)

> But relative to other techniques like scaling, it seems less likely to lead to dangerous increases in capability.

So how dangerous is this whole family of approaches? Well, that depends...

  • Where does new data come from? Is it accessible, or old data in such quantities as to be new?

  • Is this a task where additional information can be generated by computing, like a simulated game or robotics task? Does 'scaling' here encompass one's self-play and expert-iteration?

  • Is this a task where a relatively small amount of data could enable in silico thinking, such as meta-learning?

    • Is this a task where a few very carefully-chosen, Bayes-optimal, design-of-experiment-style samples can eliminate uncertainty and let a model go from working in silico to working IRL near-instantaneously?

  • Is there an 'overhang' where a large model is smart enough to do it by default but a small model could do it and just happens to not do it (yet)?

    • Is this a task the model has already been trained on specifically and heavily, so that there are no sub-models lurking below the water's surface, and WYSIWYG?

So a concrete example would be a Codex model. It's trained to generate code. But it could also be used to discover vulnerabilities & hack stuff. A Codex model will have learned a lot about hacking simply because many patches are security fixes, many pieces of code are security tools, there's a lot of discussion in comments etc, but it's not in any way specialized for hacking; that is merely one of the myriads of capabilities it's picked up along the way as part of its pretraining. You could prompt it to try to write a hack of some code. It'll sometimes do it but especially if the prompt is unnatural, this is not very likely to work. If you scale up Codex much further, it'll get better, and it will probably do so at a rate that looks power-law/logarithmic like everything else, so you'll need to scale it 10x or 100x to get large hacking gains. It will get better because it's gotten better at - along with everything else coding related - picking up what an odd prompt means, and because it's done better at locating coding & hacking sub-models, and this may achieve your goals. But those sub-models might well be the same ones that you could boost to the surface with some self-distillation on the original small model. On the other hand, maybe it hasn't learned those sub-models well enough to make them realistically locatable and there is a hard phase transition underneath the surface which hasn't yet happened, in which case its hacking performance won't improve much at all. Which one is it?

¯\_(ツ)_/¯ I have no way of knowing. It's a purely empirical question. The 'distance' between small models and large models will depend on how large the scales are, where the phase transitions in knowledge/capability are, where you can trade compute / data, how incidental the target task is to the pretraining task which creates a gap between latent capabilities and obviously-promptable capabilities, how good your search and self-distillation techniques are...

I don't find this particularly comforting from a safety perspective because it implies that differences in size which look impressive could in fact be highly illusory. You could easily have a model 100x smaller - "wow, that seems very safe and harmless, we'd better worry about the large one" - which nevertheless can, when prompted just the right way, beat the large one.

Good points. Here is my rough summary:

1. The way a bootstrapping approach filters responses can be relevant, but often isn't the best way to think about the problem.

Consider self-play. Here, the model isn't taking in new outside information, it just keeps practicing internally until it's very good at a task.

This isn't limited to getting good at a single task. A self-play agent can practice every possible task, and once it's been assigned a real task it can simply "load" all of its training relevant to that task.

2. Generally speaking, a larger model may already have a sub-model which is performant on a specific task. Bootstrapping, fine-tuning, or ensembling answers can all potentially "unlock" that sub-model, leading to high performance from bootstrapping.

This approach only works one time; once the sub-model is found, it shouldn't improve much afterwards without new information.

Today, we haven't really used this property very well. All of those papers about how changing the prompt results in large increases in performance are a sign that large language models have these sub-models, and we could do more to unlock them.

3. Whether or not bootstrapping is dangerous depends on a lot of factors. Is there some sort of self-play that can provide performance gains without new information? Does the model contain a performant sub-model that just needs to be unlocked? Has the model already been fine-tuned enough?

Example: code-writing language models might learn how to exploit code vulnerabilities and do bad things. It might require scaling the model in order to do this, or it may be enough just to bootstrap the model. Whether or not bootstrapping can achieve this is an empirical question.

4. Overall, this could be dangerous. Small models might look safe, but could be hiding dangerous sub-models which can be unlocked using bootstrapping.


I agree with these points. Some comments:

Point #1: Self-play is an interesting case: here the model output is moves in a game, and the "filter" is the opponent (for self-play, the opponent uses the same model). The opponent discriminates between good and bad strategies by either winning or losing (and appropriately attributing that outcome to different moves/contexts).

I still think the generator-filter model can apply here.

Consider a model learning to play Go against an opponent (filter) that always chooses random moves. The model will quickly learn to beat the random opponent, but will stop improving shortly afterwards and would do poorly against a human. Once the model starts winning every time, the outcomes of the games are perfectly predictable and provide no information. In this case the filter/opponent can't discriminate between better-than-random opponents and better-than-human opponents, so the model performance stagnates.

Now consider an untrained model playing Go against a superhuman opponent. The opponent wins every time and the untrained model continues to have poor performance. Once again the games are perfectly predictable and provide no information. Now the filter can't discriminate between random opponents and human opponents, so performance never improves. (This is an extreme case; I expect models with a little bit of training to improve somewhat on contact with a better opponent.)

So self-play is a good middle-ground where the filter is well suited to discriminate between slightly-better and slightly-worse opponents, leading to steady improvement.
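Under this framing, a minimal self-play sketch (with hypothetical `play_game` and `reinforce` helpers) might look like the following; because the opponent is a copy of the current policy, the win/loss "filter" stays calibrated to distinguish slightly-better from slightly-worse play:

```python
def selfplay_round(policy, play_game, reinforce, n_games=100):
    """Hypothetical sketch: the opponent (a copy of the policy) acts as the filter."""
    for _ in range(n_games):
        moves_a, moves_b, winner = play_game(policy, policy)
        # Because both sides use the same policy, roughly half the games go each way,
        # so the win/loss signal keeps carrying information as the policy improves.
        winning_moves = moves_a if winner == "a" else moves_b
        reinforce(policy, winning_moves)
    return policy
```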

What are the limitations of self-play? Like before, the model has to generate enough good sequences in order to learn from them, and the opponent has to properly reward these sequences. If the outcomes of the game are somewhat random, this should slow down training by making the filter noisier.

For fixed model size and memory, I would expect self-play to converge to some (high) level of performance, not continue to improve indefinitely. Though I would have to think about this more.

Point #2: This roughly corresponds to my point about bootstrapping models with no filter. I would expect that the performance of the sub-model is limited by the amount of relevant training data learned before bootstrapping. The scope of "relevant training data" can be large, e.g. data on English grammar can help with French translation even if it isn't directly related to French.

Point #3: This suggests a way to get more robust AI. When deploying a fine-tuned model, make sure that it has received enough fine-tuning/bootstrapping so that it's converged in some sense. This makes it less likely that it exhibits sudden changes in performance in the real world. All else equal, smaller models with less training data are probably more stable in this regard.

> Consider a model learning to play Go against an opponent (filter) that always chooses random moves. The model will quickly learn to beat the random opponent, but will stop improving shortly afterwards and would do poorly against a human. Once the model starts winning every time, the outcomes of the games are perfectly predictable and provide no information. In this case the filter/opponent can't discriminate between better-than-random opponents and better-than-human opponents, so the model performance stagnates.

This is going to depend on what sort of model and training regime we are talking about, and how flexible you are in finding some component to label a 'filter'.

Consider an evolutionary agent like evolution strategies: model-free, policy. It mutates, rolls out games, and each mutant wins half the time initially, creating fitness gradients between winner & loser but it quickly homes in on some very simple tricks which let it defeat the random baseline ~100% of the time. Then, because there are no longer any fitness gradients, learning immediately halts there. The model successfully learns, but as little as possible. If the mutation rate (learning rate) doesn't decay, it will wander around model space, only periodically purging bad mutants to maintain minimum adequacy; given enough time maybe it'd do something like 'survival of the flattest' in finding a basin (cf. grokking) but who cares, it'll still be terrible.
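A toy version of the evolution-strategies update makes the 'no fitness gradients' point concrete: the parameter update is a fitness-weighted sum of the noise directions, so once every mutant beats the random baseline and all fitnesses are equal, the centered weights vanish and learning halts. (The win/loss fitnesses below are made up purely for illustration.)

```python
import numpy as np

def es_update(theta, fitnesses, noises, lr=0.01, sigma=0.1):
    f = np.asarray(fitnesses, dtype=float)
    advantages = f - f.mean()  # all zeros once every mutant beats the random baseline
    return theta + lr / (sigma * len(f)) * (np.asarray(noises).T @ advantages)

theta = np.zeros(3)
noises = np.random.randn(8, 3)  # one parameter perturbation per mutant
print(es_update(theta, [1, 0, 1, 1, 0, 1, 0, 1], noises))  # mixed outcomes: nonzero update
print(es_update(theta, [1, 1, 1, 1, 1, 1, 1, 1], noises))  # every mutant wins: zero update
```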

Policy gradients like PPO would also do this (probably?).

Consider a model-free value agent like DQN. It observes all of the pairs of state transitions, bootstrapping rewards. It does better than evolution strategies because it keeps propagating rewards back through moves and keeps changing strategies instead of halting as soon as it beats the baseline, and randomized games keep exposing it to new situations and errors in its value functions. It asymptotes at pretty bad play, probably, but it would be hard to predict in advance how bad, exactly: eg. we know that something like TD-Gammon can do very well but doesn't seem to do well for Go, and in retrospect, people usually tell a story about how the inherent randomization of dice in backgammon 'smooths the value function' and 'forces exploration' compared to Go/chess despite the instability of self-play/random baselines, and for any given problem/baseline opponent, I'm not sure how well a priori people would be able to predict performance.

Consider a model-based agent like MuZero learning the game rules from the random opponent. It observes all of the state transitions, infers an environment, goes off and does self-play for a long time, periodically coming back to play the random agent; sometimes it wins, sometimes it loses, and it does so deliberately because it's looking at the final reward trying to figure out what komi is. After some exploration it's done, and it bootstraps to superhuman skill. This model plays only random opponents (aside from fake hallucinated self-play games), but successfully learns.

Consider a model-based tree search agent with a simulator, like MCTS. It doesn't learn, it only plans. It ignores the random games and then uses up arbitrary amounts of compute at play-time to search so deeply it defeats the superhuman opponent. This model doesn't fail to learn because it didn't try to learn.

> Now consider an untrained model playing Go against a superhuman opponent. The opponent wins every time and the untrained model continues to have poor performance. Once again the games are perfectly predictable and provide no information.

Also depends.

Consider an evolutionary agent like evolution strategies: model-free, policy. It mutates, rolls out games, and each mutant loses its game every time, receiving final rewards of 0; with no difference in fitness across all mutants, there is no covariance with changes in the model and no evolution. This model does indeed fail to learn, and will simply jitter around randomly in model space. (Policy gradients like PPO might do something a little different depending on whether they can use baselines to define 'played better than usual in this game', like with reward shaping on length of game / territory.)

But the episodes are not uninformative even if they always result in defeat. The results of the games may be predictable (and algorithms looking only at the final return will do poorly), but the moves themselves are not. They are very informative. In fact, you are receiving about 80 very valuable labels per game: the best possible move for 80 board states.

A straight behavior-cloning model would find this very informative, and the more times it trains & plays, the better it will get: this is in fact an ideal scenario for expert iteration, because you have on hand an expert which will tell you the exact right move on every other move of every game no matter how good you get. Likewise, an AlphaGo/Zero agent will find it valuable: the superhuman opponent mercilessly samples board positions where it has misestimated something, and it needs to correct itself by deeper search.
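A sketch of the behavior-cloning case, with hypothetical `play_vs_expert` and `supervised_update` helpers: even though the learner loses every game, each game supplies a batch of expert moves to imitate.

```python
def behavior_clone(policy, expert, play_vs_expert, supervised_update, n_games=1000):
    """Hypothetical sketch: losing every game still yields ~80 expert move labels per game."""
    for _ in range(n_games):
        # `play_vs_expert` is assumed to return (board_state, move, mover) triples.
        transitions = play_vs_expert(policy, expert)
        expert_labels = [(state, move) for (state, move, mover) in transitions
                         if mover == "expert"]
        supervised_update(policy, expert_labels)  # imitate the expert's moves directly
    return policy
```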

> For fixed model size and memory, I would expect self-play to converge to some (high) level of performance, not continue to improve indefinitely. Though I would have to think about this more.

Unless the model is big enough to solve the game, it will have to asymptote. (Which is why you have to scale data/compute/size appropriately, to avoid bottlenecks.)

For investigation of the kind of thing you suggest, take a look at Anthropic's "A General Language Assistant as a Laboratory for Alignment" and more importantly "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback".

They focus on training a helpful / harmless assistant rather than good short stories, but using human-filtered model-output to improve behavior is the basic paradigm.

Thanks for the pointer, I will check that out!