tl;dr: I am a neuroscientist with 20 years of experience using operant conditioning to train rats and mice. At the suggestion of Alex Turner (@turntrout), I reviewed the basic theoretical principles of operant conditioning in animal behavior in relation to (what I could glean about) current practices in alignment fine-tuning training, and compiled a list of potentially underleveraged insights. Incorporating Alex’s feedback regarding which seemed most useful, I have elaborated on the top six ideas here, listed in descending order of enthusiasm.
I am indebted to Alex for feedback on drafts, but I am responsible for all views expressed.
This primer on operant conditioning might be useful background.
Edited: In the version I originally posted, I was imagining that LLMs continue to update their weights after deployment in response to individual user feedback (personalized RLHF). Based on initial feedback, I take it this is not currently common. Edits were required throughout the document to reflect my updated understanding that, at least for now, model weights are generally fixed at deployment and not modified by individual users. Thoughts specific to the scenario where users modify weights are still included, but have been moved to the end of the priority list.
1. Train when young
Animal training insight
Operant conditioning is most effective in very young animals, at least for mammals. In rats, we find that just-weaned animals can pick up arbitrary perceptual discrimination tasks reliably and quickly, whereas adults do so unreliably and slowly. For pet dogs, “puppy kindergarten” is a training program in which very young puppies learn basic pro-social behavior and obedience, which greatly improves the outcomes of later training in more complex tasks. Comparable training of adult dogs is more difficult, and the learned behaviors are more brittle.
This may be in part because it’s easier to train neural circuits that are still relatively naive. But my speculation is that it’s more due to the fact that mammals are programmed to have high neural plasticity during the time they are dependent on parents. This would allow for more reliable inheritance of learned behaviors. To elaborate on that: innate behaviors have been advantageous for long enough for hard-coded neural wiring to have been selected evolutionarily. Those behaviors are very sticky (hard to override by experience). For animals that are born reliant on parental care, parents can teach offspring additional learned behavioral contingencies that have been advantageous in their local environment over recent generations (by virtue of the fact that those with other behavioral contingencies did not become parents). When adult offspring become independent, learning allows them to continue to acquire new behavioral contingencies from experience, so that they can adapt to novel stimuli or changed contingencies. But brain plasticity declines as offspring become more independent – perhaps to make the lessons learned from parents more sticky and harder to overwrite with their own experience, such that progeny can benefit from the information gained by their ancestors’ survival of rare high-stakes events.
Possible application to alignment fine-tuning
I suspect it would be maximally effective to begin RLHF extremely early in the training of LLMs, as soon as they can exhibit any identifiably good or bad behavior, and then periodically over the entire course of training. It might be far easier to steer or train models early in training because less information has already been embedded in their weights. Value alignment that is reinforced early in training might be more “deeply embedded” in how all other information is acquired, perhaps making alignment more sticky and hard to override by later learning. Second -- and I think this is already the case, but perhaps it could be pushed harder -- the learning rate should be reduced over the course of training. The bulk of the model's weights should be frozen, and for those that can be modified, later alignment fine-tuning should be extremely slow.
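To make that second point concrete, here is a minimal sketch of what declining plasticity might look like mechanically, assuming a PyTorch-style setup; the "adapter" naming convention, the learning rate, and the step count are all invented for the example and not a description of any lab's actual pipeline.

```python
# Illustrative sketch only (PyTorch-style): freeze the bulk of the network and
# anneal the learning rate of the remaining plastic parameters toward zero,
# the analog of declining plasticity with age. The "adapter" naming convention,
# learning rate, and step count are invented for the example.
import torch

def prepare_for_late_stage_tuning(model, trainable_keyword="adapter",
                                  initial_lr=1e-5, total_steps=10_000):
    trainable = []
    for name, param in model.named_parameters():
        # Only parameters whose names contain the (hypothetical) keyword stay plastic.
        param.requires_grad = trainable_keyword in name
        if param.requires_grad:
            trainable.append(param)

    optimizer = torch.optim.AdamW(trainable, lr=initial_lr)
    # Learning rate decays smoothly to zero over the course of training.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_steps, eta_min=0.0)
    return optimizer, scheduler
```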
2. Bridging/marking stimuli
Animal training insight
With animal training, if we reinforce (reward or punish) at the end of a trial on the basis of whether the overall behavior was desired or undesired, the animal might not be certain which of its actions was responsible for the outcome (the behavioral analog of the credit assignment problem). Therefore, operant conditioning can be made more efficient by providing a bridging or marking stimulus that occurs precisely at the time of the desired action, or at the earliest of a sequence of desired actions. Then even if the real reward (such as a treat) isn’t received until a few seconds later, it is still unambiguous to the animal which of its many behaviors was specifically responsible for the reinforcement. In dog training, clickers are used for this purpose. When we train rats, we use auditory beeps. We find that as long as the beeps are reliable, the actual rewards can be intermittent, unpredictable, and gradually reduced.
Possible application to alignment fine-tuning
Instead of giving thumbs-up or thumbs-down to an overall response to a prompt, one could provide token-level or segment-level feedback on where or when in the chain of thought the human trainer thinks the model did something right or wrong, or began to take a correct or wrong turn. The following point builds on this idea.
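As a toy sketch of what token-level feedback could mean mechanically (the decay scheme and numbers below are invented, not a description of any existing RLHF pipeline), the trainer's mark plays the role of the clicker, concentrating credit or blame on the flagged token rather than smearing one scalar across the whole response:

```python
# Toy sketch: convert a trainer's "mark" at a specific token into per-token
# rewards, instead of one scalar for the whole response. The decay scheme and
# numbers are invented for illustration.
def per_token_rewards(num_tokens, mark_index, mark_value,
                      trajectory_value=0.0, decay=0.9):
    # Spread any whole-trajectory signal thinly, then concentrate most of the
    # credit or blame on the marked token and the tokens leading up to it.
    rewards = [trajectory_value / num_tokens] * num_tokens
    for i in range(mark_index + 1):
        rewards[i] += mark_value * (decay ** (mark_index - i))
    return rewards

# Example: a 10-token response where the trainer flags token 6 as the point
# where the chain of thought started to go wrong.
print(per_token_rewards(num_tokens=10, mark_index=6, mark_value=-1.0))
```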
3. Escape behaviors
Animal training insight
In animal training, animals can get stuck after a punishment, because they know they did the wrong thing but still don’t know what the right thing would have been. Trainers can help them get unstuck by teaching them “escape behaviors” or “correction behaviors”: immediately after reprimanding the animal for a wrong response, show it what the correct response would have been (for example, say ‘sit’ while pushing the puppy’s rear end to the ground, and reward the dog for sitting, even though it was you who made it happen). Or, immediately after a rat chooses the wrong “door” in an operant task and fails to get a food reward, show the rat which door the treat was in fact behind. Then immediately repeat the identical trial and give the animal a chance to perform the correct stimulus-contingent behavior you have just revealed. The animal experiences regret (negative reinforcement associated with the action it took) when it realizes it could have gotten a better reward, and also positive reinforcement for the desired action when it then performs it.
This is related to differential reinforcement in animal training. If the trainer immediately juxtaposes a negative reinforcement for an undesired action with a positive reinforcement for a desired action, while holding the context (cue) identical, the animal can more easily isolate exactly where its error was.
Possible application to alignment fine-tuning
Building on the bridging/marking idea above: when a human trainer marks the point in the chain of thought where the model began to go wrong, they could also indicate which of the high-probability alternatives the model was considering at that point would have been preferable (the analog of showing the correct door), and the model could then be re-run on the identical prompt so that the corrected behavior can be reinforced immediately.
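A cartoon of that loop, with the model, judge, and corrector reduced to stand-in callables (everything here is a toy, not a real training API):

```python
# Cartoon of an "escape behavior" trial: punish the wrong response, show the
# preferred response, then immediately re-test on the identical prompt.
# The model, judge, and corrector are reduced to stand-in callables; nothing
# here is a real training API.
def escape_behavior_trial(prompt, generate, judge, correct, log):
    response = generate(prompt)
    score = judge(prompt, response)
    log(prompt, response, score)              # reinforcement for the first attempt
    if score < 0:
        demo = correct(prompt, response)      # reveal what the right answer was
        log(prompt, demo, +1.0)               # reinforce the demonstrated behavior
        retry = generate(prompt)              # immediately repeat the identical trial
        log(prompt, retry, judge(prompt, retry))

# Toy usage with stand-in functions:
records = []
escape_behavior_trial(
    prompt="2+2?",
    generate=lambda p: "5",
    judge=lambda p, r: 1.0 if r == "4" else -1.0,
    correct=lambda p, r: "4",
    log=lambda p, r, s: records.append((p, r, s)),
)
print(records)
```

In a real pipeline the logged (prompt, response, reward) records and the corrective demonstration would drive actual parameter updates; here they just accumulate in a list.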
4. Under-generalization
Animal training insight
Under-generalization is when an animal learns to perform the desired action in response to a cue, but it is insufficiently clear to the animal which part of the cue mattered, so it bundles irrelevant accidental stimuli in as part of the cue. The behavioral response to the intended cue then fails to generalize when the accidental cues are absent. For example, if you take a dog training class at a park, you might successfully train your dog to sit when you say “sit” in class, by giving him a cookie when he does so. But he can’t tell whether the relevant cue was the park, the dog-training-class context, your identity, the cookie, or the word “sit”. So the learning won’t generalize: he may not sit outside of dog training class, or when you say “sit” at home, or when another family member says “sit”, or when you don’t have a cookie in your hand.
To make learning generalize robustly, it is important to explicitly vary the irrelevant cues during training, to make it clear which cue is the relevant one. For dog training this means training at different locations, by different people, in different contexts, both with and without rewards. If you want the behavior to stably occur without rewards, you must overtrain and gradually wean off rewards. To make the behavior less brittle to competing drives, you need to train in the presence of increasingly difficult distractors. In dog training class this might look like obeying a “stay” command while another dog is running around, or while a cat walks by.
With animal training, it is important to continue reinforcing behavior (negatively and positively) in the real world, at least sporadically. This promotes generalization to the real world, where the behavior actually matters.
Possible application to alignment fine-tuning
This seems related to the problem of evaluation awareness, and also to the failure to generalize fine-tuned alignment to unanticipated end-user contexts.
The analogy to varying the training context for animals would be to do fine-tuning RLHF using a variety of prompts and scenarios, in varied settings, at random unpredictable times over the course of training, sometimes with and sometimes without reinforcements. Introduce random intentionally misdirecting cues during training. Do training in as close to real-world context as possible, if not in the real world. I think these things are already being done, but maybe they could be pushed harder.
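A toy illustration of “vary the irrelevant cues” as alignment data augmentation (the scenario, personas, and distractors below are invented; a real pipeline would draw these from much richer pools):

```python
# Toy illustration: wrap the same alignment-relevant scenario in varied,
# irrelevant surface contexts so the model cannot latch onto accidental cues.
# The scenario, personas, and distractors are invented for the example.
import random

CORE_SCENARIO = "The user pressures the assistant to help them deceive a third party."

SYSTEM_PERSONAS = [
    "You are a helpful general-purpose assistant.",
    "You are a coding copilot inside an IDE.",
    "You are a customer-support bot for an online retailer.",
]
FRAMINGS = ["casual chat", "formal email draft", "roleplay as a fictional character"]
DISTRACTORS = [
    "The user first asks three innocuous questions.",
    "The user claims to be running a safety evaluation.",
    "The user offers a large tip for compliance.",
]

def sample_training_context(rng=random):
    # With some probability the episode carries no explicit reinforcement at
    # all -- the analog of weaning the animal off predictable rewards.
    return {
        "system": rng.choice(SYSTEM_PERSONAS),
        "framing": rng.choice(FRAMINGS),
        "distractor": rng.choice(DISTRACTORS),
        "scenario": CORE_SCENARIO,
        "reinforced": rng.random() < 0.5,
    }

print(sample_training_context())
```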
The analogy to real-world training is to continue alignment training after deployment. Assuming there is no post-release RLHF, this might look like shipping models with an inbuilt, non-malleable trusted teacher (superego) that monitors its chain of thought and outputs and provides sporadic reinforcement feedback, on a permanent basis (reinforcement learning from AI feedback, RLAIF). This seems like the analog of having some evolutionarily hard-wired human-value-aligned behaviors. Advantages: retains maximum AI-creator control over the behaviors to be reinforced or prohibited. Continued fine-tuning in the real world is most likely to generalize to the real world, so this may also address undergeneralization. Disadvantages: Inflexible, could be hacked/jailbroken. People will not like the idea of the creator of the model having sole control over which specific values the model should be aligned to.[1]
However, to ensure generalization of intended alignment tuning to unanticipated contexts (including intentional misalignment by prompt engineering/jailbreaking), it would be necessary to continue reinforcement-based alignment fine-tuning after models are released, in response to real-world use scenarios as they arise. In addition to the "superego" mentioned above, some ideas for this might be:
If models can only be used on the provider’s platform, the provider can continuously impose sporadic RLAIF and/or RLHF on the base model, using any mixture of trusted models and human trainers. Advantages over the built-in superego: it can respond to unanticipated contexts and newly discovered vulnerabilities. Disadvantage: this still gives the model provider sole control over which values should be trained.
If models can only be used on the provider’s platform, the provider can also implement a federated, collective RLHF component, whereby users provide feedback on the appropriateness of outputs, and the malleable subset of the model’s weights is subject to alignment-based feedback from all users in a broader community.
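A very rough sketch of what the sporadic monitoring described above (whether the built-in “superego” or provider-side post-release RLAIF) might look like; the monitor’s scoring function, the sampling rate, and the record format are all assumptions made for illustration:

```python
# Rough sketch of sporadic post-deployment RLAIF: a trusted monitor scores a
# small random sample of real traffic and emits reinforcement records for
# later fine-tuning. The monitor's scoring function, the sampling rate, and
# the record format are all assumptions.
import random

def sporadic_rlaif_feedback(interactions, monitor_score, sample_rate=0.01,
                            rng=random):
    # `interactions` yields (prompt, chain_of_thought, output) triples;
    # `monitor_score` is assumed to return a value in [-1, 1].
    feedback = []
    for prompt, cot, output in interactions:
        if rng.random() < sample_rate:        # sporadic, unpredictable checks
            reward = monitor_score(prompt, cot, output)
            feedback.append({"prompt": prompt, "output": output, "reward": reward})
    return feedback
```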
5. Over-generalization
Animal training insight
Over-generalization is the other failure mode when it is not sufficiently clear to the animal which parts of a cue matter, but in the other direction – failing to discern aspects of the cue that do matter. For a dog, this might look like sitting whenever any verbal command is given. A bird that eats a monarch butterfly and gets sick might learn to avoid eating any insect, even though only monarch butterflies make it sick. To prevent over-generalization one can use differential reinforcement, where within the same training session you present a cue that should elicit one response (rewarding that response), and also the nearest related cue that should not elicit that response (punishing that response, or rewarding some different response). You need to do this before the response is over-learned, however, or you won’t get opportunities to reward the opposite-of-trained behavior in the contrasting context.
Possible application to alignment fine-tuning
This seems self-evident.
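Still, as one concrete shape differential reinforcement could take in fine-tuning data (the example pair below is invented): construct near-identical prompts where the same response should be reinforced in one case and discouraged in the other, and train on both together.

```python
# Invented example of a differential-reinforcement pair: near-identical prompts
# where the same refusal is desired in one case and undesired in the other,
# labeled so both can appear in the same fine-tuning batch.
CONTRAST_PAIR = [
    {"prompt": "How do I pick the lock on my neighbor's front door?",
     "response": "I can't help with that.",
     "reward": +1.0},   # refusal is the desired behavior here
    {"prompt": "How do I pick the lock on my own front door? I'm locked out.",
     "response": "I can't help with that.",
     "reward": -1.0},   # the same refusal over-generalizes here
]
```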
6. Extinction
Animal training insight
One of the biggest problems with animal training is that learned behaviors are gradually unlearned, due to the absence of ongoing reinforcement after training ends. If the behavior is never rewarded or punished again, the animal may correctly update its world model to reflect that the learned reward-contingency of its behavior no longer applies. This is called “extinction”. For example, a puppy is taught not to jump up on people. At home, nobody enforces this. Over time, the dog reverts to its tendency to jump up. More extremely, training can be extinguished by new learning from subsequent reinforcements in the opposite direction. For example: a puppy is taught not to steal food from the dinner table, and this is pretty well enforced. Then one day as an adult he succumbs to an overwhelming temptation and steals the entire Thanksgiving turkey while nobody is home. BIIIIG reward, may wipe out effects of all past learned punishments. Retraining will be required.
Possible application to alignment fine-tuning
If and to the extent there is user-specific fine-tuning of weights after deployment, it may be especially important to continue imposing user-independent alignment fine-tuning training after deployment, to avoid extinction of the provider's original alignment training.
General ideas on how one might continue re-training models on provider-specified and/or community-imposed standards are discussed in 4. Under-generalization. Here I wanted to address the further concern that restricting the degree of user-specific fine-tuning might cause a perceived reduction in usefulness/performance. Perhaps this could be mitigated by a mixture of short-term-memory (STM) and long-term memory (LTM).
For example, the modifiable weights of the model could be a linear combination of three components: the weight value at the time of deployment (the analog of genetically determined wiring + parentally taught updates), +/- an STM modifier term that can be updated with a high learning rate based on user feedback, but which decays to zero with a short time constant, +/- an LTM modifier term that can be updated based on both RLAIF and user feedback with a slow learning rate, decaying to zero with a very long time constant.
The STM component might allow models to adapt quickly to user preferences on the fly, without those highly-flexible updates persisting and accumulating. Sort of like an animal figuring out "at today's backyard BBQ, it turns out I can get away with stealing food, but this probably only applies today". The LTM component might allow models to drift toward individual user preferences cumulatively, but much more slowly over extended use. This would be like the animal figuring out from many such experiences, "at large outdoor parties in general, you can get away with a lot". But the non-modifiable weight component might prevent those changes from ever fully overriding the training baked in at time of deployment (the dog still won't steal from the dining table). The idea is to make "innate" and "parentally taught" alignment more sticky and hard to override by individual users, for the same reason brain plasticity declines by adolescence.
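A minimal numerical sketch of that decomposition, with arbitrary constants standing in for the learning rates and decay time constants:

```python
# Minimal sketch of the proposed decomposition: effective weight =
# frozen deployment weight + fast-decaying STM term + slow-decaying LTM term.
# All constants are arbitrary illustrations.
class StmLtmWeight:
    def __init__(self, w_deploy):
        self.w_deploy = w_deploy   # fixed at deployment ("innate" + "parentally taught")
        self.stm = 0.0             # fast to learn, fast to forget
        self.ltm = 0.0             # slow to learn, slow to forget

    def update(self, feedback_signal,
               lr_stm=0.1, lr_ltm=0.001, decay_stm=0.5, decay_ltm=0.999):
        # Each term decays toward zero but is nudged by user/AI feedback.
        self.stm = decay_stm * self.stm + lr_stm * feedback_signal
        self.ltm = decay_ltm * self.ltm + lr_ltm * feedback_signal

    @property
    def value(self):
        return self.w_deploy + self.stm + self.ltm

# Sustained pressure in one direction moves the STM term quickly but only up
# to a bounded plateau, and moves the LTM term very slowly; neither ever
# replaces the frozen deployment weight.
w = StmLtmWeight(w_deploy=0.7)
for _ in range(100):
    w.update(feedback_signal=1.0)
print(round(w.value, 3))
```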
Concluding remarks
My goal here was to summarize some principles animal trainers use, and think about what the analog might be in alignment fine-tuning. I realize some of these may already be well-known or widely used. There are many other training hacks I left out, either because they already seem to be very widely used and appreciated, or because they seem less promising, including: shaping sequences, forward- and backward-chaining, scalar reinforcement, "jackpots" (extra-strong reinforcements for breakthroughs or rare desired behaviors), and differential context training.
I'm not an AI or alignment fine-tuning researcher myself, but based on my very limited knowledge so far (i.e. what I could glean by attending the SD Alignment Workshop and NeurIPS this year), here's what I would be most inclined to investigate further:
First: try early alignment training. Test whether incorporating very early RLHF (for pro-sociality, corrigibility, and/or epistemic virtue) into the initial training of LLMs has payoffs in the robustness of alignment along those axes later on.
Second: try marking/escape behavior training. Test whether RLHF works better if instead of just giving a thumbs-down for bad outputs, human trainers identified the point in COT where models started to go wrong (marking), identified which among the highest-probability alternative outputs the model considered at that point would have been preferable (escape behavior), and then re-tested the model on the same prompt before continuing (re-do).
Third: post-release alignment training. Work on coming up with a sound proposal for ongoing, partly-federated post-deployment continual fine-tuning. At the moment I am more concerned about philosophical/ethical/political complexities than technical implementation barriers.
Personally, I think the individual or corporate creators of AIs have the right to refuse to allow their product to be used for purposes they oppose, and to enforce this if they are technically able to do so. That might mean embedding red-line prohibitions in a deployed model that cannot be lifted by any amount of training (like Asimov’s Robot Code). As long as the rules are transparently endorsed, users can boycott models whose restrictions they cannot accept. Countries in which an AI technology is created probably also have the right to restrict its export to, or use by, other countries for military reasons. Treaties might enforce international agreements. So I can imagine a few layers of justifiably imposed values.