I am a neuroscientist with many years of experience applying and improving operant conditioning methods for training animals (mostly mice and rats). This concept has been covered by others here and elsewhere, but here is my version of how to explain it.
I am grateful to Alex Turner (@turntrout) for feedback on an earlier draft of this post.
What is operant conditioning?
Animals learn from experience what behaviors will lead to preferred outcomes in any given context. This is predicated on some deeper general truths:
animals generally have more than one action they can take in a given situation
different actions change the probabilities of different outcome states
different outcome states have different expected values[1] for the animal, both in the sense that some outcomes are preferred by the animal over others, and in the sense that some are in fact better or worse for the animal's survival than others[2].
the probabilities of different outcomes of a given action are context-dependent
sensory signals carry information (cues) about the probable outcomes of potential actions in a given context
some sensory cues predict the survival-significance of outcomes given actions sufficiently stably (the prediction applies always and everywhere) and universally (all agents would have the same preferred outcomes) that default actions in response to those cues (reflexes) have been embedded as hardwired behavioral responses over evolution (e.g. blink reflex, withdrawal reflex).
more often, the multi-step mapping (from sensory cues, to predicted outcomes of potential actions, to expected values of those outcomes) is sufficiently variable over space and/or time or unique to individual circumstances that it is more adaptive to be able to update one’s world model of which action is best given which sensory cues, on the basis of lived experience.
Therefore, animals constantly update an implicit world model of the probabilities of potential survival-relevant outcomes of their potential actions given recent/current sensory signals, in such a way that increases the probability of executing those behaviors that statistically tend to lead to desired outcomes (positive reinforcement) and decreases the probability of executing behaviors that lead to undesired outcomes (punishment). I do not mean to imply that these are conscious mental operations. Animals as simple as fruit flies or sea slugs learn by operant conditioning, and humans can be conditioned to alter their behavior without realizing this has occurred.
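To make the shape of that claim concrete, here is a minimal Python sketch of such an update loop. Everything in it (the class, the greedy action choice, the incremental update rule, the learning rate) is an illustrative assumption of mine, not a description of the underlying neural machinery.

```python
# A minimal sketch: the agent keeps per-cue estimates of P(outcome | action)
# and subjective values V(outcome), picks the action with the highest expected
# value, then nudges its estimates toward what actually happened.

class Agent:
    def __init__(self, values, learning_rate=0.5):
        self.values = dict(values)   # V(outcome), e.g. {"full": +1, "hungry": -1}
        self.lr = learning_rate
        self.p = {}                  # self.p[(cue, action)][outcome] = estimated probability

    def expected_value(self, cue, action):
        probs = self.p.get((cue, action), {})
        return sum(prob * self.values.get(outcome, 0.0) for outcome, prob in probs.items())

    def choose(self, cue, actions):
        # Select the action that maximizes expected post-action value.
        return max(actions, key=lambda a: self.expected_value(cue, a))

    def update(self, cue, action, observed_outcome):
        # Move the estimated outcome distribution for this (cue, action)
        # pair toward the outcome that was actually observed.
        probs = self.p.setdefault((cue, action), {})
        for outcome in set(probs) | {observed_outcome}:
            target = 1.0 if outcome == observed_outcome else 0.0
            probs[outcome] = probs.get(outcome, 0.0) + self.lr * (target - probs.get(outcome, 0.0))
```

The greedy choice and the simple incremental update are deliberate simplifications (real animals also explore); nothing in the argument depends on this particular implementation.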
An example of operant conditioning in nature
Agent: Naive Bird
Current state: Hungry
Subjective values of potential future states:
V(hungry) = -1
V(full) = +1
Stimulus cue: orange-and-black butterfly
Bird's estimated outcome probabilities contingent on available actions, given stimulus:
| Possible Action | p(hungry\|action) | p(full\|action) |
| --- | --- | --- |
| Eat butterfly | 0 | 1 |
| Don’t eat it | 1 | 0 |
Bird’s expected value of its future state contingent on any given action is $EV_{\mathrm{action}} = \sum_{\mathrm{outcomes}} P(\mathrm{outcome} \mid \mathrm{action})\, V(\mathrm{outcome})$. Thus:
| Possible Action | EV |
| --- | --- |
| Eat butterfly | +1 |
| Don’t eat it | -1 |
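To make the arithmetic explicit, here is a minimal Python transcription of the tables above; the dictionary and function names are mine, chosen only for readability.

```python
values = {"hungry": -1, "full": +1}   # V(outcome)

# P(outcome | action), given the stimulus "orange-and-black butterfly"
outcome_probs = {
    "eat butterfly": {"hungry": 0, "full": 1},
    "don't eat it":  {"hungry": 1, "full": 0},
}

def expected_value(probs, values):
    # EV(action) = sum over outcomes of P(outcome | action) * V(outcome)
    return sum(p * values[o] for o, p in probs.items())

for action, probs in outcome_probs.items():
    print(action, expected_value(probs, values))
# eat butterfly 1
# don't eat it -1
```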
Bird selects action that maximizes expected value after action: eat the butterfly
Actual outcome: retch
Bird's updated subjective values of potential future states:
V(hungry) = -1
V(full) = +1
V(retch) = -10
Bird's updated model of reward contingencies of actions given stimulus = butterfly:
| Possible Action | p(hungry\|action) | p(full\|action) | p(retch\|action) | EV\|action |
| --- | --- | --- | --- | --- |
| Eat butterfly | 0 | 0 | 1 | -10 |
| Don’t eat it | 1 | 0 | 0 | -1 |
The next time the bird sees an orange-and-black butterfly he won’t eat it, even if he’s hungry. Aversion to ingesting things that preceded being violently ill in the past is usually a one-shot learning event, and extremely difficult to override.
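The same sketch, rerun with the post-retch numbers (again just transcribing the updated tables above), shows the decision flipping:

```python
values = {"hungry": -1, "full": +1, "retch": -10}   # V(outcome) after the update

# Updated P(outcome | action) for the stimulus "orange-and-black butterfly"
outcome_probs = {
    "eat butterfly": {"hungry": 0, "full": 0, "retch": 1},
    "don't eat it":  {"hungry": 1, "full": 0, "retch": 0},
}

def expected_value(probs, values):
    return sum(p * values[o] for o, p in probs.items())

# The EV ordering has flipped: not eating (-1) now beats eating (-10).
best = max(outcome_probs, key=lambda a: expected_value(outcome_probs[a], values))
print(best)   # don't eat it
```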
Training animals by operant conditioning
Training by operant conditioning hijacks these natural learning mechanisms for human purposes. In this method, we artificially contrive an animal’s environment such that a certain cue or stimulus reliably predicts outcomes of the animal’s potential behaviors, pairing innately preferred or non-preferred outcomes with its actions in such a way that reinforces (increases or decreases the probability of) behaviors according to our preferences. If successful, we can then elicit the desired behavior by providing the cue even without the reward; or if it is a naturally occurring cue, that cue will come to elicit the desired behavior instead of whatever other behavior was innate or had previously been learned.
For example, if we contrive the world so that the word “sit” reliably predicts that one of a dog’s potential actions (sitting) will lead to an outcome the dog desires (treat, praise), whereas any other action in that context fails to yield that desired outcome, or yields an undesired outcome, the dog will learn by trial and error that in the presence of that sensory cue (sound of the word “sit”) the action (sitting) is, among its many options, the one that optimizes its post-action value state, and the dog will then be more likely to select that action in response to that cue.
This works well to the extent that the cue is reliably predictive; that the animal actually prefers the outcome we pair with the desired action (or dislikes the outcome we pair with the undesired action); and that the animal correctly attributes the outcome to its own action. The failure modes mostly involve breaking one or more of those requirements.
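As a toy illustration of that trial-and-error process, here is a small simulation of the "sit" example. The action set, reward function, learning rate, and exploration scheme are my own stand-ins, not a recipe for actual dog training.

```python
import random

# The learner mostly repeats whatever has paid off so far in the presence of
# the cue "sit", occasionally trying something else; only sitting is rewarded.
random.seed(0)

actions = ["sit", "bark", "wander"]
q = {a: 0.0 for a in actions}   # learned value of each action given the cue "sit"
lr, epsilon = 0.3, 0.2          # learning rate and exploration probability

def reward(action):
    return 1.0 if action == "sit" else 0.0   # treat/praise only for sitting

for trial in range(50):
    if random.random() < epsilon:
        a = random.choice(actions)        # explore
    else:
        a = max(q, key=q.get)             # exploit what has worked
    q[a] += lr * (reward(a) - q[a])       # update toward the obtained outcome

print(max(q, key=q.get))   # sit
```

Weakening any of the requirements above (for instance, making the reward only loosely coupled to the action) correspondingly slows or prevents convergence on "sit".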
Coda
Operant conditioning (discovered by B.F. Skinner in the 1930s) is different from classical conditioning (discovered by Ivan Pavlov in the 1890s).
Pavlov's dog gets food after a bell rings, no matter what the dog does.
In classical conditioning, animals learn to associate an outcome with a cue that consistently precedes or coincides with it. In Pavlov's famous experiment, a bell predicted the appearance of food. After learning this association, the animal changes its behavior (e.g., the dog drools when it hears the bell), but the outcome (food appearing) occurs regardless of the animal’s behavior. By contrast, in operant conditioning the cue predicts what outcome will occur contingent on the animal performing a specific action. If the action is not performed, the outcome does not occur.
Skinner's pigeons will get a food pellet only if they peck the correct location. The visual cue informs them which location is correct in that trial.
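The structural difference can be written out explicitly. In this sketch the function and variable names are mine and purely illustrative:

```python
def classical_trial(dog_action):
    # Pavlov: the bell is followed by food no matter what the dog does;
    # the dog_action argument is deliberately ignored.
    cue = "bell"
    outcome = "food"
    return cue, outcome

def operant_trial(pigeon_action, correct_location="left"):
    # Skinner: the cue signals which action pays off, and the outcome occurs
    # only if the animal actually performs that action.
    cue = f"light over {correct_location} key"
    outcome = "food pellet" if pigeon_action == ("peck", correct_location) else None
    return cue, outcome

print(classical_trial("drool"))              # ('bell', 'food')
print(classical_trial("ignore the bell"))    # ('bell', 'food')
print(operant_trial(("peck", "left")))       # ('light over left key', 'food pellet')
print(operant_trial(("peck", "right")))      # ('light over left key', None)
```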
Due to almost 100 years of formal operant conditioning research, combined with a much longer history of humans training domesticated animals, we have a good understanding of how this works, and many effective hacks to optimize operant conditioning success. In a follow-up post, I will speculate on how this knowledge might be usefully applied in the context of alignment tuning of LLMs or other AI models.
[1] In this post I am using the word "value" in the sense of something being good (positive value) or bad (negative value) for survival of the animal, or at least assessed as such; closely related to "utility". I don't mean "human values" or "moral values", although I take those to be a special case of this general usage.
[2] In this context this distinction is not critical, but in an in-depth analysis of goal-directed action of agents it is important to distinguish them.