I am a neuroscientist with many years of experience applying and improving operant conditioning methods for training animals (mostly mice and rats). This concept has been covered by others here and elsewhere, but here is my own version of the explanation.
I am grateful to Alex Turner (@turntrout) for feedback on an earlier draft of this post.
What is operant conditioning?
Animals learn from experience what behaviors will lead to preferred outcomes in any given context. This is predicated on some deeper general truths: some outcomes are better or worse than others for the animal's survival; which outcome occurs depends in part on the animal's own actions; and the world is regular enough that current sensory signals carry statistical information about which actions will lead to which outcomes.
Therefore, animals constantly update an implicit world model of the probabilities of potential survival-relevant outcomes of their potential actions, given recent and current sensory signals. They update it in such a way that the probability of executing behaviors that statistically tend to lead to desired outcomes increases (positive reinforcement), and the probability of executing behaviors that lead to undesired outcomes decreases (negative reinforcement). I do not mean to imply that these are conscious mental operations: animals as simple as fruit flies or sea slugs learn by operant conditioning, and humans can be conditioned to alter their behavior without realizing it has occurred.
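To make that description concrete, here is a minimal sketch of such an update loop in Python. Everything in it (the action names, the learning rate, the choice rule) is an illustrative assumption of mine, not a model of any particular nervous system:

```python
import random

# Toy sketch of implicit value learning. All names and numbers are
# illustrative assumptions, not a model of any real brain.

values = {"approach": 0.0, "avoid": 0.0}  # current estimate of each action's value
LEARNING_RATE = 0.2

def choose_action(epsilon=0.1):
    """Usually pick the best-looking action; occasionally explore."""
    if random.random() < epsilon:
        return random.choice(list(values))
    return max(values, key=values.get)

def update(action, experienced_value):
    """Shift the estimate toward what actually happened.
    Good outcomes raise the action's estimate (positive reinforcement);
    bad outcomes lower it, making the action less likely to be chosen."""
    values[action] += LEARNING_RATE * (experienced_value - values[action])
```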
An example of operant conditioning in nature
Agent: Naive Bird
Current state: Hungry
Subjective values of potential future states:
V(hungry) = -1
V(full) = +1

Stimulus cue: orange-and-black butterfly
Bird's estimated outcome probabilities contingent on available actions, given stimulus:
| Possible action | p(hungry\|action) | p(full\|action) |
| --- | --- | --- |
| Eat butterfly | 0 | 1 |
| Don’t eat it | 1 | 0 |

Bird’s expected value of its future state contingent on any given action is $EV(a) = \sum_{s} p(s \mid a)\, V(s)$, summing over the possible states $s$. Thus:
| Possible action | EV |
| --- | --- |
| Eat butterfly | +1 |
| Don’t eat it | -1 |

Bird selects the action that maximizes expected value after action: eat the butterfly.
Actual outcome: retch
Bird's updated subjective values of potential future states:
V(hungry) = -1
V(full) = +1
V(retch) = -10

Bird's updated model of reward contingencies of actions given stimulus = butterfly:
| Possible action | p(hungry\|action) | p(full\|action) | p(retch\|action) | EV\|action |
| --- | --- | --- | --- | --- |
| Eat butterfly | 0 | 0 | 1 | -10 |
| Don’t eat it | 1 | 0 | 0 | -1 |

The next time the bird sees an orange-and-black butterfly he won’t eat it, even if he’s hungry. Aversion to ingesting things that preceded being violently ill in the past is usually a one-shot learning event, and extremely difficult to override.
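The tables above are just arithmetic, so the whole example can be written out and checked directly. Here is the same computation in Python (the variable names are mine):

```python
# The bird example, written out as arithmetic from the tables above.

V = {"hungry": -1, "full": +1}  # naive bird's subjective state values

# Naive outcome probabilities, given stimulus = orange-and-black butterfly.
p_naive = {
    "eat butterfly": {"hungry": 0, "full": 1},
    "don't eat it":  {"hungry": 1, "full": 0},
}

def expected_value(outcome_probs, values):
    """EV(action) = sum over states s of p(s | action) * V(s)."""
    return sum(p * values[s] for s, p in outcome_probs.items())

for action, probs in p_naive.items():
    print(action, expected_value(probs, V))  # eat: +1, don't eat: -1

# After the retch, the bird adds a new state and updates its contingencies.
V["retch"] = -10
p_updated = {
    "eat butterfly": {"hungry": 0, "full": 0, "retch": 1},
    "don't eat it":  {"hungry": 1, "full": 0, "retch": 0},
}
for action, probs in p_updated.items():
    print(action, expected_value(probs, V))  # eat: -10, don't eat: -1
```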
Training animals by operant conditioning
Training by operant conditioning hijacks these natural learning mechanisms for human purposes. In this method, we artificially contrive an animal’s environment such that a certain cue or stimulus reliably predicts outcomes of the animal’s potential behaviors, pairing innately preferred or non-preferred outcomes with its actions in such a way that reinforces (increases or decreases the probability of) behaviors according to our preferences. If successful, we can then elicit the desired behavior by providing the cue even without the reward; or if it is a naturally occurring cue, that cue will come to elicit the desired behavior instead of whatever other behavior was innate or had previously been learned.
For example, suppose we contrive the world so that the word “sit” reliably predicts that one of a dog’s potential actions (sitting) will lead to an outcome the dog desires (a treat, praise), whereas any other action in that context fails to yield the desired outcome, or yields an undesired one. The dog will learn by trial and error that in the presence of that sensory cue (the sound of the word “sit”), sitting is, among its many options, the action that optimizes its post-action value state, and it will then be more likely to select that action in response to that cue.
This works well to the extent that the cue is reliably predictive, the animal actually prefers the outcome we pair with the desired action (or dislikes the outcome we pair with the undesired action), and the animal successfully associates its action as having been the cause of the outcome. The failure modes mostly involve breaking one or more of those requirements.
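As a sketch of how the contrived environment and the learning rule interact, here is a toy version of the “sit” scenario. The action set, reward size, learning rate, and softmax choice rule are all my own illustrative assumptions:

```python
import math
import random

ACTIONS = ["sit", "bark", "spin"]
q = {a: 0.0 for a in ACTIONS}  # dog's learned value of each action, given the cue "sit"
LEARNING_RATE = 0.3

def outcome(action):
    """The contrived environment: only sitting after the cue earns a treat."""
    return 1.0 if action == "sit" else 0.0

def choose(values, temperature=0.5):
    """Softmax choice: higher-valued actions are chosen more often."""
    weights = [math.exp(values[a] / temperature) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

for trial in range(200):
    action = choose(q)
    q[action] += LEARNING_RATE * (outcome(action) - q[action])

print(q)  # q["sit"] ends up near 1.0, so the cue now reliably elicits sitting
```

Breaking any of the requirements above corresponds to a concrete change in this loop: an unreliable cue makes outcome() noisy, a reward the animal does not actually prefer shrinks the 1.0 toward zero, and a failed action-outcome association corrupts which q[action] gets updated.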
Coda
Operant conditioning (discovered by B.F. Skinner in the 1930s) is different from classical conditioning (discovered by Ivan Pavlov in the 1890s).
In classical conditioning, animals learn to associate an outcome with a cue that consistently precedes or coincides with it. In Pavlov’s famous experiment, a bell predicted the appearance of food. After learning this, the animal changes its behavior as a result of the association (e.g., the dogs drool when they hear the bell), but the outcome (food appearing) occurs regardless of the animal’s behavior. By contrast, in operant conditioning the cue predicts what outcome will occur contingent on the animal performing a specific action; if the action is not performed, the outcome does not occur.
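The structural difference fits in one line each. In this hypothetical sketch (the cue, action, and outcome names are mine), the classical trial ignores the animal's action entirely, while the operant trial is contingent on it:

```python
# Hypothetical one-line contrast between the two protocols.

def classical_trial(cue_present, action):
    """Classical: the outcome follows the cue regardless of the action
    (the action argument is deliberately ignored)."""
    return "food" if cue_present else None

def operant_trial(cue_present, action):
    """Operant: the outcome occurs only if the animal performs the
    specific action in the presence of the cue."""
    return "food" if cue_present and action == "sit" else None
```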
Due to almost 100 years of formal operant conditioning research, combined with a much longer history of humans training domesticated animals, we have a good understanding of how this works, and many effective hacks to optimize operant conditioning success. In a follow-up post, I will speculate on how this knowledge might be usefully applied in the context of alignment tuning of LLMs or other AI models.
In this post I am using the word "value" in the sense of something being good (positive value) or bad (negative value) for survival of the animal, or at least assessed as such; closely related to "utility". I don't mean "human values" or "moral values", although I take those to be a special case of this general usage.
In this context, the distinction is not critical, but in an in-depth analysis of goal-directed action by agents it is important to distinguish them.