Matt Botvinick on the spontaneous emergence of learning algorithms

by Adam Scholl4 min read12th Aug 202088 comments


Ω 45

Mesa-OptimizationNeuroscienceNeocortexMachine LearningInner AlignmentEmergent BehaviorAI
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Matt Botvinick is Director of Neuroscience Research at DeepMind. In this interview, he discusses results from a 2018 paper which describe conditions under which reinforcement learning algorithms will spontaneously give rise to separate full-fledged reinforcement learning algorithms that differ from the original. Here are some notes I gathered from the interview and paper:

Initial Observation

At some point, a group of DeepMind researchers in Botvinick’s group noticed that when they trained a RNN using RL on a series of related tasks, the RNN itself instantiated a separate reinforcement learning algorithm. These researchers weren’t trying to design a meta-learning algorithm—apparently, to their surprise, this just spontaneously happened. As Botvinick describes it, they started “with just one learning algorithm, and then another learning algorithm kind of... emerges, out of, like out of thin air”:

"What happens... it seemed almost magical to us, when we first started realizing what was going on—the slow learning algorithm, which was just kind of adjusting the synaptic weights, those slow synaptic changes give rise to a network dynamics, and the dynamics themselves turn into a learning algorithm.”

Other versions of this basic architecture—e.g., using slot-based memory instead of RNNs—seemed to produce the same basic phenomenon, which they termed "meta-RL." So they concluded that all that’s needed for a system to give rise to meta-RL are three very general properties: the system must 1) have memory, 2) whose weights are trained by a RL algorithm, 3) on a sequence of similar input data.

From Botvinick’s description, it sounds to me like he thinks [learning algorithms that find/instantiate other learning algorithms] is a strong attractor in the space of possible learning algorithms:

“'s something that just happens. In a sense, you can't avoid this happening. If you have a system that has memory, and the function of that memory is shaped by reinforcement learning, and this system is trained on a series of interrelated tasks, this is going to happen. You can't stop it."

Search for Biological Analogue

This system reminded some of the neuroscientists in Botvinick’s group of features observed in brains. For example, like RNNs, the human prefrontal cortex (PFC) is highly recurrent, and the RL and RNN memory systems in their meta-RL model reminded them of “synaptic memory” and “activity-based memory.” They decided to look for evidence of meta-RL occuring in brains, since finding a neural analogue of the technique would provide some evidence they were on the right track, i.e. that the technique might scale to solving highly complex tasks.

They think they found one. In short, they think that part of the dopamine system (DA) is a full-fledged reinforcement learning algorithm, which trains/gives rise to another full-fledged, free-standing reinforcement learning algorithm in PFC, in basically the same way (and for the same reason) the RL-trained RNNs spawned separate learning algorithms in their experiments.

As I understand it, their story goes as follows:

The PFC, along with the bits of basal ganglia and thalamic nuclei it connects to, forms a RNN. Its inputs are sensory percepts, and information about past actions and rewards. Its outputs are actions, and estimates of state value.

DA[1] is a RL algorithm that feeds reward prediction error to PFC. Historically, people assumed the purpose of sending this prediction error was to update PFC’s synaptic weights. Wang et al. agree that this happens, but argue that the principle purpose of sending prediction error is to cause the creation of “a second RL algorithm, implemented entirely in the prefrontal network’s activation dynamics.” That is, they think DA mostly stores its model in synaptic memory, while PFC mostly stores it in activity-based memory (i.e. directly in the dopamine distributions).[2]

What’s the case for this story? They cite a variety of neuroscience findings as evidence for parts of this hypothesis, many of which involve doing horrible things to monkeys, and some of which they simulate using their meta-RL model to demonstrate that it gives similar results. These points stood out most to me:

Does RL occur in the PFC?

Some scientists implanted neuroimaging devices in the PFCs of monkeys, then sat the monkeys in front of two screens with changing images, and rewarded them with juice when they stared at whichever screen was displaying a particular image. The probabilities of each image leading to juice-delivery periodically changed, causing the monkeys to update their policies. Neurons in their PFCs appeared to exhibit RL-like computation—that is, to use information about the monkey’s past choices (and associated rewards) to calculate the expected value of actions, objects and states.

Wang et al. simulated this task using their meta-RL system. They trained a RNN on the changing-images task using RL; when run, it apparently demonstrated similar performance as the monkeys, and when they inspected it they found units that similarly seemed to encode EV estimates based on prior experience, continually adjust the action policy, etc.

Interestingly, the system continued to improve its performance even once its weights were fixed, which they take to imply that the learning which led to improved performance could only have occured within the activation patterns of the recurrent network.[3]

Can the two RL algorithms diverge?

When humans perform two-armed bandit tasks where payoff probabilities oscillate between stable and volatile, they increase their learning rate during volatile periods, and decrease it during stable periods. Wang et al. ran their meta-RL system on the same task, and it varied its learning rate in ways that mimicked human performance. This learning again occurred after weights were fixed, and notably, between the end of training and the end of the task, the learning rates of the two algorithms had diverged dramatically.


The account detailed by Botvinick and Wang et al. strikes me as a relatively clear example of mesa-optimization, and I interpret it as tentative evidence that the attractor toward mesa-optimization is strong. [Edit: Note that some commenters, like Rohin Shah and Evan Hubinger, disagree].

These researchers did not set out to train RNNs in such a way that they would turn into reinforcement learners. It just happened. And the researchers seem to think this phenomenon will occur spontaneously whenever “a very general set of conditions” is met, like the system having memory, being trained via RL, and receiving a related sequence of inputs. Meta-RL, in their view, is just “an emergent effect that results when the three premises are concurrently satisfied... these conditions, when they co-occur, are sufficient to produce a form of ‘meta-learning’, whereby one learning algorithm gives rise to a second, more efficient learning algorithm.”

So on the whole I felt alarmed reading this. That said, if mesa-optimization is a standard feature[4] of brain architecture, it seems notable that humans don’t regularly experience catastrophic inner alignment failures. Maybe this is just because of some non-scalable hack, like that the systems involved aren’t very powerful optimizers.[5] But I wouldn't be surprised if coming to better understand the biological mechanisms involved led to safety-relevant insights.

Thanks to Rafe Kennedy for helpful comments and feedback.

  1. The authors hypothesize that DA is a model-free RL algorithm, and that the spinoff (mesa?) RL algorithm it creates within PFC is model-based, since that’s what happens in their ML model. But they don’t cite biological evidence for this. ↩︎

  2. Depending on what portion of memories are encoded in this way, it may make sense for cryonics standby teams to attempt to reduce the supraphysiological intracellular release of dopamine that occurs after cardiac arrest, e.g. by administering D1-receptor antagonists. Otherwise entropy increases in PFC dopamine distributions may result in information loss. ↩︎

  3. They demonstrated this phenomenon (continued learning after weights were fixed) in a variety of other contexts, too. For example, they cite an experiment in which manipulating DA activity was shown to directly manipulate monkeys’ reward estimations, independent of actual reward—i.e., when their DA activity was blocked/stimulated while they pressed a lever, they exhibited reduced/increased preference for that lever, even if pressing it did/didn’t give them food. They trained their meta-RL system to simulate this, again demonstrated similar performance as the monkeys, and again noticed that it continued learning even after the weights were fixed. ↩︎

  4. The authors seem unsure whether meta-RL also occurs in other brain regions, since for it to occur you need A) inputs carrying information about recent actions/rewards, and B) network dynamics (like recurrence) that support continual activation. Maybe only PFC has this confluence of features. Personally, I doubt it; I would bet that meta-RL (and other sorts of mesa-optimization) occur in a wide variety of brain systems, but it would take more time than I want to allocate here to justify that intuition. ↩︎

  5. Although note that neuroscientists do commonly describe the PFC as disproportionately responsible for the sort of human behavior one might reasonably wish to describe as “optimization.” For example, the neuroscience textbook recommended on lukeprog’s textbook recommendation post describes PFC as “often assumed to be involved in those characteristics that distinguish us from other animals, such as self-awareness and the capacity for complex planning and problem solving.” ↩︎


Ω 45