The backstory: Dopamine-supervised learning in mammals
Everyone knows about one thing that a dopamine signal can do in a mammal brain: it can serve as a Reward Prediction Error signal that drives reinforcement learning, i.e. finding good actions within a high-dimensional space of possible actions.
I strongly agree that this is one of the things that dopamine does.
But meanwhile I've been advocating that dopamine also plays a different role in other parts of the mammal brain: a dopamine signal can provide the supervisory signal for supervised learning in a one-dimensional action space—see Big Picture of Phasic Dopamine.
(Or 47 dopamine signals can supervise learning in a 47-dimensional action space, etc. etc.)
The parts of the brain where I think dopamine-supervised learning is happening are, more or less, (1) parts of the amygdala, and (2) the agranular parts of medial prefrontal cortex. The ventral striatum is also involved—again see Big Picture of Phasic Dopamine. Anyway, I think these parts of the brain house dozens-to-hundreds of copies of the same supervised learning algorithm, each sending its output down to the hypothalamus & brainstem, and each in turn receiving its own individual supervisory dopamine signal coming back up from the brainstem.
For example, one of the supervised learning circuits might have an output line to the hypothalamus & brainstem whose signals mean: “We need to salivate now!” The brainstem then has a dopamine supervisory signal going back up to that circuit, which it uses to correct those suggestions with the benefit of hindsight. Thus, if I suddenly find myself with a mouth chock-full of salt, then in hindsight, I should have been salivating in advance. The brainstem knows that I should have been salivating—the brainstem has a direct input from the taste buds—and thus the brainstem can put that information into the dopamine supervisory signal.
Here I’ll zoom out a bit so that we can see both kinds of dopamine signals, the reinforcement learning dopamine, and the supervised learning dopamine:
So that’s my hypothesis about dopamine-supervised learning in mammals. I think there are good reasons to believe it (again see here), but the reasons are all kinda indirect and suggestive. I don't have a super solid case.
So, I was delighted when a friend (Adam Marblestone) sent me crystal-clear evidence of dopamine-supervised learning!
The only catch was … it was in drosophila! So not exactly what I was looking for. But still cool!!
By the way, nobody ever told me, but right now is apparently a golden era in drosophila research—pretty much all 135,000 drosophila neurons have been mapped, or will be in the near future. See “Drosophila Connectome” on wikipedia. For my part, I’m very far from a drosophila expert—pretty much everything I know about drosophila comes from “The connectome of the adult Drosophila mushroom body provides insights into function”, Li et al. 2020, and also Larry Abbott’s 2018 talk which is nice and pedagogical. Also, I have some drosophila living in my compost bin at home. So yeah, I’m not an expert. But I’ll do my best, and please let me know if you see any errors.
Obvious question: If I’m correct that there’s dopamine-supervised learning in mammals … and if there’s also dopamine-supervised learning in drosophila … could they be homologous?? I don’t know! I guess it’s possible. But it could also be a coincidence (convergent evolution). Anyway, it's at least an interesting point of comparison, and I think worth the time to learn about, even if you (like me) are ultimately only interested in humans. (Or AIs.)
Algorithmic background—cerebellum-style supervised learning
Ironically, after getting you all excited about that possible homology, I will now immediately switch from supervised learning in the mammalian cortex to supervised learning in the mammalian cerebellum. The cerebellum does not use dopamine as a supervisory signal—it uses climbing fibers instead. I think any resemblance between drosophila and the cerebellum is almost definitely convergent evolution. But still, it’s a clean and straightforward resemblance, whereas cortex has a more complicated learning algorithm that I don’t want to get into here.
So, let’s talk about the cerebellum. As far as I can tell, the cerebellum is kinda like a giant memoization system: it watches the activity of other parts of the brain (including parts of the neocortex and amygdala, and maybe other things too), it memorizes patterns in what signals those parts of the brain send under different circumstances, and when it learns such a pattern, it starts sending those same signals itself—just earlier. (See my post here for more.)
How does it do that?
The basic idea is: we have a bunch of "context" lines carrying information about various different aspects of what's happening in the world. The more different context lines, the better, by and large—the algorithm will eventually find the lines bearing useful predictive information, and ignore the rest. Then there are one or more pairs of (output signal, supervisor signal). The learning algorithm’s goal is for each output to reliably fire a certain amount of time before the corresponding supervisor.
Here's one way that something like this might work. Let’s say I want Output 1 to fire 0.5 seconds before Supervisor 1. Well, for each context signal, I can track (1) how likely it is to fire 0.5 seconds before Supervisor 1; and (2) how likely it is to fire in general. If that ratio is high, then I’ve found a good predictive signal! So I would strengthen the synapse between that context line and the output line. (This is just a toy example; see Fine Print at the end.)
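To make that toy example concrete, here is a minimal simulation in Python. Everything in it is made up for illustration: the firing rates, the 5-timestep stand-in for "0.5 seconds", and the uninformative noise lines.

```python
import random

random.seed(0)

DELAY = 5        # timesteps, standing in for the "0.5 seconds" in the text
T = 10_000       # length of the simulated run

# Supervisor fires every 50 timesteps.
supervisor = [1 if t % 50 == 0 else 0 for t in range(T)]

# Three hypothetical context lines: line 0 reliably fires DELAY steps
# before the supervisor; lines 1 and 2 are uninformative noise.
context = [
    [supervisor[(t + DELAY) % T] for t in range(T)],
    [1 if random.random() < 0.02 else 0 for _ in range(T)],
    [1 if random.random() < 0.20 else 0 for _ in range(T)],
]

ratios = []
for line in context:
    fires = sum(line)
    # (1) how often this line's firing is followed, DELAY steps later, by
    # the supervisor, divided by (2) how often it fires at all
    hits = sum(1 for t in range(T - DELAY) if line[t] and supervisor[t + DELAY])
    ratios.append(hits / max(fires, 1))

print([round(r, 2) for r in ratios])  # predictive line near 1, noise lines near 0
```

The predictive line scores near 1 while the noise lines score near 0, so the algorithm would strengthen the first synapse and ignore the rest.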
Here’s a diagram with a bit more detail:
The new thing that this diagram adds—besides the anatomical labels—is the “pattern separation” step at the left. Basically, there’s a trick where you take some context lines, randomly combine them in tons of different ways, sprinkle in some nonlinearity, and voila, you have way more context lines than you started with! This enables the system to learn a wider variety of possible patterns in a single neural-network layer. This is the function of the tiny granule cells in the cerebellum, which famously comprise more than half of neurons in the human brain.
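To see why this helps, here is a toy sketch, with random conjunctions standing in for the granule-cell nonlinearity (my own simplification, not a model of real cells). Two raw lines cannot linearly encode XOR, but after random nonlinear expansion, a single trainable layer can:

```python
import random

random.seed(0)

# Each "expanded line" fires iff two randomly chosen conditions on the raw
# inputs both hold -- a crude stand-in for the granule-cell nonlinearity.
N_EXP = 60
conds = [random.sample([(0, 0), (0, 1), (1, 0), (1, 1)], 2) for _ in range(N_EXP)]
# each condition is (input_index, required_value); contradictory pairs never fire

def expand(x):
    return [1 if all(x[i] == v for i, v in c) else 0 for c in conds]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR target

# a single trainable layer (perceptron rule) on the expanded lines
w, bias = [0.0] * N_EXP, 0.0
for _ in range(50):
    for x, target in data:
        h = expand(x)
        pred = 1 if sum(wi * hi for wi, hi in zip(w, h)) + bias > 0 else 0
        w = [wi + (target - pred) * hi for wi, hi in zip(w, h)]
        bias += target - pred

preds = [1 if sum(wi * hi for wi, hi in zip(w, expand(x))) + bias > 0 else 0
         for x, _ in data]
print(preds)
```

The point is just the expansion-plus-nonlinearity trick: no single raw line predicts the XOR target, but some of the random combinations do, and one learned layer can pick those out.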
Now let’s talk about drosophila!
The drosophila equivalent of the pattern-separating cerebellar granule cells is “Kenyon Cells” (KCs), which take the ~35-dimensional space of detectable odors and turn it into a ~1000-dimensional space of odor patterns (ref). The axons of these cells are the “context lines” that our learning algorithms will sculpt into a predictive model. More recently, Li et al. 2020 found Kenyon Cells with other kinds of context information besides odor, including visual information, temperature, and taste. These were (at least partly) segregated—this allows the genome to, say, train a model that makes predictions based on odor information, and also train a model that makes predictions based on visual information, and then give one of those two models a veto over the other. That’s just a made-up example, but I can imagine things like that being useful.
The Kenyon Cell axons, carrying context information, form a big bundle of parallel fibers, called the “mushroom body”. (Eventually this splits into a handful of smaller bundles of parallel fibers.)
That brings us to the Mushroom Body Output Neurons (MBONs)—the supervised learning model output lines, the drosophila equivalent of Purkinje cells in the cerebellum. The synapses between the context lines and the MBON are edited by the learning algorithm. And as far as I can tell, the point of this system is just like the cerebellum: the MBON signals (i.e., the outputs from this trained model) will learn to approximate the supervisory signal, but shifted a bit earlier in time. (It could also be sign-flipped, and there are other complications—see “Fine Print” section below.) The “shifted earlier in time” allows the fly to predict problems and opportunities, instead of merely reacting to them. Otherwise there would hardly be any point in going to all this effort! After all, we already have the supervisory signal!
(Well, OK, supervised learning is good for a couple other things besides time-shifting—see the novelty-detection example below—but I suspect that time-shifting is the main thing here.)
Li et al. 2020 also found that there were “atypical” MBONs that connected to not only Kenyon Cells but also a grab-bag of other signals in the brain. I figure we should think of these as just even more context signals. Again, by and large, the more different context signals, the better the trained model!
They also found some MBON-to-MBON connections. If any of those synapses are plastic, I would just assume it’s the same story: one MBON is just serving as yet another context line for another MBON. (I surmise that the recurrent connections in the cerebellum are there for the same reason.) Are the synapses plastic though? I hear that it’s unknown, but that smart money says “probably not plastic”. So in that case, I guess the MBONs are probably gating each other, or doing some other such logical operation.
Finally, the last ingredient is the supervisory signal. Each supervised learning algorithm (MBON) has its own supervisory dopamine signal. (Well, more or less—see below.) In other words, dopamine is playing pretty much the same role in fruit flies as I was thinking for the mammal amygdala / agranular PFC / etc. Very cool!
Fine print

I oversimplified a couple of things above for readability; here are more details.
1. Dopamine could have the opposite sign, and could be either ground truth or error, I dunno.
I’ve been talking in broad terms about dopamine being a supervisory signal. There are various ways it could work in detail. For example, dopamine could be either an “error signal” (the circuit fired too much / too little), or a “ground truth signal” (the circuit should have fired / should have not fired). That depends on whether the subtraction is upstream or downstream of the dopamine neurons. I don’t know which is right, and anyway it doesn’t matter for the high-level discussion here.
Also, the dopamine could be the opposite sign, saying “you should NOT have fired just now”.
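Here is a sketch of those bookkeeping conventions as delta-rule updates on a linear unit. The function names and conventions are mine, purely for illustration:

```python
def update_from_error(w, context, error, lr=0.1):
    # dopamine as an error signal: "you fired too much / too little by `error`";
    # the subtraction happened upstream, before the dopamine neuron
    return [wi + lr * error * ci for wi, ci in zip(w, context)]

def update_from_ground_truth(w, context, output, target, lr=0.1):
    # dopamine as ground truth: "you should have fired `target`";
    # the subtraction happens downstream, inside the learning circuit
    return update_from_error(w, context, target - output, lr)

def update_opposite_sign(w, context, output, dopamine, lr=0.1):
    # opposite-sign convention: dopamine means "you should NOT have fired",
    # so the effective target is the flipped dopamine signal
    return update_from_error(w, context, (1 - dopamine) - output, lr)

# all three agree once you unpack the conventions:
w, ctx = [0.5, 0.2], [1.0, 0.0]
out, target = 0.8, 1.0
print(update_from_error(w, ctx, target - out))
print(update_from_ground_truth(w, ctx, out, target))
print(update_opposite_sign(w, ctx, out, dopamine=1 - target))
```

All three produce the same weight update; the difference is only where the subtraction happens and which sign the dopamine neuron carries.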
1A. Novelty-detector example.
…In fact, opposite-sign dopamine is probably more likely, because we have at least one example where it definitely works that way—see Hattori et al. 2017:
- The fruit-fly “MBON-α’3” neuron sends outputs which we can interpret as: “I need to execute an alerting response!!”
- The corresponding PPL1-α’3 dopamine supervisory signal can be interpreted as “Things are fine! There was no need to execute an alerting response just now!”
What would happen, for the sake of argument, if the output neuron were simply wired directly to the dopamine neuron? It would turn into a novelty detector, right? Think about it: Whatever odor environment it’s in, the supervisory signal gradually teaches it that this odor environment does not warrant an alerting response. Pretty cool!
(Of course, to get a novelty detector, you also need some maintenance process that gradually resets the synapses back to some trigger-happy baseline. Otherwise it would eventually learn to never fire at all.)
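Here is a cartoon version of that loop, with made-up constants for the depression and recovery rates:

```python
BASELINE = 1.0   # trigger-happy initial synapse strength (made up)
DEPRESS = 0.5    # fraction removed each time dopamine says "things are fine"
RECOVER = 0.01   # slow maintenance drift back toward baseline

N_ODORS = 5
w = [BASELINE] * N_ODORS   # one synapse per (hypothetical) odor channel

def present(odor):
    """Present an odor; return the alerting response, then learn."""
    response = w[odor]
    # supervisory dopamine fires ("no alert was needed just now"),
    # depressing the synapse that just drove the response...
    w[odor] -= DEPRESS * w[odor]
    # ...while every synapse slowly recovers toward baseline
    for i in range(N_ODORS):
        w[i] += RECOVER * (BASELINE - w[i])
    return response

familiar = [present(0) for _ in range(10)]   # the same odor, over and over
novel = present(1)                           # then a brand-new odor
print(round(familiar[0], 2), round(familiar[-1], 2), round(novel, 2))
```

The familiar odor's response decays toward zero across presentations, while the novel odor still gets the full trigger-happy response, which is exactly the novelty-detector behavior.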
2. Dopamine is involved in inference, not just learning
In Big Picture of Phasic Dopamine, I suggested that dopamine plays a role in both the learning algorithm (what to do in similar situations in the future) and the inference algorithm (what to do right now). I used this fun diagram as an example:
…But my discussion there was for reinforcement learning. When I then moved on to discussing supervised learning, I only talked about dopamine as a part of the learning algorithm, not the inference algorithm.
But it turns out that, at least in fruit flies, some of the supervised learning modules have dopamine signals playing inference-algorithm roles too—in particular there can be dopamine signals that function as a modulator / gate on the circuit. This was clearly documented in Cohn et al. 2015—see Fig. 6, where the ability of KC firing to trigger MBON firing is strongly modulated by the presence or absence of a certain dopamine signal under experimental control. Another example is Krashes et al. 2009, which showed that flipping certain hunger-related dopamine neurons on and off made the fruit fly start and stop taking actions based on learned food-related odor preferences. Both these papers demonstrated an inference-time gating effect, but I would certainly presume that the same signal gates the learning algorithm too. I’m getting all these references from Larry Abbott’s talk by the way.
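As a cartoon of what that inference-time gating could look like (the multiplicative form and the numbers are my own assumptions, not taken from either paper):

```python
def mbon_response(context, w, dopamine_gate):
    # learned KC->MBON readout, multiplicatively gated by a state-dependent
    # dopamine signal (e.g. hunger): same weights, but no gate, no behavior
    drive = sum(wi * ci for wi, ci in zip(w, context))
    return drive * dopamine_gate

w_food = [0.9, 0.1, 0.0]   # learned food-odor preferences (made-up weights)
odor = [1, 0, 0]           # the trained odor is present

hungry = mbon_response(odor, w_food, dopamine_gate=1.0)
fed = mbon_response(odor, w_food, dopamine_gate=0.0)
print(hungry, fed)   # same learned weights; only the dopamine state differs
```

The learned preferences are intact in both calls; flipping the dopamine gate switches whether they are expressed, which is the inference-algorithm role, separate from any effect on learning.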
3. I oversimplified the cerebellum, sorry
The mammalian cerebellum has a bunch more bells and whistles wrapped around that core algorithm, which are not present in drosophila and are not particularly relevant for this post. For example, there are various preprocessing steps on the context lines (ref), the Purkinje cells are actually a hidden layer rather than the output (ref), there are dynamically-reconfigurable oscillation modules for perfecting the timing, etc. Also, as with dopamine above, I don't know whether the climbing fiber signals are ground truth vs. errors, and they may also be sign-flipped. Just wanted to mention those things for completeness.
(Thanks Jack Lindsey & Adam Marblestone for critical comments on earlier drafts.)