(Update Jan. 2023: This article has important errors. I am leaving it as-is in case people want to see the historical trail of me gradually making progress. Most of the content here is revised and better-explained in my later post series Intro to Brain-Like AGI Safety, see especially Posts 5 and 6. Anyway, I still think that this post has lots of big kernels of truth, and that my updates since writing it have mostly been centered around how that big picture is implemented in neuroanatomy. More detail in this comment.)
Target audience: Everyone, particularly (1) people in ML/AI, and (2) people in neuroscience. I tried hard to avoid jargon and prerequisites. You can skip the parts that you find obvious.
Context: I’m trying to make sense of dopamine in the brain—and decision-making and motivation more generally. This post is me playing with ideas; expect errors and omissions (and then tell me about them!).
This post is a bit long; I’m worried no one will read it. So in a shameless attempt to draw you in, here’s a Drake meme...
(Thanks Adam Marblestone, Trenton Bricken, Beren Millidge, Connor Leahy, Jeroen Verharen, Ben Smith, Adam Shimi, and Jessica Mollick for helpful suggestions and criticisms.)
If you haven’t read The Alignment Problem by Brian Christian, then you should. Great book! I’ll wait.
…Welcome back! As you now know (if you didn’t already), Temporal Difference (TD) learning (wiki) is a reinforcement learning algorithm invented by Richard Sutton in the 1980s. Let’s say you’re playing a game with a reward. As you go through, you keep track of a value function, a.k.a. “expected sum of future rewards”. (I’m ignoring time-discounting for simplicity.)
How do you know the expected future reward? After all, predictions are hard. So you start with a random function or constant function or whatever as your value function, and update it from experience to make it more and more accurate. For example, maybe you seem to be in a very bad chess position, with the queen exposed in the center of the board. Then you make a move, and then your opponent makes a move, and then all of a sudden you’re in a very very good position! Well, hmm, maybe your old position, with the exposed queen, wasn’t quite as bad as you thought!! So next time you’re in a similar situation, you’ll be a bit more optimistic about things; i.e., you now assign that position a higher value.
That’s an example of a positive reward prediction error (RPE). The general formula for RPE in TD learning is:
Reward Prediction Error = RPE = [(Reward just now) + (Value now)] – (Previous value)
...and this RPE is used to update the previous value.
So that’s TD learning. If you keep iterating, the value function converges to the desired “value = time-integral of expected future rewards”.
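To make the update rule concrete, here is a minimal sketch of TD learning on a made-up game: a five-step corridor where a reward of 1 arrives only at the very end. The corridor, the learning rate, and the function name `td_learn` are all my own illustrative assumptions, not anything from the literature. As in the text, time-discounting is ignored.

```python
# Toy TD learning on a hypothetical 5-step corridor; reward arrives only at the end.
def td_learn(n_states=5, final_reward=1.0, learning_rate=0.5, episodes=200):
    # value[s] estimates the expected sum of future rewards from state s.
    # The extra last entry is the terminal state, whose value stays 0.
    value = [0.0] * (n_states + 1)
    for _ in range(episodes):
        for s in range(n_states):
            next_s = s + 1
            reward = final_reward if next_s == n_states else 0.0
            # RPE = [(reward just now) + (value now)] - (previous value)
            rpe = reward + value[next_s] - value[s]
            # The RPE is used to update the previous value.
            value[s] += learning_rate * rpe
    return value

values = td_learn()
```

After enough episodes, every non-terminal state ends up valued at about 1, because from any of them exactly one unit of reward is still to come. Early in training only the last state has a nonzero value; the information creeps backward one step per episode, which is the same backward creep that shows up in the dopamine experiments discussed next.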
Now in the 1980s-90s, Wolfram Schultz did some experiments on monkeys, while measuring the activity of dopamine neurons in the midbrain.
I’ll pause here to help the non-neuroscientists follow along. In the midbrain (part of the brainstem) are two neighboring regions called “VTA” and “SNc”. In these regions you find the inputs and cell bodies (dendrites and somas) of almost all the dopamine-emitting neurons in the brain. These neurons’ axons (output lines) then generally exit the midbrain and go off to various distant regions of the brain, and that’s where they dump their dopamine.
Anyway, Schultz found (among other things) three intriguing results:
Peter Dayan and Read Montague saw the connection: All three of these results are perfectly consistent with dopamine being the RPE signal of a TD learning algorithm! This became a celebrated and widely-cited 1997 paper, and a cornerstone of much neuroscience research since.
Oh, one more terminology side note:
There are a number of wrinkles suggesting that there’s more to the story than simple TD learning:
To make sense of these facts, and much more, let’s dive deeper into how the brain is built and what different parts are doing!
Dopamine is closely related to a circuit called the cortico-basal ganglia-thalamo-cortical loop. I’ll just call it “loop” for short.
A classic 1986 paper found that a bunch of brain circuitry consists of these loops, running in parallel, following a path from cortex to basal ganglia (striatum then pallidum) to thalamus and then back to where it started. (If you’re not familiar with the neuroanatomy terms here, that’s fine, I’ll get back to them.)
(Side note to appease the basal ganglia nerds: This loop is real, and it’s important, but I’m leaving out various other branches and supporting circuitry needed to make it work—see the frankly terrifying Fig. 6 here. I’ll get back to that below, but basically this simplified picture of the loop will be good enough to get us through this post.)
What are these loops doing? That’s going to be a big theme of this blog post. I’ll get back to it.
Where are these loops? All over the telencephalon. And what is the telencephalon? Read on:
The telencephalon (aka “cerebrum”) is one of ~5 major divisions of the brain, differentiating itself from the rest of the brain just a few weeks into human embryonic development. The telencephalon is especially important in “smart” animals, comprising 87% of total brain volume in humans (ref), 79% in chimps (ref), 77% in certain parrots, 51% in chickens, 45% in crocodiles, and just 22% in frogs (ref). The human telencephalon consists of the neocortex (“the home of human intelligence”, more-or-less), some non-“neo” cortex areas like the hippocampus (classified as “allocortex”, more on which below), as well as the basal ganglia, the amygdala, and various more obscure bits and bobs. It seems at first glance that “telencephalon” is just a big grab-bag of miscellaneous brain parts—i.e., a category that only embryologists have any reason to care about. At least, that was my working assumption. ...Until now!
It turns out, when people peered into the telencephalon, they found a unifying structure hidden beneath! The breakthrough, as far as I can tell, was Swanson 2000. I learned about it mainly from the excellent book The Evolution of Memory Systems by Murray, Wise, Graham. (Or see this shorter paper by the same authors with the relevant bit.)
It turns out that there’s a remarkable level of commonality across these superficially-different structures. The amygdala is actually a bunch of different substructures, some of which look like cortex, and others that look like striatum. There’s a thing called the “lateral septum” whose neurons and connectivity are such that it looks like “the striatum of the hippocampus”. And practically everything is organized into those neat parallel loops through cortex-like, striatum-like, and pallidum-like layers!
Cortex-like part of the loops
Amygdala [basolateral part]
Ventromedial prefrontal cortex
Motor & “planning” cortex
Striatum-like part of the loops
Amygdala [central part]
Pallidum-like part of the loops
The entire telencephalon—neocortex, hippocampus, amygdala, everything—can be divided into cortex-like structures, striatum-like structures, and pallidum-like structures. If two structures are in the same column in this table, that means they’re wired together into cortico-basal ganglia-thalamo-cortical loops. This table is incomplete and oversimplified; for a better version see Fig. 4 here.
So many loops! Loops all over the place! Are all those different loops doing the same type of calculation? Let’s take that as a hypothesis to explore, and see how far we get. (Preview: I’m actually going to argue that the loops are not all doing the exact same calculation, but that they’re similar—variations on a theme.)
I’ve previously written (here) about, um, let’s call it, “neocortical learning-from-scratch-ism”. That’s the idea that the neocortex starts out totally useless to the organism—outputting fitness-improving signals no more often than chance—until it starts learning things. In particular, if this idea is right, then all adaptive neonatal behavior is driven by other parts of the brain, especially the brainstem and hypothalamus. That idea doesn't sound so crazy after you learn that the brainstem has its own whole parallel sensory-processing system (in the midbrain), and its own motor-control system, and so on. (Example: apparently the mouse has a brainstem bird-detecting circuit wired directly to a brainstem running-away circuit.) In that previous article I called this idea “blank slate neocortex”, which in retrospect was probably an unnecessarily confusing and clickbaity terminology. Here’s a pair of alternate framings that maybe makes the idea seem a bit less wild:
OK, so I’ve already been a “neocortical learning-from-scratch-ist” since, like, last year. And from what little I know about the hippocampus, I think of it as a thing that stores memories (whether temporarily or permanently, I’m not sure), so I’ve always been a “hippocampal learning-from-scratch-ist” too. The striatum is another part of the telencephalon, and as soon as I started reading and thinking about its functional role (see below), I felt like it’s probably also a learning-from-scratch component.
...I seem to be sensing a pattern here...
Oh what the hell. Maybe I should be a learning-from-scratch-ist about the whole frigging telencephalon. So again, that would be the claim that the whole telencephalon starts out totally useless to the organism—outputting fitness-improving signals no more often than chance—until it starts learning things within the animal’s lifetime.
Incidentally, I’m also a cerebellum learning-from-scratch-ist (see my post here). So I guess I would propose that as much as 96% of the human brain by volume is “learning from scratch”: pretty much everything but the hypothalamus and brainstem. Sounds like a pretty radical claim, right? ...Until you think, “Hang on, isn't the information capacity of the brain like 10,000× larger than the information content of the genome?” So maybe that’s not a radical claim! Maybe I should be saying to myself, "Only 96%??"
Anyway, I haven’t dug (much) into the evidence for or against telencephalic learning-from-scratch-ism, and I'm not sure what other thinkers think. But I’m taking it as a working assumption—a hypothesis to explore.
...And I’m already finding it a very fruitful hypothesis! In particular, I was not previously thinking about the amygdala in a learning-from-scratch-ism framework. And then when I tried, everything kinda clicked into place immediately! Well, at least, compared to how confused I was before. I’ll discuss that below.
So that was learning-from-scratch-ism. Separately, I’ve also previously written about “neocortical uniformity” (e.g. here, here)—the hypothesis that every part of the neocortex is more-or-less running the same learning-and-inference algorithm in parallel. To be clear, if this idea is correct at all, then it definitely comes with two big caveats: (1) the learning algorithm has different “hyperparameters” in different places, and (2) the neocortex is seeded with an innate gross wiring diagram that brings together different information streams that have learnable and biologically-important relationships (ML people can think of it as loosely analogous to a neural architecture).
So anyway, if “neocortical uniformity” is the idea that every part of the neocortex is running a more-or-less similar learning algorithm, then I guess “telencephalic uniformity” would say that not only the whole neocortex but also the hippocampus, the cortex-ish part of the amygdala, etc. are doing that same algorithm too. And likewise that all the striatum-like stuff is running a "common striatal algorithm", and so on.
Do I believe that? To a first approximation:
Instead of “uniformity” for the cortex layer, maybe I’ll go with “family resemblance”. They are, after all, literally family. Like, we mammals have that 6-layer-neocortex-vs-3-layer-allocortex distinction I mentioned, but our ancestors probably just had uniform allocortex architecture everywhere (and most modern reptiles still do) (ref). (Birds independently evolved a different modification of the allocortex, with a functionally-similar end result, I think.)
(Fun fact: The “basolateral complex” of the amygdala is apparently neither allocortex nor neocortex per se, but rather a bottom layer of neocortex that peeled off from the rest! Not sure whether it's just detached spatially while still being wired up as a traditional layer 6B, or whether its current wiring is now wholly unrelated to its historical roots, or what. The claustrum is also in this category, incidentally. See Swanson 1998 & 2000.)
Finally, let’s get back to the cortico-basal ganglia-thalamo-cortical loop. What is it for? Here’s the toy model currently in my head. It has two parts, inference (what to do right now) and learning (editing the connections so that future inference steps give better answers). Here they are:
I make no pretense to originality here, and this model is obviously oversimplified, but it’s serving me well so far. So, print these pictures out, tattoo them on your eyelids, whatever, because I’ll be going back to these over and over for the rest of this blog post.
Here are some discussion points and nuances to go along with the toy loop model:
The “value function” calculated by the striatum is not as simple as a database with one entry for each possible thing to do. Among other things, it’s bound to be context-dependent. Singing in the shower is good, singing in the library is bad.
Here’s a real example. In Fee & Goldberg 2011, they studied zebra finches learning to sing their song (it’s a lovely song by the way, here’s a video). I have to pause here to warn that bird papers are annoying because practically every part of the bird telencephalon has a different name from the corresponding part of the mammal telencephalon. Anyway, if you look at Figure 4B, “HVC” (some part of the cortex) is providing high-level context: what song am I singing and how far along am I in that song? Meanwhile “LMAN” (a different part of the cortex) seems to be lower-level: if I understand correctly, it holds a catalog of sounds that the bird knows how to make, and how to make them. Then we have the enigmatically-named “Area X”, which is part of the striatum. HVC is widely broadcasting its context signal into Area X, while meanwhile LMAN is making narrow, topographic connections to Area X, which then loop back to the exact same part of LMAN. Thus the bird can learn that making a certain sound is high-value in some specific part of the song, but low-value in a different part of the song.
This context idea is reflected in the relative sizes of different parts of the basal ganglia:
One of the remarkable features of [basal ganglia] organization is the massive convergence at every level from cortex to [striatum neurons] to pallidal neurons, and to thalamic neurons that project back to cortex…. In rats, roughly three million [striatum neurons] converge onto only 30,000 pallidal output neurons and subsequently onto a similar number of thalamic neurons .... In humans, a similarly massive convergence from >100 million [striatum neurons] to <50,000 pallidal neurons is reported…. In the context of our model, the reason for this convergence becomes apparent. If the role of Area X [=striatum] is to bias the variable activity of LMAN [=low-level motor cortex] neurons, then the feedback from DLM [=pallidum] to LMAN requires only as many channels as LMAN contains. In contrast, [striatum neurons] in Area X evaluate the performance of each LMAN channel separately at each moment in the song, which requires many more neurons. (Fee & Goldberg 2011)
You’ll note that my toy loop model above avoids the term “reward”. It’s too specific. Go back and look at the diagram with an open mind. What is the dopamine signal signaling? Here’s the most general pattern:
A positive phasic dopamine signal tells a cortico-basal ganglia-thalamo-cortical loop to be more active next time we’re in a similar situation. A negative phasic dopamine signal tells a loop to be less active next time we’re in a similar situation.
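Here is a minimal sketch of that pattern, combined with the context-dependence point from earlier (singing in the shower versus singing in the library). The class `ToyLoop`, its lookup-table storage, and the learning rate are hypothetical simplifications of mine. I am not claiming the striatum literally stores a table, only illustrating the direction of the update.

```python
from collections import defaultdict

class ToyLoop:
    """Hypothetical stand-in for one loop: a context-dependent 'how active to be' score."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        # Keyed by (context, action); unseen pairs start at 0.0.
        self.strength = defaultdict(float)

    def activity(self, context, action):
        return self.strength[(context, action)]

    def dopamine_update(self, context, action, dopamine):
        # Positive phasic dopamine: be more active next time in a similar situation.
        # Negative phasic dopamine: be less active next time.
        self.strength[(context, action)] += self.learning_rate * dopamine

loop = ToyLoop()
loop.dopamine_update("shower", "sing", +1.0)
loop.dopamine_update("library", "sing", -1.0)
```

The same action ends up with a positive score in one context and a negative score in the other. Note that nothing in the update says what the dopamine signal means; that is left open on purpose, since the next sections argue that different loops receive different signals.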
My proposal is that this pattern is valid for all loops, but nevertheless different parts of the telencephalon use different dopamine signals to do different things. Remember, as I mentioned above, we already know that there isn’t just one dopamine signal.
I currently see three categories of (phasic) dopamine signals. I’ll list them here, then go through them one-by-one in the next subsections: (1) Reward Prediction Error (RPE) for what I’ll call the Success-In-Life Reward, the classic kind of “reward” that we intuitively think about, i.e. an approximation of how well the organism is maximizing its inclusive genetic fitness; (2) RPE for local rewards specific to certain circuits—e.g. negative dopamine specifically to motor output brain areas when a motor action is poorly executed; (3) supervised learning error signals—e.g. if you get whacked in the head, then there’s a hardcoded circuit that says “you should have flinched”, and that signal can train a loop that specifically triggers a flinch reaction.
Analogy: You did a multiple-choice test, and then later the teacher hands it back:
My discussion here will be a bit in the tradition of (and inspired by) Marblestone, Wayne, Kording 2016, in the sense that I’m arguing that part of the brain is running a learning algorithm, with different training signals used in different areas. The big difference is that I want to focus on the dopamine signals, whereas they focused on acetylcholine signals. (I think acetylcholine mainly controls learning rate, so it’s not directly a training signal.) Also, as mentioned above, I’m not telling the whole “different training signals in different places” story in this post; the other part of that story is predictive (a.k.a. self-supervised) learning, in which different parts of the cortex are trained to predict different things. But that learning algorithm is not related to the loops, and it’s not related to dopamine, and it’s outside the scope of this post. (It’s related to how the cortex “selects proposals”.)
Start with the classic, stereotypical kind of reward, the reward that says “pain is bad” and “social approval is good” and so on. By and large, this reward should be some kind of heuristic approximation to the time-derivative of the organism’s inclusive genetic fitness. I’ll call it “Success-In-Life Reward”, to distinguish it from other reward functions that we’ll discuss in the next section.
Where does that reward signal come from? My short answer is: the hypothalamus and brainstem calculate it, on the basis of things like pain inputs (bad!), sweet taste inputs (good!), hunger inputs (bad!), and probably hundreds of other things. Boy, I would give anything for the complete exact formula for Success-In-Life Reward! Like, it’s not literally “The Meaning Of Life”, but it might be the closest thing that neuroscience can get us. I’ll get back to this later.
You said “Reward Prediction Error” (RPE); where do the “predictions” come from? Hold that thought, we’re not ready to answer it yet, but I’ll get back to it in a later section.
Why is this particular reward function useful? Because parts of the telencephalon are “deciding what to do” in a general way. Should you go out in the rain or stay inside? Should you eat the cheese now or save it for later? If the animal is to learn to make systematically good decisions of this type, then we need the decisions to be made by a learning algorithm trained to maximize “Success-In-Life Reward”. So I especially expect this reward function to be used for parts of the brain making high-level decisions involving cross-domain tradeoffs.
Those areas include, I think, at least some parts of “granular prefrontal cortex” (don’t worry if you don’t know what that means) and the hippocampus. These areas are both making decisions involving cross-domain tradeoffs. Like in humans, the former is the place that “decides” to bring to consciousness the idea “I’m gonna roast some vegetables!” (out of all possible ideas that could have been brought to consciousness instead). And the hippocampus is the place that “decides” to bring to consciousness the idea “I’m gonna turn right at the fork to go to the farmstand!” (out of all possible navigation-related ideas that could have been brought to consciousness instead). Something like that, maybe, for example.
Think about the Millennium Falcon, with Han Solo in the gun turret while Chewbacca is up front piloting. Chewbacca’s steering could be perfect while Han’s aim is terrible, or vice-versa. If they have to share a single training signal, then the signal will be noisier for each of them. For example, sometimes Han will do a bad job, but still get a high score because Chewy did unusually well, and then Han will internalize that wrong message. This isn’t necessarily the end of the world (I imagine that, if you do it right, the noise will average away, and they’ll eventually learn the right thing), but the learning process may be slower. So I figure that if it’s possible to allocate credit and blame for performance variation between Han and Chewy, they would probably learn faster.
By the same token, your own body is a lumbering contraption controlled by thousands of dials and knobs in the brain, and different parts of your cortex are in control of different parts of this system. If the brain can allocate credit, and thus send different rewards to different areas, then I imagine that it will.
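A quick simulation can put a rough number on the Han-and-Chewy intuition. This is a toy under assumptions I made up for illustration: each round, Han's and Chewbacca's performances are independent draws from the same bell curve, the shared signal is just their sum, and a "wrong message" is any round where the sign of the signal disagrees with the sign of Han's own performance.

```python
import random

def wrong_message_rate(shared, trials=10_000, seed=0):
    """Fraction of rounds in which Han's training signal has the wrong sign."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wrong = 0
    for _ in range(trials):
        han = rng.gauss(0.0, 1.0)    # Han's aim this round, relative to his average
        chewy = rng.gauss(0.0, 1.0)  # Chewbacca's piloting this round
        signal = han + chewy if shared else han
        # Wrong message: Han did well but the signal is negative, or vice versa.
        if (han > 0) != (signal > 0):
            wrong += 1
    return wrong / trials

shared_rate = wrong_message_rate(shared=True)
private_rate = wrong_message_rate(shared=False)
```

Under these assumptions, the private signal never misleads Han, while the shared signal points the wrong way in roughly a quarter of rounds. The shared signal is still right more often than not, so the noise does average away eventually; it just takes more rounds.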
(Side note 1: I imagine that some ML readers are instinctively recoiling here: "Nooooo, The Bitter Lesson says that we'll get the best results by using end-to-end performance as the one and only input to our learning algorithm!" Well readers, if you want to think about animals, I think you'll need to put a bit more emphasis on "learning fast" and a bit less emphasis on “asymptotic performance”, compared to what you’re used to. After all, Pac-Man can keep learning after getting eaten, but an animal brain can't—well, not usually. So you gotta learn fast!)
(Side note 2: Backpropagation (and its more-biologically-plausible cousins) can allocate credit automatically. However, they require error gradients to do so. In supervised (or self-supervised) learning, that’s fine: we get an error gradient each query. But here we’re talking about reinforcement learning, where error gradients are harder to come by, as discussed in a later section.)
Here’s the clearest example I’ve seen in the literature:
For example, we recently identified song-related auditory error signals in dopaminergic neurons of the songbird ventral tegmental area (VTA).... We discovered that only a tiny fraction (<15%) of VTA dopamine neurons project to the vocal motor system - yet these were the ones that encoded vocal reinforcement signals. The majority of VTA neurons which project to other parts of the motor system did not encode any aspect of song or singing-related error. —Murdoch 2018 describing a result from Gadagkar 2016
Got that? Birds have an innate tendency to sing, and they learn to sing well by listening to themselves, and doing RL guided by whether the song “sounds right”, as judged by some other part of the brain (I suspect the tectum, in the brainstem). And those specific dopamine feedback signals, the ones that say whether the song sounds good, go only to the singing-related-motor-control part of the bird brain. Makes sense to me!
This is more speculative, but it seems to me that it should be feasible for some part of the brain to send a higher “reward” to a motor control loop when motion is rapid and energy-efficient and low-strain, and a lower “reward” when it isn’t. So I would assume that the “reward” going to low-level motor control loops should narrowly reflect the muscle’s energy expenditure, speed, strain, or whatever other metrics are biologically relevant.
(Some of you might be thinking here: What a stupid idea! If the reward is really like that, then the low-level motor control loops will gradually learn to do nothing whatsoever. That’s extremely rapid and energy-efficient and low-strain! Well, again, I'm hiding a lot of complexity behind the "propose candidate actions" part of my toy loop model above. If I’m not mistaken, the within-cortex dynamics will ensure that if a lower-level motor sequence isn't compatible with advancing the currently-active higher-level plan, then it won't get proposed in the first place!)
What of the literature? Schultz 2019 cites evidence for heterogeneous dopamine that goes along with larger movements, but not concise, stereotyped movements. That fits my theory pretty well: concise, stereotyped movements are more likely to use exactly the expected amount of energy and speed (and hence produce no motor-loop RPEs), whereas larger movements are likely to have some idiosyncratic differences in energy & speed compared to expectations (and hence produce positive or negative motor-loop RPEs). Moreover, the movement-related dopamine is heterogeneous, which is expected if some muscles are using slightly more energy than typical while others are using slightly less. Then my hypothesis would be that these movement-associated dopamine signals go specifically to brain areas associated with low-level motor control, but I haven’t yet found literature either for or against that.
Recently I wrote a post Is RL involved in sensory processing?, and was pretty weirded out when an astute reader told me that there was a non-frontal-lobe region of neocortex that had complete loops—namely inferotemporal cortex (IT) (ref). I was weirded out because I was thinking of the cortico-basal ganglia loops as part of an algorithm for choosing among multiple viable options—like what action to take, or what thought to think. Those choices tend to be made in the frontal lobe. IT, by contrast, is the home of visual object recognition, which I would think should not be a “choice”: Whatever the object is, that’s the right answer! So it seems like it calls for predictive (self-supervised) learning, not the loop algorithm.
Then I came across another weird thing! When the IT loops hit the striatum, that region is called “tail of the caudate”, and it’s one of the five regions flagged in a recent paper as an “aversion hot spot in the dopamine system”—i.e., it has elevated dopamine after bad things happen (when we traditionally expect low dopamine). The other four “aversion hot spot” regions, incidentally, are exactly the regions that I’ll talk about in the next section (dopamine for supervised learning of autonomic reactions), so that makes sense. But the IT loops need a different explanation.
So here’s my theory: IT is helping “choose” what object to attend to, within the visual field. If there’s a lion ready to pounce, almost perfectly hidden in the grass, and IT directs attention in a way that makes the subtle form of the lion visible ... well, seeing the lion is highly scary and aversive, so the high-level planner gets negative dopamine. But IT did the right thing here! Cha-ching, dopamine for IT! This sets up a kinda adversarial dynamic—IT focuses on anything in the visual field that might be dangerous or exciting, as judged by a different part of the brain, and the latter in turn then gets to hone its judgment on lots of edge-cases. This quasi-adversarial dynamic is good and healthy, and I think consistent with lived experience.
The fundamental difference between supervised learning and reinforcement learning is that in supervised learning, there’s a ground truth about “what you should have done”, whereas in reinforcement learning, there’s a ground truth about how successful some action was, but no specific advice about how to make it better. (Or in ML terminology, SL gets a loss gradient from each query, while RL doesn’t.)
I talked about this previously in Supervised Learning of Outputs in the Brain. I got some details wrong but I stand by the big picture I offered there. In particular, I think there are certain categories of telencephalon “outputs” for which the brainstem & hypothalamus can generate a “ground truth” after the fact about whether that output should have fired. For example, if you get whacked by a projectile, then your brainstem & hypothalamus can deduce that you should have flinched a moment earlier.
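To make the SL-versus-RL contrast concrete, here is a minimal sketch with a single adjustable weight. Everything here is an illustrative assumption on my part: the one-weight "circuit", the function names, and especially the RL rule, which is a generic perturbation-style update that I picked for simplicity, not a claim about what the brain does.

```python
def supervised_update(weight, cue, target, learning_rate=0.5):
    """SL: an after-the-fact ground truth says what the output should have been."""
    output = weight * cue
    error = target - output  # signed and specific: "you should have flinched"
    return weight + learning_rate * error * cue

def reinforcement_update(weight, cue, tried_output, reward, baseline, learning_rate=0.5):
    """RL: only a score for the output that was actually tried."""
    rpe = reward - baseline  # better or worse than expected, nothing more
    exploration = tried_output - weight * cue
    # Move toward the tried output if it beat expectations, away if it fell short.
    return weight + learning_rate * rpe * exploration * cue

flinch_weight = 0.0
for _ in range(20):
    # Whacked by a projectile (cue = 1.0); the hardcoded circuit says target = 1.0.
    flinch_weight = supervised_update(flinch_weight, cue=1.0, target=1.0)
```

The supervised version homes in on the right answer directly, because each whack tells it which way to move and how far. The RL version only learns something when it happens to try an output that differs from its default, which is one way of seeing why a ground-truth error signal is worth using wherever the brainstem & hypothalamus can supply one.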
I think “outputs for which a ground truth error signal is available after the fact” include “autonomic outputs”, “neuroendocrine outputs”, and “neurosecretory outputs”, but not “neuromuscular outputs”. I’m quite unsure that I’m drawing the line in the right place here—in fact, I don’t know what half those terms mean—but for the purposes of this blog post I’ll just use the term “autonomic outputs” as a stand-in for this whole category.
While SL is different from RL, if you go back to the Toy Loop Model above, you’ll see that it works for both, with only minor modifications:
Some comments on this:
(The hippocampus is in that diagram, sending autonomic output suggestions to the hypothalamus, but I don’t think the hippocampus involves the kind of supervised learning loop that I’m discussing here. I think it uses a different mechanism—like, the hippocampus stores a bunch of locations, and each is tagged with the autonomic outputs that have previously happened at that location, and that’s what it suggests. So, basically, more like a lookup table.)
I admit I’m pretty hazy on the details here—like, if there’s an amygdala loop for releasing cortisol, and if there’s also an agranular prefrontal cortex loop for releasing cortisol, then how exactly are they related? I made a suggestion in my diagram above, but that might not be right, or might not be the whole story.
As an example of my confusion in this area, S.M., a person supposedly missing her whole amygdala and nothing else, seems to have more-or-less lost the ability to have (and to understand in others) negative emotions, but not positive emotions. But AFAICT the amygdala can trigger both positive- and negative-emotion-related autonomic outputs! Weird.
Now that we’ve covered the RL loops and the SL loops, we’re finally ready to tackle the reward prediction error! And it’s easy! See figure and caption. Some comments on that:
Of course for me, everything is always ultimately about AGI safety, and so is this post. Let’s go back to “telencephalic learning-from-scratch-ism” above—and let’s gingerly set aside the possibility that that hypothesis is totally false….
Anyway, there’s a learning algorithm in our brain. It’s initialized from random weights (or something equivalent). It gets various input signals including sensory inputs, dopamine and other signals from the hypothalamus & brainstem system, and so on. You run the code for some number of years, and bam, that learning algorithm has built a competent, self-aware agent full of ideas, plans, goals, habits, and so on.
Now, sooner or later (no one knows when) we’ll learn to build “AGI”—by which I mean, for example, an AI system that could have written this entire blog post much better than me. And here’s one specific way we could get this kind of AGI: We could code up a learning algorithm similar to the one in the telencephalon, and give it the appropriate input signals, run it for some period of time, and there’s our AGI.
…I do think the innate hypothalamus-and-brainstem algorithm is kinda a big complicated mess, involving dozens or hundreds of things like snake-detector circuits, and curiosity, and various social instincts, and so on. And basically nobody in neuroscience, to my knowledge, is explicitly trying to reverse-engineer this algorithm. I wish they would! I would absolutely encourage neuroscientists to push it upwards on their research priority list. But the way things look right now, I'm pessimistic that we’ll have made much progress on that by the time we have telencephalon-like learning algorithms coded up and working better and better.
So then we’re at the scenario I wrote about in My AGI Threat Model: Misaligned Model-Based RL Agent: we will know how to make a “human-level-capable” learning algorithm, but we won’t know how to send it reward and other signals that sculpt the learning algorithm into having human-like instincts and drives and goals. So researchers will mess around with different simple reward functions—as researchers are wont to do—and they’ll wind up training superhuman AGIs with radically nonhuman drives and goals, and they’ll have no reliable techniques to set, or change, or even know what the AGIs’ goals are. You can read that post. I do not think that scenario will end well. I think it will end in catastrophe! The solution, I think, is to do focused research on what the reward function (and other signals) should be—perhaps modified from the human hypothalamus and brainstem algorithm, or else designed from scratch.
So what was the point of me writing this blog post? I wanted to better understand the telencephalon learning algorithm’s “API”. Like, what is the suite of input signals (other than sensory data) that guides this learning algorithm, and how do they work? I don’t have a complete answer yet—there are still plenty of brain connections where I don’t know what the heck they’re doing. But I think I’m making progress! And this post in particular (and work leading up to it) constitutes a noticeable change:
I guess the biggest change is that now I think of two ways for the brain to evaluate how good a plan is. The fast (parallelizable) way uses the striatum, and helps determine which plans can rise to full attention. Then the slow (serial) but more accurate way involves the plan rising to full attention, at which point a maybe dozens-of-dimensional vector of auxiliary information about this plan’s expected consequences is calculated—if I do this plan, should I raise my heart rate? Should I salivate? Should I cringe? Etc. That auxiliary data vector gives the hypothalamus & brainstem something to work with when evaluating the plan. Regular readers of course will connect this to my earlier post Inner Alignment in Salt-Starved Rats, where I was relying on a mechanism like this to explain some animal experiments. Or think of the famous Iowa gambling task; here you can watch in real time (using skin conductance) as the supervised learning algorithm gradually improves the accuracy of the auxiliary data vector associated with each of two choices, until eventually the auxiliary data vector provides a clear enough signal to guide decision-making.
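To make the fast-versus-slow distinction concrete, here is a toy sketch. Everything in it is a hypothetical illustration rather than a claim about neuroanatomy: the `Plan` class, the plan names, the numbers, and the `INNATE_WEIGHTS` table (a stand-in for the hypothalamus & brainstem). A cached value shortlists plans, and only shortlisted plans get the slower evaluation based on their auxiliary data vector.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    name: str
    cached_value: float  # fast, possibly stale estimate of how good the plan is
    auxiliary: dict = field(default_factory=dict)  # predicted consequences of the plan

# Hypothetical innate scoring weights, which depend on the body's current state.
INNATE_WEIGHTS = {
    "normal": {"sweet_taste": 1.0, "salty_taste": -0.5},
    "salt_deprived": {"sweet_taste": 1.0, "salty_taste": 2.0},
}

def fast_filter(plans, k):
    """Fast, parallelizable step: keep the k plans with the highest cached value."""
    return sorted(plans, key=lambda p: p.cached_value, reverse=True)[:k]

def slow_score(plan, body_state):
    """Slow, serial step: score the plan's auxiliary data vector innately."""
    weights = INNATE_WEIGHTS[body_state]
    return sum(weights.get(key, 0.0) * amount for key, amount in plan.auxiliary.items())

def choose_plan(plans, body_state, k=2):
    """Shortlist by cached value, then pick the best plan by the slow score."""
    shortlist = fast_filter(plans, k)
    return max(shortlist, key=lambda p: slow_score(p, body_state))

PLANS = [
    Plan("groom", 0.05, {}),
    Plan("press sugar lever", 0.6, {"sweet_taste": 0.9}),
    Plan("press salt lever", 0.2, {"salty_taste": 0.9}),
]
```

In this toy setup, the salt lever wins under `"salt_deprived"` even though its cached value is low, which is the flavor of the salt-starved-rats story. It only wins if it survives the fast filter, though: with `k=1` the sugar lever is the only plan that ever gets the slow evaluation.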
Update: see also follow-up A model of decision-making in the brain (the short version), where I put this diagram:
Like, I was thinking I should use this terminology:
These are listed in increasing order of, um, “fidelity” to the “intentions” of the innate hypothalamus & brainstem algorithm—and consequently the information tends to flow in the opposite direction, from third bullet to second to first. So “cached value function” is trained to imitate “actual value function”, and “actual value function” is ultimately reliant on information flowing from “reward function”. (I'm not sure this terminology is quite right, and also note that there isn't a sharp line between "actual value function" and "reward function".)
Anyway, when we think about how to control AGIs, the idea of “auxiliary data vectors” seems like an awfully important thing to keep in mind! What kinds of auxiliary data can we use to better understand and control our future telencephalon-like-learning-algorithm AGIs, and where do we get the ground truth to train the auxiliary-data-calculating subsystems? Umm, beats me, but it seems like an important question, and I’m still thinking about it, and let me know if you have any ideas.
Likewise, the hypothalamus and brainstem can output multiple region-specific RPEs, not just one. (They probably output various hyperparameter-modulating signals too.) Once again, our future telencephalon-like-learning-algorithm AGI will presumably also be trained by sending different RPEs to different sub-networks. We’re the programmers, so we get to decide exactly what RPEs go to what sub-networks. Is there a scheme like that which will help keep our powerful AGIs reliably under human control? Once again, beats me, but I’m still thinking about it. And you can too! I’m probably the only person on earth being paid to think specifically about how to safely control telencephalon-like-learning-algorithm AGIs, and god knows I’m not gonna figure it out myself, I’m in way the hell over my head.
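As a very rough sketch of what "different RPEs to different sub-networks" could mean in code: each sub-network keeps its own prediction and is updated only by its own error signal. The function name, the region names, and the simple delta-rule update are all assumptions made for illustration.

```python
def region_specific_update(predictions, outcomes, learning_rate=0.1):
    """Compute one RPE per region and nudge each region's prediction by its own RPE.

    predictions, outcomes: dicts keyed by (hypothetical) sub-network name.
    Returns (updated_predictions, rpes).
    """
    rpes = {region: outcomes[region] - predictions[region] for region in predictions}
    updated = {
        region: predictions[region] + learning_rate * rpes[region]
        for region in predictions
    }
    return updated, rpes

predictions = {"motor_subnetwork": 0.0, "planning_subnetwork": 1.0}
outcomes = {"motor_subnetwork": 1.0, "planning_subnetwork": 1.0}
updated, rpes = region_specific_update(predictions, outcomes)
```

Here only the motor sub-network gets a nonzero error, so only its prediction moves. A single global RPE could not make that distinction.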
(If thinking directly about how to control futuristic powerful AGIs isn’t your cup of tea—and I admit it would be a tough sell on your next NIH grant application—at least let’s reverse-engineer “The Human Reward Function”, i.e. that innate hypothalamus & brainstem algorithm I keep talking about. That’s a conventional neuroscience research program, and it definitely helps the cause! Like, try writing down the part of the reward function that leads to jealousy. I don't think it's obvious! Remember, you’re not allowed to directly use common-sense concepts as ingredients in the reward function; the function needs to be built entirely out of information that the hypothalamus and brainstem have access to.)
(This post has a supplement here. Please leave comments at the lesswrong crosspost, or email me.)
What are the errors in this essay? As I'm reading through the Brain-like AGI sequence I keep seeing this post being referenced (but this post says I should instead read the sequence!)
I would really like to have a single reference post of yours that contains the core ideas about phasic dopamine rather than the reference being the sequence posts (which are heavily dependent on a bunch of previous posts; also Posts 5 and 6 feel more high-level than this one?)
I think that if you read the later Intro to Brain-Like AGI Safety series, then the only reason you might want to read this post (other than historical interest) is that the section “Dopamine category #2: RPE for “local” sub-circuit rewards” is talking about a topic that was omitted from Intro to Brain-Like AGI Safety (for brevity).
For example, practically everything I said about neuroanatomy in this post is at least partly wrong and sometimes very wrong. (E.g. the “toy loop model” diagrams are pretty bad.) The “Finally, the “prediction” part of reward prediction error” section has a very strange proposal for how RPE works; I don’t even remember why I ever believed that.
The main strengths of the post are the “normative” discussions: why might supervised learning be useful? why might more than one reward signal be useful? etc. I mostly stand by those. I also stand by “learning from scratch” being a very useful concept, and elaborated on it much more later.
Awesome post! I happen to also have tried to distill links between RPE and phasic dopamine in the "Prefrontal Cortex as a Meta-RL System" of this blog.
In particular I reference this paper on DL in the brain & this other one for RL in the brain. Also, I feel like the part 3 about links between RL and neuro of the RL book is a great resource for this.
If you Ctrl-F the post you'll find my little paragraph on how my take differs from Marblestone, Wayne, Kording 2016.
I haven't found "meta-RL" to be a helpful way to frame either the bandit thing or the follow-up paper relating it to the brain, more-or-less for reasons here, i.e. that the normal RL / POMDP expectation is that actions have to depend on previous observations—like think of playing an Atari game—and I guess we can call that "learning", but then we have to say that a large fraction of every RL paper ever is actually a meta-RL paper, and more importantly I just don't find that thinking in those terms leads me to a better understanding of anything, but whatever, YMMV.
I don't agree with everything in the RL book chapter but it's still interesting, thanks for the link.
Right I just googled Marblestone and so you're approaching it with the dopamine side and not the acetylcholine. Without debating about words, their neuroscience paper is still at least trying to model the phasic dopamine signal as some RPE & the prefrontal network as an LSTM (IIRC), which is not acetylcholine based. I haven't read in detail this post & the one linked, I'll comment again when I do, thanks!
This was an amazing article, thank you for posting it!
The above isn't quite true in all senses in all RL algorithms. For example, in policy gradient algorithms (http://www.scholarpedia.org/article/Policy_gradient_methods for a good but fairly technical introduction) it is quite important in practice to subtract a baseline value from the reward that is fed into the policy gradient update. (Note that the baseline can be and most profitably is chosen to be dynamic - it's a function of the state the agent is in. I think it's usually just chosen to be an estimate of the state value V(s), i.e. the expectation of Q(s,a) over the actions the current policy would take, rather than max Q(s,a).) The algorithm will in theory converge to the right value without the baseline, but subtracting the baseline speeds convergence up significantly. If one guesses that the brain is using a policy-gradients-like algorithm, a similar principle would presumably apply. This actually dovetails quite nicely with observed human psychology - good/bad/neutral is a thing, but it seems to be defined largely with respect to our expectation of what was going to happen in the situation we were in. For example, many people get shitty when it turns out they aren't going to end up having sex that they thought they were going to have - so here the theory would be that the baseline value was actually quite high (they were anticipating a peak experience) and the policy gradients update will essentially treat this as an aversive stimulus, which makes no sense without the existence of the baseline.
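The baseline point can be checked on a toy problem. The sketch below assumes a one-step, three-armed bandit with a softmax policy. The rewards, the function names, and the choice of the policy's expected reward as the baseline are illustrative assumptions. It computes the exact mean and total variance of the REINFORCE gradient sample `(reward - baseline) * grad log pi(action)`.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    # For a softmax policy: d/d logit_k of log pi(action) = 1[k == action] - pi(k)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def gradient_stats(logits, rewards, baseline):
    """Exact mean vector and summed per-coordinate variance of the gradient sample."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = [0.0] * n
    for action, p in enumerate(probs):
        g = grad_log_prob(probs, action)
        for k in range(n):
            sample = (rewards[action] - baseline) * g[k]
            mean[k] += p * sample
            second_moment[k] += p * sample ** 2
    variance = sum(s - m * m for s, m in zip(second_moment, mean))
    return mean, variance

logits = [0.0, 0.0, 0.0]
rewards = [10.0, 11.0, 12.0]
mean_plain, var_plain = gradient_stats(logits, rewards, baseline=0.0)
mean_base, var_base = gradient_stats(logits, rewards, baseline=11.0)  # expected reward
```

The two mean gradients agree, so the baseline does not bias the update, but the variance with the baseline is far smaller. With the baseline at 11, the reward of 10 produces a negative learning signal even though it is a positive reward, which is the "defined with respect to expectation" effect described above.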
It's closer to being true of Q-learning algorithms, but here too there is a catch - whatever value you assign to never-before-seen states can have a pretty dramatic effect on exploration dynamics, at least in tabular environments (i.e. environments with negligible generalization). So here too one would expect that there is an evolutionarily appropriate level of optimism to apply to genuinely novel situations about which it is difficult to form an a priori judgment, and the difference between this and the value you assign to known situations is at least probably known-to-evolution.
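The effect of the initial value on exploration shows up even in the smallest tabular case. Below is a toy sketch, assuming a purely greedy agent on a deterministic two-armed bandit. The rewards, step count, and learning rate are made up for illustration.

```python
def run_greedy_bandit(initial_value, rewards=(1.0, 2.0), steps=20, learning_rate=0.5):
    """Greedy tabular agent; returns how many times each arm was pulled."""
    q = [initial_value] * len(rewards)
    pulls = [0] * len(rewards)
    for _ in range(steps):
        # Pick the arm with the highest current estimate (ties go to the lowest index).
        arm = max(range(len(q)), key=lambda a: q[a])
        pulls[arm] += 1
        # Move the estimate for the chosen arm toward the observed reward.
        q[arm] += learning_rate * (rewards[arm] - q[arm])
    return pulls

pessimistic_pulls = run_greedy_bandit(initial_value=0.0)
optimistic_pulls = run_greedy_bandit(initial_value=5.0)
```

Starting from 0, the agent tries arm 0, finds it better than its estimate of the untried arm, and never explores again. Starting from an optimistic 5, every arm is disappointing at first, so both get tried and the agent settles on the better one.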
That's interesting, thanks!
good/bad/neutral is a thing, but it seems to be defined largely with respect to our expectation of what was going to happen in the situation we were in.
I agree that this is a very important dynamic. But I also feel like, if someone says to me, "I keep a kitten in my basement and torture him every second of every day, but it's no big deal, he must have gotten used to it by now", I mean, I don't think that reasoning is correct, even if I can't quite prove it or put my finger on what's wrong. I guess that's what I was trying to get at with that "evolutionary prior" comment: maybe there's a hardcoded absolute threshold such that you just can't "get used to" being tortured, and set that as your new baseline, and stop actively disliking it? But I don't know, I need to think about it more, there's also a book I want to read on the neuroscience of pleasure and pain, and I've also been meaning to look up what endorphins do to the brain. (And I'm happy to keep chatting here!)
I don't have a full explanation of comparing-to-baseline. At first I was gonna say "it's just the reward-prediction-error thing I described: if you expect candy based on your beliefs at 5:05:38, and then you no longer expect candy based on your beliefs at 5:05:39, then that's a big negative reward prediction error. (Because the reward-predictor makes its prediction based on slightly-stale brain status information.)" But that doesn't explain why maybe we still feel raw about it 3 minutes later. Maybe it's like, you had this active piece-of-a-thought "I'm gonna get candy", but it's contradicted by the other piece-of-a-thought "no I'm not", but that appealing piece-of-a-thought "I'm gonna get candy" keeps popping back up for a while, and then keeps getting crushed by reality, and the net result is a bad feeling. Or something? I dunno.
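For concreteness, the candy example can be written in the undiscounted TD form used at the top of the post. The function name and the values 1.0 and 0.0 are made-up illustrations.

```python
def reward_prediction_error(reward, value_now, value_before):
    """Undiscounted TD error: what you got, plus what you now expect,
    minus what you expected a moment ago."""
    return reward + value_now - value_before

# 5:05:38 -> 5:05:39: no candy arrives, and the expectation of candy collapses.
rpe_at_disappointment = reward_prediction_error(
    reward=0.0, value_now=0.0, value_before=1.0
)

# A moment later the stale prediction has caught up, so the error is gone.
rpe_afterwards = reward_prediction_error(reward=0.0, value_now=0.0, value_before=0.0)
```

The first error is -1.0 and the second is 0.0. On this account the negative signal lasts only for the moment of the update, which is why it does not by itself explain still feeling raw minutes later.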
Oh, I think there's also a thing where the brainstem can force the high-level planner to think about a certain thing; like if you get poked on the shoulder it's kinda impossible to ignore. I think I have an idea of what mechanism is involved here … involving acetylcholine and how specific and confident the top-down predictions are, I'm hoping to write this up soon … That might be relevant too. Like if you're being tortured then you can't think about anything else, because of this mechanism. Then that would be like an objective sense in which you can't get used to a baseline of torture the way you can get used to other things.
One thing that strikes me as odd about this model is that it doesn't have the blessing of dimensionality - each plan is one loop, and evaluating feedback to a winning plan just involves feedback to one loop. When it's general reward we can simplify this with just rewarding recent winning plans, but in some places it seems like you do imply highly specific feedback, for which you need N feedback channels to give feedback on ~N possible plans. The "blessing of dimensionality" kicks in when you can use more diverse combinations of a smaller number of feedback channels to encode more specific feedback.
Maybe what seems to be specific feedback is actually a smaller number of general types? Like rather than specific feedback to snake-fleeing plans or whatever, a broad signal (like how Success-In-Life Reward is a general signal rewarding whatever just got planned) could be sent out that means "whatever the amygdala just did to make the snake go away, good job" (or something). Note that I have no idea what I'm talking about.
Right, so I'm saying that the "supervised learning loops" get highly specific feedback, e.g. "if you get whacked in the head, then you should have flinched a second or two ago", "if a salty taste is in your mouth, then you should have salivated a second or two ago", "if you just started being scared, then you should have been scared a second or two ago", etc. etc. That's the part that I'm saying trains the amygdala and agranular prefrontal cortex.
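Here is a minimal sketch of that kind of hindsight supervision, assuming a linear predictor trained with a plain delta rule. The feature names, learning rate, and episode format are hypothetical. Each episode pairs the context from a moment ago with the ground truth that arrived afterwards, and the error term plays the role of the "you should have flinched" signal.

```python
def train_flinch_predictor(episodes, learning_rate=0.5, epochs=50):
    """episodes: list of (context_features, whacked_soon_after) pairs."""
    weights = {}
    for _ in range(epochs):
        for context_features, whacked_soon_after in episodes:
            prediction = sum(weights.get(f, 0.0) for f in context_features)
            target = 1.0 if whacked_soon_after else 0.0
            error = target - prediction  # hindsight signal: "should have flinched"
            for f in context_features:
                weights[f] = weights.get(f, 0.0) + learning_rate * error
    return weights

def predict_flinch(weights, context_features):
    return sum(weights.get(f, 0.0) for f in context_features)

EPISODES = [
    (["ball_incoming", "outdoors"], True),   # got whacked shortly after this context
    (["outdoors"], False),                   # nothing happened
]
weights = train_flinch_predictor(EPISODES)
```

After training, the predictor flinches for `ball_incoming` but not for merely being outdoors. This is a narrow, specific training signal, unlike a 1D reward that guides search over plans.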
Then I'm suggesting that the Success-In-Life thing is a 1D reward signal to guide search in a high-dimensional space of possible thoughts to think, just like RL. In this case, it's not "each plan is one loop", because there's a combinatorial explosion of possible thoughts you can think, and there are not enough loops for that. (It also wouldn't work because for pretty much every thought you think, you've never thought that exact thought before—like you've never put on this particular jacket while humming this particular song and musing about this particular upcoming party...) Instead I think compositionality is involved, such that one plan / thought can involve many simultaneous loops.
How does the section of the amygdala that a particular dopamine neuron connects to even get trained to do the right thing in the first place? It seems like there should be enough chance in connections that there's really only this one neuron linking a brainstem's particular output to this specific spot in the amygdala - it doesn't have a whole bundle of different signals available to send to this exact spot.
SL in the brain seems tricky because not only does the brainstem have to reinforce behaviors in appropriate contexts, it might have to train certain outputs to correspond to certain behaviors in the first place, all with only one wire to each location! Maybe you could do this with a single signal that means both "imitate the current behavior" and also "learn to do your behavior in this context"? Alternatively we might imagine some separate mechanism for priming the developing amygdala to start out with a diverse yet sensible array of behavior proposals, and the brainstem could learn what its outputs correspond to and then signal them appropriately.
I'm proposing that (1) the hypothalamus has an input slot for "flinch now", (2) VTA has an output signal for "should have flinched", (3) there is a bundle of partially-redundant side-by-side loops (see the "probability distribution" comment) that connect specifically to both (1) and (2), by a genetically-hardcoded mechanism.
I take your comment to be saying: Wouldn't it be hard for the brain to orchestrate such a specific pair of connections across a considerable distance?
Well, I'm very much not an expert on how the brain wires itself up. But I think there's gotta be some way that it can do things like that. I feel like those kinds of feats of wiring are absolutely required for all kinds of reasons. Like, I think motor cortex connects directly to spinal hand-control nerves, but not foot-control nerves. How do the output neurons aim their paths so accurately, such that they don't miss and connect to the foot nerves by mistake? Um, I don't know, but it's clearly possible. "Molecular signaling" or something, I guess?
Alternatively we might imagine some separate mechanism for priming the developing amygdala to start out with a diverse yet sensible array of behavior proposals, and the brainstem could learn what its outputs correspond to and then signal them appropriately.
Hmm, one reasonable (to me) possibility along these lines would be something like: "VTA has 20 dopamine output signals, and they're guided to wind up spread out across the amygdala, but not with surgical precision. Meanwhile the corresponding amygdala loops terminate in an "input zone" of the lateral hypothalamus, but not to any particular spot, instead they float around unsure of exactly what hypothalamus "entry point" to connect to. And there are 20 of these intended "entry points" (collections of neurons for flinching, scowling, etc.). OK, then during embryonic development, the entry-point neurons are firing randomly, and that signal goes around the loop—within the hypothalamus and to VTA, then up to the amygdala, then back down to that floating neuron. Then Hebbian learning—i.e. matching the random code—helps the right loop neuron find its way to the matching hypothalamus entry point."
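The random-code idea can be sketched as a toy simulation. All of it is a hypothetical illustration: the function name, the binary codes, the noiseless signal, and the "connect to the entry point with the most co-activity" rule standing in for Hebbian learning.

```python
import random

def wire_loops(n_entry_points=20, n_steps=200, seed=0):
    """Return {loop_id: entry_point_id} chosen by Hebbian-style co-activation."""
    rng = random.Random(seed)
    # Each entry point fires its own random on/off pattern during development.
    codes = [
        [rng.randint(0, 1) for _ in range(n_steps)] for _ in range(n_entry_points)
    ]
    wiring = {}
    for loop_id in range(n_entry_points):
        # Signal reaching the floating axon after travelling around its own loop
        # (assumed noiseless here for simplicity).
        signal = codes[loop_id]
        # Hebbian score: how often the axon and each candidate target fire together.
        coactivity = [
            sum(pre * post for pre, post in zip(signal, code)) for code in codes
        ]
        wiring[loop_id] = max(range(n_entry_points), key=lambda j: coactivity[j])
    return wiring

wiring = wire_loops()
```

With long enough random codes, each loop's axon is most co-active with the entry point that started its own loop, so every loop closes on the right target without any target-specific guidance.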
I'm not sure if that's exactly what you're proposing, but that seems like a perfectly plausible way for the brain to orchestrate these connections during embryonic development. I do have a hunch that this isn't what happens, that the real mechanism is "molecular signaling" instead. But like I said, I'm not an expert, and I certainly wouldn't be shocked to learn that the brain embryonic wiring mechanism involves this kind of thing where it closes a loop by sending a random code around the loop and Hebbian-learning the final connection.
I enjoy that you have an algorithm which presumes the existence of some hypothetical mechanism, whereas researchers in labs have been elucidating these mechanisms for years without any necessarily coherent vision of agentic architectures <3
I think there's gotta be some way that it can do things like that. I feel like those kinds of feats of wiring are absolutely required for all kinds of reasons. Like, I think motor cortex connects directly to spinal hand-control nerves, but not foot-control nerves. How do the output neurons aim their paths so accurately, such that they don't miss and connect to the foot nerves by mistake? Um, I don't know, but it's clearly possible. "Molecular signaling" or something, I guess?
It's like you don't know about keywords like "growth cone" or "chemotaxis" or attempts to visualize chemoattractant gradients!
One of the main idioms of brain wiring is basically for axon tips to do chemotaxis (often through various way stations, in sequence) and then if they find the right home base they notice and "decide" to survive, and otherwise they commit suicide and have to be cleaned up (probably to save on neural metabolic demands? and/or to reduce noise?) but then it seems like maybe there are numerous similar systems all kind of working in parallel, each with little details like the "homotopic connections" between each spot in one hemisphere and its rough cognate in the other hemisphere, through the corpus callosum?
The normal way it works, I think, is for people to get the big picture wiring diagram by simply looking, and then do biochemistry and so on, and then back their way into vague hunches about what algorithms could be consistent with such diagrams and mechanisms? You seem to be going in "algorithms first" instead :-)
Thanks!! And thanks for the wiring references! Such intricate complexity everywhere you look! Sometimes I wonder "how is there so much to say about neuroscience that we can write 50,000 neuroscience papers each year, year after year?", and then I see stuff like this and say "Oh, that's how." :-P
I haven't read the entire thing yet, so maybe I am missing something, but isn't the globus pallidus inhibitory? In this post you stated that it amplifies signals. There should be a path from the cortex to the subthalamic nucleus where the globus pallidus shuts down the idea fed to the striatum. I like to think of the striatum as the dad, and the globus pallidus as the mom. Then the cortex is the idea that the kids (thalamus) come up with. The dad and mom both see the idea, and the mom always says no, unless the dad convinces her to do the thing. The internal globus pallidus can also send the thalamus to time out to suppress new ideas.
My main answer is that the “toy loop model”s here are pretty bad and shouldn’t be taken literally. I have an updated discussion here (posts 5-6 mostly), but even that has some issues; I made more progress in the last six months that I haven’t written up yet.
I’m more confident in the “‘Context’ in the striatum value function” section here. The convergence of many striatal neurons onto few “final answer” neurons (in both pallidum and SNr) seems pretty central to me. Kinda vaguely like the striatum is the final hidden layer, and pallidum / SNr / whatever neurons are “heads”, in a loose ML analogy.
To answer your question slightly, I’m working at a pretty high level (Marr’s “algorithm level”, I suppose) here. It’s possible to have a signal which is best thought of as exciting something, but is actually implemented by an inhibitory connection. For example, it could be “disinhibitory” (inhibiting an inhibitor). Swanson 2000 does indeed claim that pallidum-to-brainstem signals are disinhibitory, specifically by inhibiting the inhibitory striatum-to-brainstem signals.
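A toy rate model makes the disinhibition point concrete. It is a deliberately crude sketch, with made-up function names, made-up rates, and a simple subtract-and-clip rule: the pallidum signal is inhibitory at the synapse, yet its net effect on the brainstem target is excitatory because it suppresses an inhibitor.

```python
def clip_rate(x):
    """Firing rates cannot go below zero."""
    return max(0.0, x)

def brainstem_drive(baseline, striatum_rate, pallidum_rate):
    # The pallidum inhibits the inhibitory striatum-to-brainstem signal...
    effective_striatal_inhibition = clip_rate(striatum_rate - pallidum_rate)
    # ...so whatever striatal inhibition is left over is what reaches the brainstem target.
    return clip_rate(baseline - effective_striatal_inhibition)

drive_pallidum_silent = brainstem_drive(baseline=1.0, striatum_rate=0.8, pallidum_rate=0.0)
drive_pallidum_active = brainstem_drive(baseline=1.0, striatum_rate=0.8, pallidum_rate=0.8)
```

Raising the pallidum rate raises the brainstem drive, even though every connection in the model is inhibitory.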
But anyway, yeah, I would read this post as kinda “early attempt” rather than correct. A lot of the details are very much wrong. I’ll make the top-note more prominent.
But in experiments, they’re not synchronized; the former happens faster than the latter.
This has the effect of incentivizing learning, right? (A system that you don't yet understand is, in total, more rewarding than an equally yummy system that you do understand.) So it reminds me of exploration in bandit algorithms, which makes sense given the connection to motivation.
Hmm, I guess I mostly disagree because:
I guess my sense is that most biological systems are going to be 'package deals' instead of 'cleanly separable' as much as possible--if you already have a system that's doing learning, and you can tweak that system in order to get something that gets you some of the benefits of a VoI framework (without actually calculating VoI), I expect biology to do that.
I agree about the general principle, even if I don't think this particular thing is an example because of the "not maximizing sum of future rewards" thing.
Very interesting! I don't know much about the brain but I think this post did a good job of explaining the concept and showing its importance. I wonder how the brain does this with neuroplasticity. I've read this article from MIT about researchers rewiring eye inputs to the audio processing parts of the brain. Would the hypothalamus have that hyper-prior on what eye data "looks like" and create loops and systems that could decode that data and reintegrate it with the undamaged processing systems? Could an AI system just as easily create or re-use existing substructures within its code? I'm too new to ML to know if models can add layers during deployment, or how generalizability could be achieved within neural networks past training.
Cheers for the post, I find the whole series fascinating.
One thing I was particularly curious about is how these 'proposals' are made. Do you have a picture of what kind of embedding is used to present a potential action?
For example, is a proposal encoded in the activations of a set of neurons that are isomorphic to the motor neurons, so that it could then propose tightening a set of finger muscles through specific neurons? Or is the embedding jointly learned between the two in some large unstructured connection, or smaller latent space, or something completely different?
The least-complicated case (I think) is: I (tentatively) think that the hippocampus is more-or-less a lookup table with a finite number of discrete thoughts / memories / locations / whatever (the type of content is different in different species), and a "proposal" is just "which of the discrete things should be activated right now".
A medium-difficulty case is: I think motor cortex stores a bunch of sequences of motor commands which execute different common action sequences. (I'm a believer in the Graziano theory that primary motor cortex, secondary motor cortex, supplementary motor cortex, etc. etc., are all doing the same kind of thing and should be lumped together.) The exact details of the data structures that the brain uses to store these sequences of motor commands are controversial and I don't want to get into it here…
Then the hardest case is the areas that "think thoughts", spawn new ideas, etc., all the cool stuff that leads to human intelligence. (e.g. dorsolateral prefrontal cortex I think.) Things like "I'm going to go to the store" or "what if I differentiate both sides of the equation?". Those things are clearly not isomorphic to a sequence of motor commands. It's higher-level than that. Again, the exact data structures and algorithms involved in representing and searching for these "thoughts" are a very big and controversial topic that I don't want to get into here…