The idea I do like
There's an idea I'm very fond of where:
- The neocortex (well, the telencephalon and thalamus, but let’s just call it “neocortex” for short) holds all of our world-knowledge, consciousness, intelligence, planning, reasoning, etc.
- The brainstem (well, brainstem & hypothalamus, but I’ll just call it “brainstem” for short) is full of lots of circuits that say “eating chocolate is good”, “leaning over a precipice is scary”, etc. (It also regulates your heart-rate and so on.)
This idea has gotten a bad rap from its association with discredited “triune brain theory”. Relatedly, people persist in describing the brainstem with scientifically-inaccurate nicknames like “old brain”, or “lizard brain”, or “reptilian brain”, etc. (See my discussion here.) I’ve even heard the brainstem called “monkey brain”—as if monkeys didn’t have a neocortex?! (Elon Musk is guilty of this, see this interview - 14:45.)
Still, despite its bad rap, I very much like this idea, and am trying to salvage its reputation.
The trope I don’t like
There’s a trope that goes along with this idea in the popular imagination. It goes something like this:
- The brainstem is the source of base and dishonorable goals like “I want to eat candy and watch TV”
- The neocortex is the source of respectable and honorable goals like “I want to unravel the mysteries of the universe”, “I want to get off my lazy butt and go to the gym”, etc.
For example, a couple places where I’ve recently seen this trope are Jeff Hawkins’s recent book and an old LessWrong post The AI Alignment Problem Has Already Been Solved(?) Once.
I don't like this trope. I propose to throw it into the garbage, right next to the “lizard brain” terminology.
That said, it's not pulled out of thin air. People got the idea from somewhere—it’s gesturing towards a real thing, and I will talk about what I think that is below.
Why I think this trope is basically wrong
I’m a big believer in within-lifetime reinforcement learning as one of the key drivers of cognition (see Big Picture of Phasic Dopamine). So maybe 5 times per second, you think a little thought or make a little plan (“I’m going to pick up this pencil”, “I’m going to rewrite this sentence”, etc.). That thought / plan gets evaluated by various other parts of your brain, culminating in the brainstem, which then issues a reward pertaining to that thought. That reward drives both decision-making (e.g. you won't pick up the pencil if that plan is judged as bad) and gradual learning to think better thoughts in the future.
There’s no room in this RL-type process for “motivations that come from inside the learning algorithm”. Like, that’s just not where motivations come from! Motivations come from rewards. Rewards are calculated in the brainstem. It’s as simple as that.
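To make the loop concrete, here is a toy sketch of the propose / evaluate / learn cycle. Everything in it is my own illustrative assumption, not a claim about real neural circuitry: the thought list, the reward numbers, the softmax-style proposal rule, and the names `brainstem_reward` and `run_loop` are all made up. The one thing it is meant to show is that reward gets computed in exactly one place, and the learner only ever reacts to it.

```python
import math
import random

# Hypothetical, hand-picked rewards standing in for the brainstem's verdicts.
BRAINSTEM_REWARD = {
    "pick up this pencil": 0.4,
    "rewrite this sentence": 0.7,
    "stare at the wall": -0.2,
}

def brainstem_reward(thought):
    """The only place in this sketch where reward gets calculated."""
    return BRAINSTEM_REWARD[thought]

def run_loop(steps, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    thoughts = list(BRAINSTEM_REWARD)
    # Learned propensity to propose each thought; starts out neutral.
    propensity = {t: 0.0 for t in thoughts}
    executed = []
    for _ in range(steps):
        # The learner proposes a thought, favoring ones that paid off before.
        weights = [math.exp(propensity[t]) for t in thoughts]
        thought = rng.choices(thoughts, weights=weights)[0]
        reward = brainstem_reward(thought)
        # Decision-making: only positively judged plans get carried out.
        if reward > 0:
            executed.append(thought)
        # Gradual learning: nudge the propensity toward the reward received.
        propensity[thought] += learning_rate * (reward - propensity[thought])
    return propensity, executed

propensity, executed = run_loop(steps=200)
```

Note that the learner has no motivations of its own here: delete `brainstem_reward` and there is nothing left to decide what gets executed or learned.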
The kernel of truth
I think this wrong idea is pointing at a real phenomenon, and I'm going to try to explain it.
1. The neocortex can issue “the same plan” (same sequence of motor actions) framed in many different ways, and those different framings will get different rewards from the brainstem.
The brainstem is judging thoughts / plans, not actual "futures". It’s not omniscient! It's judging a map, not a territory! So you can have different ways to think about the same plan which result in different brainstem rewards.
- You can have a thought "I will go to the gym" which is negative-reward, and so you don't do it.
- Or you can have a thought "I will go to the gym and thus be healthy" which mixes negative and positive aspects, so maybe it's positive-reward on net, and so you do it!
There are within-neocortex dynamics that determine which of those two options gets proposed to the brainstem.
…And therefore there’s a sense in which the neocortex “gets credit” for the fact that you do in fact go to the gym.
…As long as we don’t forget that it’s the brainstem that judges “I will go to the gym” as bad, and it’s the brainstem that judges “I will be healthy” as good, and it’s the brainstem that judges the combined thought “I will go to the gym and thus be healthy” as slightly-good, and thus it’s the brainstem that enables that plan to actually get executed.
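Here is a minimal sketch of that point, assuming (purely for illustration) that the brainstem scores a thought by adding up the valences of its pieces. The numbers and the names `VALENCE`, `brainstem_score`, and `gets_executed` are hypothetical; the real evaluation is surely not a simple sum.

```python
# Hypothetical valences the brainstem assigns to thought-pieces (arbitrary units).
VALENCE = {"go to the gym": -3, "be healthy": 4}

def brainstem_score(thought_pieces):
    """Assumed additive scoring: the brainstem judges the thought, not the future."""
    return sum(VALENCE[piece] for piece in thought_pieces)

def gets_executed(thought_pieces):
    return brainstem_score(thought_pieces) > 0

# Same sequence of motor actions, two different framings.
framing_1 = ["go to the gym"]
framing_2 = ["go to the gym", "be healthy"]

print(brainstem_score(framing_1), gets_executed(framing_1))  # -3 False
print(brainstem_score(framing_2), gets_executed(framing_2))  # 1 True
```

The neocortex chose which framing to propose, but every number that decided the outcome lives in `VALENCE`, i.e. on the brainstem side.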
2. Some plans are rejected by the brainstem, but they have a desirable aspect / consequence, and then the neocortex may be rewarded for dwelling on the desirable aspect, and may run a search algorithm for ways to make that aspect actually happen
Continuing with our example, let’s say the brainstem is getting signals that the body is tired, so any plan that involves exertion loses a lot of points. It loses so many points that the trick from above no longer works; the “I will go to the gym and thus be healthy” thought is now net-negative (compared to its alternatives). So I’m not going to the gym.
Nevertheless, the thought “I will be healthy” continues to be appealing. So the neocortex—ever the goal-seeking algorithm—keeps searching, trying to construct a plan that will result in “I will be healthy” actually happening.
Hmm, is there a way that I can get healthy while continuing to sit on the couch? Nope.
OK, what about piling more positive-reward thought pieces into the mix? How about this plan: “I will go to the gym, and thus be healthy, and also I’ll be the kind of person who follows through on my commitments, and also I’ll impress my friends with my rock-solid abs, etc. etc.” Again, it’s the same sequence of motor actions, but this thought is framed to make it even more appealing to the brainstem. Nope, the brainstem still says “all things considered, this is still a bad plan”. And I notice that I am still sitting on the couch.
Anyway, that neocortex search process that I'm talking about here gets mapped intuitively into “my neocortex wants to be healthy, but my brainstem wants to stay on the couch”.
That’s not literally true: my brainstem wants both things! My brainstem’s endorsement of “I want to be healthy” is critical here—it’s powering the ongoing neocortex search algorithm activity! (Again, in my view, the neocortex won’t take any actions or even think any thoughts without the brainstem rewarding it for doing so.) But it’s also my brainstem’s dislike of getting off the couch that is causing the search to come up empty.
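The couch scenario can be sketched the same way. This is a toy model under my own assumptions: additive scoring, a flat `FATIGUE_PENALTY` on exertion, and made-up valences. The function `search_for_framing` is hypothetical too. It shows both halves of the brainstem's role: the positive valence of “be healthy” is what keeps the search alive, and the fatigue penalty is what makes it come up empty.

```python
# Hypothetical valences (arbitrary units); none of these numbers are measured.
VALENCE = {
    "go to the gym": -3,
    "be healthy": 4,
    "follow through on my commitments": 2,
    "impress my friends": 1,
}
FATIGUE_PENALTY = 6  # assumed extra cost on any exertion plan when the body is tired

def brainstem_score(pieces, tired):
    score = sum(VALENCE[piece] for piece in pieces)
    if tired and "go to the gym" in pieces:
        score -= FATIGUE_PENALTY
    return score

def search_for_framing(action, goal, extras, tired):
    """Try ever-richer framings of the same action until one passes, or give up."""
    if VALENCE[goal] <= 0:
        return None  # the brainstem doesn't like the goal, so nothing powers the search
    framing = [action, goal]
    candidates = [framing]
    for extra in extras:
        framing = framing + [extra]
        candidates.append(framing)
    for candidate in candidates:
        if brainstem_score(candidate, tired) > 0:
            return candidate
    return None

extras = ["follow through on my commitments", "impress my friends"]
rested = search_for_framing("go to the gym", "be healthy", extras, tired=False)
tired = search_for_framing("go to the gym", "be healthy", extras, tired=True)
print(rested)  # ['go to the gym', 'be healthy']
print(tired)   # None
```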
3. This search algorithm tends to immediately “snuff itself out” unless the desirable aspect is “honorable” / appealing upon reflection / leading to desirable follow-on consequences
So far this seems to be unrelated to base vs noble motivations. Where does that come from?
Well, let’s try flipping it around. The reverse would be as follows:
My neocortex proposes “I will go to the gym and be healthy”, the brainstem endorses it and that’s now the plan. But the gym doesn’t open for 10 minutes, so I’m sitting and waiting. Now my neocortex thinks the thought “I won't go to the gym, I will stay on the couch instead”. That’s an appealing thought! But the brainstem rejects it as a plan, because its appeal is not strong enough to outweigh the appeal of “I will be healthy”. So the neocortex search algorithm whirrs into action! Is there a way to frame the thought “I will stay on the couch instead” so as to make it more appealing to the brainstem? How about this: “I will stay on the couch, and thus avoid the risk of dropping a heavy weight on my foot, and also I won’t have to talk to Tony at the gym entrance, man I hate that guy.” Nope, not good enough, the brainstem says I’m sticking with the original plan of going to the gym.
So this neocortex search algorithm should correspondingly get mapped intuitively into “my neocortex wants to stay on the couch, but my brainstem wants to be healthy”!! As above, that’s not a technically correct description, just an intuition. But clearly, it’s not impossible for this kind of thinking process to happen.
Still, I do think this case I just walked through is an unusual case. We are more likely to have our neocortical search algorithm searching for ways to do honorable things like go to the gym, not searching for ways to stay on the couch.
Why isn't it symmetric?
I think the difference is: some things just get more appealing when you think about them more, and others get less appealing.
More mechanistically, the neocortex works on a principle of “thoughts tend to trigger other thoughts”, and those “other thoughts” can be positive-reward or negative-reward. Remember, the brainstem is powering this search process; we’ll only keep searching as long as the brainstem is liking what it’s seeing. If the search algorithm winds up spawning a bunch of secondary thoughts that the brainstem hates, those secondary thoughts will snuff out the search algorithm itself. And we’ll start thinking about something else instead.
That’s a bit abstract. Let’s try an example.
Let’s say I’m in a hurry to leave for an appointment, but my neocortex entertains a glimmer of the idea “I will stop and eat candy before I go”. The brainstem says “Nope, candy is good, but being late is really bad, plan rejected.” So the neocortex search algorithm spins up: is there a way to eat candy without being late? What does that search entail? Well, it immediately brings to the forefront of my mind the idea “I will eat candy”, and then it tries to build an acceptable and plausible plan that incorporates that idea, like filling in the missing pieces of a puzzle. However, holding “I will eat candy” at the forefront of my mind spawns a bunch of negative follow-on thoughts like “…and thus I will break my promise to myself”, “…and thus I won’t fit in my pants”, etc. The brainstem gets a whiff of those thoughts and it snuffs out the search process itself. The brainstem no longer sees any benefit in thoughts that involve searching for a way to eat candy, so the neocortex stops doing that search.
By contrast, if your search algorithm is looking for a way to do something that’s appealing on reflection, i.e. something that's desirable and whose follow-on consequences are also desirable, then the search algorithm won’t immediately snuff itself out. It can keep running until success, or until frustration sets in.
So I think that when people introspect about times that they’ve been searching for a way to “make themselves do something”, they’ll typically come up with examples where the thing was honorable / appealing upon reflection / having desirable consequences, since those are the searches that last for more than a fraction of a second.
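A toy sketch of the asymmetry, under assumptions that are entirely mine: one follow-on thought surfaces per search step, valences simply add, and the search halts as soon as the running total is no longer positive. The positive follow-ons for “be healthy” are made up for illustration, as is the function name `search_steps_survived`.

```python
# Hypothetical valences (arbitrary units) for the idea held at the forefront of the mind...
GOAL_VALENCE = {"eat candy": 2, "be healthy": 4}
# ...and for the follow-on thoughts that holding it there tends to spawn.
FOLLOW_ON_VALENCES = {
    "eat candy": [-3, -2],  # "break my promise to myself", "won't fit in my pants"
    "be healthy": [2, 1],   # made-up pleasant follow-ons
}

def search_steps_survived(goal, max_steps=10):
    """Count the search steps that run before the brainstem stops rewarding the search."""
    running_total = GOAL_VALENCE[goal]
    follow_ons = FOLLOW_ON_VALENCES[goal]
    survived = 0
    for step in range(max_steps):
        if step < len(follow_ons):
            running_total += follow_ons[step]  # one follow-on thought surfaces
        if running_total <= 0:
            break  # the search snuffs itself out
        survived += 1
    return survived

print(search_steps_survived("eat candy"))   # 0
print(search_steps_survived("be healthy"))  # 10
```

Here `max_steps` plays the role of frustration setting in: the “be healthy” search runs until the budget is gone, while the candy search dies on its first follow-on thought.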
3A. An important special case: self-reflective thoughts
When I talk about plans with “desirable consequences”, you’ll probably think of normal, causal consequences, like “I will go to the gym” → “I will be healthy”. That's one possibility, but another important case is "consequences for how we see ourselves, how others see us, how we present ourselves, etc.". For example, if you’re committed to dieting, then it’s possible that:
- The idea of “eating candy” is appealing
- The idea of “myself eating candy” is aversive
Get it? The first one involves taking certain actions, tasting certain tastes, etc. The second one involves the abstract idea that I will have eaten candy, and I’ll remember having done it, and when people ask me I’ll have to tell them about it, or lie.
The second thing is a consequence of the first thing. If I eat candy, then I will have eaten candy! So just as above, the desirability of self-reflective thoughts can determine whether the neocortex search algorithm works long and hard on finding a way to make a plan that passes muster with the brainstem, or whether the search algorithm immediately snuffs itself out.
I think that we have a lot of strong motivations about the kind of person we want to be. (…And the brainstem is the source of these motivations, just like every other motivation.) I think this is an important component of our social instincts.
So this gets back to “honorable motivations”, here in the sense of “motivations that we’re proud to have”. When we want something that we associate with an “honorable motivation”, it’s much likelier that the neocortex will search for a way to make it happen despite internal (motivational) obstacles, rather than the search algorithm running for a fraction of a second and then snuffing itself out. Again, this leads to the (misleading) intuition that the neocortex is trying to make us "act honorably" over the protests of our brainstem.
4. Willpower, akrasia, and “guilt by association”
Let’s go back to the example from earlier:
A: “I will go to the gym” is aversive (negative-reward)
B: “I will be healthy” is attractive (positive-reward)
A+B: “I will go to the gym and thus be healthy” is net slightly attractive (slightly positive reward)
One aspect of this is that, by yoking together A+B, A has become more appealing. That’s what I was talking about above.
But there’s another aspect: by yoking together A+B, B has become less appealing. (“Ugh, I’m not so sure about ‘being healthy’, it’s pretty exhausting!”)
This is a factor that constrains the ability of the neocortex search algorithm to successfully find strategies to frame inherently-unpleasant plans in a way that the brainstem finds net attractive.
Specifically, it’s not as easy as taking one strongly-motivating thought like “I want to follow through on my commitments”, attaching it to 500 different inherently-unpleasant tasks, one after another, and then actually doing all 500 of those things. Instead, each of those tasks drags down the concept of “I want to follow through on my commitments”, until that concept is so saddled with negative associations that it's no longer even able to escort itself through the brainstem, let alone anything else!
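One way to picture this drag is a simple associative update, where each pairing pulls the concept's valence a bit toward the valence of the task it was yoked to. To be clear, this update rule, the `rate` parameter, the starting numbers, and the name `yoke_to_tasks` are all assumptions I'm making for the sketch; I'm not claiming this is the actual learning rule.

```python
def yoke_to_tasks(concept_valence, task_valence, n_tasks, rate=0.25):
    """Pair one motivating concept with n_tasks unpleasant tasks, one after another.

    Assumed rule: a task gets done only if concept + task is net positive, and
    every pairing drags the concept's valence toward the task's valence.
    """
    completed = 0
    for _ in range(n_tasks):
        if concept_valence + task_valence > 0:
            completed += 1
        # Guilt by association: the concept picks up some of the task's unpleasantness.
        concept_valence += rate * (task_valence - concept_valence)
    return completed, concept_valence

# Hypothetical numbers: a strongly positive concept, 500 mildly aversive tasks.
completed, final_valence = yoke_to_tasks(10.0, -2.0, n_tasks=500)
print(completed)          # 4
print(final_valence < 0)  # True
```

With these made-up numbers, only 4 of the 500 tasks get done, and the concept itself ends up net-negative.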
What about “the brainstem is stupid”?
I often talk about how the brainstem is stupid, it doesn’t know what’s going on in the world, etc. That seems inconsistent with me breezily talking about how the brainstem sees “going to the gym” as bad and “being healthy” as good. Aren’t those kinda complex concepts? Well, my answer is that certain parts of the brain (agranular prefrontal and insular and cingulate cortex, ventral striatum, amygdala, hippocampus) use supervised learning to distill an arbitrary thought into a vector, with maybe dozens of dimensions, that the brainstem (and hypothalamus) can interpret. We can interpret this vector as answering questions like “If I do this plan, how appropriate would it be to cringe? To salivate? To release cortisol? To laugh? Etc.” This enables the brainstem to understand what’s going on well enough to issue appropriate rewards. See Big Picture of Phasic Dopamine. I glossed over this stuff in this post, in order to keep things simple.
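As a rough sketch of that division of labor: a supervised learner maps each thought to a small vector of predicted bodily responses, and a fixed, "stupid" scorer turns that vector into a reward without knowing what the thought was about. The four response dimensions come from the questions above; the weights, the delta-rule update, and the names `ThoughtAssessor` and `brainstem_reward` are hypothetical choices of mine, and I'm assuming the training signal is whatever response actually occurred.

```python
RESPONSES = ["cringe", "salivate", "cortisol", "laugh"]

# Hypothetical hardwired weights: how much the brainstem likes each predicted response.
BRAINSTEM_WEIGHTS = {"cringe": -1.0, "salivate": 1.0, "cortisol": -0.5, "laugh": 0.5}

class ThoughtAssessor:
    """Learns, per thought, a prediction of which responses will turn out appropriate."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.predictions = {}

    def predict(self, thought):
        # Unfamiliar thoughts get a neutral (all-zero) vector.
        return self.predictions.get(thought, {r: 0.0 for r in RESPONSES})

    def train(self, thought, actual):
        # Supervised (delta-rule) update toward the response that actually occurred.
        updated = dict(self.predict(thought))
        for r in RESPONSES:
            updated[r] += self.learning_rate * (actual[r] - updated[r])
        self.predictions[thought] = updated

def brainstem_reward(vector):
    """The brainstem sees only the vector, never the thought itself."""
    return sum(BRAINSTEM_WEIGHTS[r] * vector[r] for r in RESPONSES)

assessor = ThoughtAssessor()
ate_chocolate = {"cringe": 0.0, "salivate": 1.0, "cortisol": 0.0, "laugh": 0.0}
for _ in range(3):
    assessor.train("eat chocolate", ate_chocolate)

print(brainstem_reward(assessor.predict("eat chocolate")))      # 0.875
print(brainstem_reward(assessor.predict("file my tax return")))  # 0.0
```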
What about “The Monkey And The Machine”?
Many readers here have seen Paul Christiano’s post The Monkey And The Machine. I don’t have any particular complaints there. I think of the post as being mostly about stuff that happens within the neocortex, where “monkey” is the trained model learned by the cortex, and “deliberator” is a subset of those learned processes that implements things like "explicit reasoning using language". (Somewhat related: Kaj Sotala’s post System 2 as working-memory augmented System 1 reasoning.) This is a reasonable and useful way to think about certain things. I don't think it has anything to do with what I'm talking about here.