I’m biased: I have a strong prior belief that reinforcement learning should not be involved in sensory processing in the brain. (Update: I meant "directly involved", see comment.) (Update 2: I should have said "low-level sensory processing", see discussion of inferotemporal cortex here.) The reason is simple: avoiding wishful thinking.

If there’s a positive reward for looking at beautiful ocean scenes, for example, then the RL solution would converge towards parsing whatever you look at as a beautiful ocean scene, whether it actually is or not!
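
To make the wishful-thinking worry concrete, here is a toy sketch. It is my own illustration under loud assumptions, not a model of any real circuit: the one-feature "ocean detector", the function names `train_percept` and `p_ocean`, and the REINFORCE-style update are all hypothetical choices. The detector is trained either by being rewarded whenever its percept is "ocean", or by prediction error against what is actually there.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_percept(rule, steps=5000, lr=0.1, seed=0):
    """Train a one-feature 'is this an ocean scene?' detector.

    rule="reward":     REINFORCE-style update, reward 1 whenever the percept is "ocean"
    rule="prediction": error-driven update toward the true label
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        is_ocean = rng.random() < 0.5
        x = 1.0 if is_ocean else -1.0            # toy sensory feature (assumption)
        p = sigmoid(w * x + b)                   # P(percept = "ocean")
        if rule == "reward":
            percept = 1.0 if rng.random() < p else 0.0
            reward = percept                     # rewarded for *seeing* an ocean
            grad = reward * (percept - p)        # reward * d log-prob / d logit
        else:
            grad = (1.0 if is_ocean else 0.0) - p  # prediction error
        w += lr * grad * x
        b += lr * grad
    return w, b

def p_ocean(params, x):
    """Probability that the trained detector reports 'ocean' for feature x."""
    w, b = params
    return sigmoid(w * x + b)

reward_trained = train_percept("reward")
prediction_trained = train_percept("prediction")
for name, params in [("reward", reward_trained), ("prediction", prediction_trained)]:
    print(name, "P(ocean | non-ocean input) =", round(p_ocean(params, -1.0), 3))
```

In this toy setup the reward-trained detector ends up reporting "ocean" for non-ocean inputs too, because nothing in its update depends on what is really there, while the prediction-trained detector tracks the input.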

Predictive learning (a.k.a. self-supervised learning), by contrast, seems perfect for the job of training sensory-processing systems. Well, really weighted predictive learning, where some prediction errors are treated as worse than others, and where top-down attention provides the weights. (Or something. Not sure about the details here.) Anyway, predictive learning does not have a “wishful thinking” problem; plus there’s a nice information-rich supervisory signal that provides full error vectors (used for error-driven learning), which are easier to learn from than 1D rewards. (Imagine a high school teacher grading essays. If they say “That essay was bad; next time try using shorter paragraphs”, that’s like an error gradient signal, telling the student how to improve. If they just say “That essay was bad”, that’s like a reward signal, and now making progress is much harder!) So that’s been my belief: I say reward learning has no place in the sensory-processing parts of the brain.
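
The essay-grading analogy can be sketched numerically. This is a minimal illustration with assumptions of my own: the 50-dimensional target, the step counts, and the simple "try a random change, keep it if the reward went up" learner are hypothetical stand-ins, and a cleverer reward-based learner would do better than this one. The point is only that both learners get the same number of tries, but one is told the full error vector and the other just a single number.

```python
import random

def squared_error(w, target):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def learn_from_error_vector(target, steps=200, lr=0.1):
    """Teacher hands back the full error vector; move each weight along it."""
    w = [0.0] * len(target)
    for _ in range(steps):
        error = [ti - wi for wi, ti in zip(w, target)]
        w = [wi + lr * ei for wi, ei in zip(w, error)]
    return squared_error(w, target)

def learn_from_scalar_reward(target, steps=200, sigma=0.1, seed=0):
    """Teacher only says how good the attempt was; keep a random change if reward rose."""
    rng = random.Random(seed)
    w = [0.0] * len(target)
    best_reward = -squared_error(w, target)
    for _ in range(steps):
        candidate = [wi + rng.gauss(0.0, sigma) for wi in w]
        reward = -squared_error(candidate, target)   # a single number, no direction
        if reward > best_reward:
            w, best_reward = candidate, reward
    return squared_error(w, target)

rng = random.Random(1)
target = [rng.uniform(-1.0, 1.0) for _ in range(50)]  # arbitrary thing to be learned
err_vector = learn_from_error_vector(target)
err_reward = learn_from_scalar_reward(target)
print(f"error-vector learner:  {err_vector:.2e}")
print(f"scalar-reward learner: {err_reward:.2e}")
```

With the same budget of 200 updates, the error-vector learner should land essentially on the target while the scalar-reward learner is still far away, which is the "That essay was bad" problem in miniature.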

So later on, I was excited to learn that the basal ganglia (which plays a central role in reinforcement learning) sends signals into the frontal lobe of the brain—the home of plans and motor outputs—but not to the other lobes, which are more involved in sensory processing. (Update: There's at least one exception, namely inferotemporal cortex; I guess the division between RL / not-RL does not quite line up perfectly with the division between frontal lobe / other lobes.) (Update 2: I have a lot more to say about the specific case of inferotemporal cortex here.) Anyway, that seemed to confirm my mental picture.

Then I was reading Marblestone, Wayne, and Kording 2016 (let’s call it MWK), and was gobsmacked (yes, gobsmacked!) when I came across this little passage:

Reward-driven signaling is not restricted to the striatum [part of the basal ganglia], and is present even in primary visual cortex (Chubykin et al., 2013; Stănişor et al., 2013). Remarkably, the reward signaling in primary visual cortex is mediated in part by glial cells (Takata et al., 2011), rather than neurons, and involves the neurotransmitter acetylcholine (Chubykin et al., 2013; Hangya et al., 2015). On the other hand, some studies have suggested that visual cortex learns the basics of invariant object recognition in the absence of reward (Li and Dicarlo, 2012), perhaps using reinforcement only for more refined perceptual learning (Roelfsema et al., 2010).

But beyond these well-known global reward signals, we argue that the basic mechanisms of reinforcement learning may be widely re-purposed to train local networks using a variety of internally generated error signals. These internally generated signals may allow a learning system to go beyond what can be learned via standard unsupervised methods, effectively guiding or steering the system to learn specific features or computations (Ullman et al., 2012).

Could it be?? Am I just horribly confused about everything?

So I downloaded and skimmed MWK’s key citations here. To make a long story short, I’m not convinced. Here are my quick impressions of the relevant papers ...

“A unified selection signal for attention and reward in primary visual cortex”, Stănişor et al. 2013.

Start with common sense. Let’s say you just love finding pennies on the sidewalk. It’s just your favorite thing. You dream about it at night. Now, if you see a penny on the sidewalk, you’re bound to immediately notice it and pay continuing attention to it. That’s obvious, right? The general pattern is: Attention is often directed towards things that are rewarding, and the amount of reward is likely to bear at least some relation to how much attention you pay. Moreover, the strength and direction of attention influences the activity of neurons all over the cortex.

Now, the authors did an experiment on macaques, where a dot appeared and they had to look at it, and the color of the dot impacted how much reward they got when they did so. I guess the idea was that they were controlling for attention because the macaques were paying attention regardless of the dot colors—how else would they saccade to it? I don’t really buy that. I think that attention has a degree—it’s not just on or off. Let’s say I love finding pennies on the sidewalk, and I like finding nickels on the sidewalk, but my heart’s not in it. When I see either a nickel or a penny, I’ll look at it. But I’ll look a lot more intently at the penny than the nickel! For example, maybe I was singing a song in my head as I walked. If I see the nickel, maybe I’ll look at the nickel but keep singing the song. I notice the nickel, but I still have some attention left over! If I see the penny, I’ll be so excited that everything else in my head stops in its tracks, and 100% of my attention is focused on that penny.

From that perspective, everything in the paper seems to actually support my belief that reward-based learning plays no role whatsoever in the visual cortex. Reward affects the frontal lobe, and then depending on the reward, the frontal lobe directs more or less attention towards the visual cortex. That would nicely explain their finding that: “The neuronal latency of this reward value effect in V1 was similar to the latency of attentional influences. Moreover, V1 neurons with a strong value effect also exhibited a strong attention effect, which implies that relative value and top–down attention engage overlapping, if not identical, neuronal selection mechanisms.”

“A Cholinergic Mechanism for Reward Timing within Primary Visual Cortex”, Chubykin et al. 2013.

The authors took rats and coaxed them to learn that after they did a certain thing (lick a thing a certain number of times), a rewarding thing would happen to them (they’re thirsty and they get a drop of water). Right when they expected the reward, there were telltale signs of activity in their primary visual cortex (V1).

I figure, as above, that this talk about “rewards” is a red herring—the important thing is that the reward expectation coincides with some shift of the rat’s attention, which has ripple effects all over the cortex, and thus all parts of the cortex learn to expect those ripple effects.

Then the experimenters changed the number of licks required for the reward. The telltale signs of activity, predictably, shifted to the new reward time ... but not if the experimenters infused the rat brain (or more specifically the part of the visual cortex where they were recording) with a toxin that prevented acetylcholine effects. How do I account for that? Easy: acetylcholine = learning rate—see separate post. No acetylcholine, no learning. The visual cortex is still a learning algorithm, even if it’s not learning from rewards, but rather learning to expect a certain attentional shift within the brain.
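
Here is a minimal sketch of the "acetylcholine = learning rate" reading, under stated assumptions: the delta-rule tracker, the function name `track_event_time`, and the event times of 1.0 s and 2.0 s are hypothetical and not taken from the paper. The learner predicts when a recurring internal event (such as the attentional shift around the expected reward) will arrive, and no reward term appears anywhere in its update.

```python
def track_event_time(event_times, learning_rate):
    """Delta-rule estimate of when a recurring internal event happens."""
    predicted = event_times[0]
    for actual in event_times:
        # Pure prediction error; the learning rate plays the role of acetylcholine.
        predicted += learning_rate * (actual - predicted)
    return predicted

# Hypothetical schedule: the event arrives at 1.0 s, then the task changes to 2.0 s.
schedule = [1.0] * 50 + [2.0] * 50

with_ach = track_event_time(schedule, learning_rate=0.2)  # acetylcholine intact
blocked = track_event_time(schedule, learning_rate=0.0)   # acetylcholine blocked

print("intact: ", round(with_ach, 3))
print("blocked:", round(blocked, 3))
```

With a nonzero learning rate the predicted time moves to the new schedule; with the learning rate set to zero it stays at the old one, which is the qualitative pattern described above without any reward signal reaching the learner.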

“Central Cholinergic Neurons Are Rapidly Recruited by Reinforcement Feedback”, Hangya et al. 2015.

The authors offer loads of evidence that reward and punishment cause acetylcholine to appear. But I don’t think they claim (or offer evidence) that the acetylcholine is communicating a reward. Indeed, if I’m reading it right, the acetylcholine spikes after both reward and punishment, whereas a reward signal like phasic dopamine needs to swing both positive and negative.

The evidence all seems to be consistent with my belief (see separate post) that acetylcholine controls learning rate, and that it’s biologically advantageous for the brain to use a higher learning rate when important things like reward and punishment are happening (and when you’re aroused, when a particular part of the brain is in the spotlight of top-down attention, etc.).

“Perceptual learning rules based on reinforcers and attention”, Roelfsema et al. 2010.

If I’m reading it right (a big “if”!), everything in this paper is consistent with what I wrote above and in the other post. Acetylcholine determines learning rate, and you learn better by having the capability to set different learning rates at different times and in different parts of the brain, and the presence of rewards and punishments is one signal that maybe the learning rate should be unusually high right now.


Well, those are the main citations. This is a very quick-and-dirty analysis, but I’m sufficiently satisfied that I’m going to stick with my priors here: reward-based learning is not involved in sensory processing in the brain.

(Thanks Adam Marblestone for comments on a draft.)



I’m biased: I have a strong prior belief that reinforcement learning should not be involved in sensory processing in the brain. The reason is simple: avoiding wishful thinking.

If there’s a positive reward for looking at beautiful ocean scenes, for example, then the RL solution would converge towards parsing whatever you look at as a beautiful ocean scene, whether it actually is or not!

That seems like a strong argument for why RL should not be the only mechanism for sensory processing, but a weaker one for why it shouldn't be involved at all?

Without looking at the papers you cited, one reason one might expect RL to be involved in sensory processing would be the connection between perception and skill learning. Some of the literature on expertise that I've seen suggests that the development of a skill involves literally learning to see differently, in which case reinforcement learning should be associated with sensory processes, shaping one's perception as one practices a skill so that one starts to see things as an expert would.

E.g. Gary Klein:

In this project, we studied the way nurses could tell when a very premature infant was developing a life-threatening infection. Beth Crandall, one of my coworkers, had gotten funding from the National Institutes of Health to study decision making and expertise in nurses. She arranged to work with the nurses in the neonatal intensive care unit (NICU) of a large hospital. These nurses cared for newly born infants who were premature or otherwise at risk. 

Beth found that one of the difficult decisions the nurses had to make was to judge when a baby was developing a septic condition - in other words, an infection. These infants weighed only a few pounds - some of them, the microbabies, less than two pounds. When babies this small develop an infection, it can spread through their entire body and kill them before the antibiotics can stop it. Noticing the sepsis as quickly as possible is vital.

Somehow the nurses in the NICU could do this. They could look at a baby, even a microbaby, and tell the physician when it was time to start the antibiotic (Crandall and Getchell-Reiter 1993). Sometimes the hospital would do tests, and they would come back negative. Nevertheless, the baby went on antibiotics, and usually the next day the test would come back positive.

This is the type of skilled decision making that interests us the most. Beth began by asking the nurses how they were able to make these judgments. "It's intuition," she was told, or else "through experience." And that was that. The nurses had nothing more to say about it. They looked. They knew. End of story. That was even more interesting: expertise that the person clearly has but cannot describe. Beth geared up the methods we had used with the firefighters. Instead of asking the nurses general questions, such as, "How do you make this judgment?" she probed them on difficult cases where they had to use the judgment skills. She interviewed nurses one at a time and asked each to relate a specific case where she had noticed an infant developing sepsis. The nurses could recall incidents, and in each case they could remember the details of what had caught their attention. The cues varied from one case to the next, and each nurse had experienced a limited number of incidents. Beth compiled a master list of sepsis cues and patterns of cues in infants and validated it with specialists in neonatology.

Some of the cues were the same as those in the medical literature, but almost half were new, and some cues were the opposite of sepsis cues in adults. For instance, adults with infections tend to become more irritable. Premature babies, however, become less irritable. If a microbaby cried every time it was lifted up to be weighed and then one day it did not cry, that would be a danger signal to the experienced nurse. Moreover, the nurses were not relying on any single cue. They often reacted to a pattern of cues, each one subtle, that together signaled an infant in distress. [...]

In her project with the neonatal intensive care unit, Beth Crandall had asked the experienced nurses how they detected the early signs of sepsis. They told her it was experience and intuition. They did not know what they knew, because what they knew was perceptual - how to see. The only way Beth was going to find out anything useful was to have the nurses tell their stories of specific instances, each tied to a different set of perceptual cues. At the end of the interviews, Beth could draw all the stories together and compile a master list of cues to sepsis.

and David Chapman:

Because most of life is routine, and most objects and situations are mostly familiar, and because we’ve practiced our visual skills from early childhood, they suffice for most tasks, and go unnoticed. Needing to learn new visual skills is unusual for adults.

Some video games are an outstanding exception. Video games are designed to make learning new skills fun, and many games teach you to see in new ways. When you enter a new segment of the game, everything is happening much too fast; you have no idea where to look or what it means. Enemies come out of nowhere and kill you before you even see what they are. With practice, you learn to see things you couldn’t before, because you didn’t know how.

Image courtesy Evelyn Chai

You are sneaking along a gloomy passageway in the necromancer’s tower. Suddenly you die. WTF just happened??

You reload the game from the last save point.

You are sneaking along that passageway and out of the corner of your eye you see something violent happen on left side of the screen and then you die. You reload.

You are sneaking along, looking left, and a golem leaps out of the archway on your left side and kills you.

This time, you’re watching the archway cautiously, and when the golem leaps out you hit it with a lightning bolt. A moment later, something happens on the right and you die.

Next try, you zap the golem with a lightning bolt, you flick your eyes right, and as the tomb over there opens, you manage to get one of the zombies with a fireball. But another one kills you. You noticed that the headless zombie hesitated for a moment before attacking.

You zap the golem and incinerate the one-arm zombie with a fireball while the headless one gropes around. You cartwheel to dodge its attack, and finish it off with a mid-air flying dagger thrust. Awesomeness! Unfortunately the floorboards you land on are rotten and you fall through to your death.

… An hour later, you stroll through the tower, knocking off monsters and skipping traps without really thinking about it, because you have learned to perceive the meanings of routine necromantic phenomena. Archways harbor golems, zombies without heads can’t see, rotten floorboards are a bit darker than solid ones. Now you know what to look out for, and where to look to see it.

You whack the necromancer, collect the Arcane Eggbeater Of Destiny, and return with it to the College of Wizards to get your next homework assignment. [...]

What we’ve learned from visual psychology suggests that seeing involves learned, task-specific skills. It is contextual and purposive, which makes it a good fit for everyday, reasonable, routine activity. (And not such a great fit for objective rationality.)

Visual activity is not separate from the rest of what we’re doing. The phrase “hand-eye coordination” points at this. In video games, your visual actions that tell your lower-level visual processing systems what to do are just as much part of the skill of fighting a group of monsters as swinging your sword is. Shifts of visual attention, for instance, are seamlessly integrated with the rest of the killing dance. As a more mundane example, if you are looking for scissors, you’ll move your head as well as your eyes to check around the desk, shove clutter out of the way to see behind or beneath it, and eventually get up and go open a drawer and peer inside. Visual activity and bodily motions are entangled. [...]

Much of what you see, you see as something. You don’t see a textured black region of the visual image, you see a loudspeaker. Or a raven. It’s already a loudspeaker or a raven when you first experience it. Bottom-up vision has done that work for you.

What you see something as depends on your knowledge, context, and purposes. If you are familiar with moussaka and you see it on a plate in a restaurant, you’ll probably see it as moussaka. If you aren’t, you’ll probably see it as a mushy casserole. You can’t see it as moussaka, because that’s not part of your ontology. If you see moussaka on a city sidewalk, you might just see it as disgusting, potentially pathogenic slop that you want to avoid. What you see a clump of atoms as depends on what you are looking out for, and why. Although bottom-up processes can do much of the work for you, especially in the case of rigid manufactured objects like loudspeakers with standard shapes and colors, your top-down direction often also plays a critical role.

and Josh Waitzkin:

Let’s say that I spend fifteen years studying chess. [...] We will start with day one. The first thing I have to do is to internalize how the pieces move. I have to learn their values. I have to learn how to coordinate them with one another. Early on, these steps might seem complex. There is the pawn, the knight, the bishop, the rook, the queen, and the king. Each piece is unique, with its own strengths and weaknesses. Each time I look at a chess piece I have to remember what it is and how it moves. Then I look at the next piece and try to remember how that one moves. There are initially thirty-two pieces on a chessboard. To make a responsible chess decision, I have to look at all those pieces and check for captures, quick attacks, and other obvious possibilities. By the time I get to the third piece, I’m already a bit overwhelmed. By the tenth piece I have a headache, have already forgotten what I discovered about the first nine pieces, and my opponent is bored. At this point I will probably just make a move and blunder. 

So let’s say that now, instead of launching from the standard starting position, we begin on an empty board with just a king and a pawn against a king. These are relatively simple pieces. I learn how they both move, and then I play around with them for a while until I feel comfortable. Then, over time, I learn about bishops in isolation, then knights, rooks, and queens. Soon enough, the movements and values of the chess pieces are natural to me. I don’t have to think about them consciously, but see their potential simultaneously with the figurine itself. Chess pieces stop being hunks of wood or plastic, and begin to take on an energetic dimension. Where the piece currently sits on a chessboard pales in comparison to the countless vectors of potential flying off in the mind. I see how each piece affects those around it. Because the basic movements are natural to me, I can take in more information and have a broader perspective of the board. Now when I look at a chess position, I can see all the pieces at once. The network is coming together. 

Next I have to learn the principles of coordinating the pieces. I learn how to place my arsenal most efficiently on the chessboard and I learn to read the road signs that determine how to maximize a given soldier’s effectiveness in a particular setting. These road signs are principles. Just as I initially had to think about each chess piece individually, now I have to plod through the principles in my brain to figure out which apply to the current position and how. Over time, that process becomes increasingly natural to me, until I eventually see the pieces and the appropriate principles in a blink. While an intermediate player will learn how a bishop’s strength in the middlegame depends on the central pawn structure, a slightly more advanced player will just flash his or her mind across the board and take in the bishop and the critical structural components. The structure and the bishop are one. Neither has any intrinsic value outside of its relation to the other, and they are chunked together in the mind. 

This new integration of knowledge has a peculiar effect, because I begin to realize that the initial maxims of piece value are far from ironclad. The pieces gradually lose absolute identity. I learn that rooks and bishops work more efficiently together than rooks and knights, but queens and knights tend to have an edge over queens and bishops. Each piece’s power is purely relational, depending upon such variables as pawn structure and surrounding forces. So now when you look at a knight, you see its potential in the context of the bishop a few squares away. Over time each chess principle loses rigidity, and you get better and better at reading the subtle signs of qualitative relativity. Soon enough, learning becomes unlearning. The stronger chess player is often the one who is less attached to a dogmatic interpretation of the principles. This leads to a whole new layer of principles—those that consist of the exceptions to the initial principles. Of course the next step is for those counterintuitive signs to become internalized just as the initial movements of the pieces were. The network of my chess knowledge now involves principles, patterns, and chunks of information, accessed through a whole new set of navigational principles, patterns, and chunks of information, which are soon followed by another set of principles and chunks designed to assist in the interpretation of the last. Learning chess at this level becomes sitting with paradox, being at peace with and navigating the tension of competing truths, letting go of any notion of solidity. [...]

Most people would be surprised to discover that if you compare the thought process of a Grandmaster to that of an expert (a much weaker, but quite competent chess player), you will often find that the Grandmaster consciously looks at less, not more. That said, the chunks of information that have been put together in his mind allow him to see much more with much less conscious thought. So he is looking at very little and seeing quite a lot. This is the critical idea.

Thanks! My current model is that

  1. The frontal lobe does involve RL, and this is used to think high-value thoughts and take high-value actions;
  2. One reason that thoughts / actions can be high-value is by acquiring valuable information, and one way they can do this is by directing saccades and attention towards parts of the visual field (or other sensory input) where the valuable information is;
  3. That corresponding sensory input processing area is still doing predictive learning, but it uses a higher learning rate when it is the focus of top-down attention, and therefore tends to develop a rich pattern-recognizing vocabulary that is lopsidedly tailored towards recognizing the types of patterns that carry valuable information from the perspective of the RL-based frontal lobe.
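
The three steps above can be sketched as a toy program. Everything here is a hypothetical illustration: the stream names, the fixed `values` standing in for whatever the RL-based frontal lobe has learned, and the `attended_gain` factor are assumptions, not measured quantities. The only thing the sensory side ever sees is a prediction error scaled by an attention-dependent learning rate.

```python
def predictive_update(prediction, observation, learning_rate):
    """Pure prediction-error update; no reward term appears here."""
    return [p + learning_rate * (o - p) for p, o in zip(prediction, observation)]

def run(values, patterns, steps=30, base_lr=0.02, attended_gain=10.0):
    """Return each stream's squared prediction error after attention-weighted learning."""
    # "Frontal" side: attend to whichever stream has the highest learned value.
    attended = max(values, key=values.get)
    predictions = {name: [0.0] * len(p) for name, p in patterns.items()}
    for _ in range(steps):
        for name, pattern in patterns.items():
            # Sensory side: same predictive rule everywhere, higher rate under attention.
            lr = base_lr * (attended_gain if name == attended else 1.0)
            predictions[name] = predictive_update(predictions[name], pattern, lr)
    return {
        name: sum((o - p) ** 2 for p, o in zip(predictions[name], patterns[name]))
        for name in patterns
    }

values = {"pennies": 1.0, "pavement": 0.1}      # assumed output of the RL system
patterns = {"pennies": [1.0, 0.0, 1.0],         # made-up sensory patterns
            "pavement": [0.0, 1.0, 1.0]}
errors = run(values, patterns)
print(errors)
```

After the same amount of exposure, the attended stream is predicted far better than the unattended one, so the learned vocabulary comes out lopsided toward what the value system cares about even though reward never enters the sensory update.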

So RL is involved, just a step removed. (Maybe my post title was bad. :-P ) Do you think that's an adequate involvement of RL to explain those and other examples?

Maybe. :) I don't have much of a position on "which part of the brain is the sensory processing-related reinforcement learning implemented in", just on the original claim of "we shouldn't expect to find RL involved in sensory processing".

That's fair; the first sentence now has a caveat to that effect. Thanks again!