(UPDATE 1½ YEARS LATER: I’ve learned a lot since writing this post, and you shouldn’t assume that I still endorse everything herein.)
I’m biased: I have a strong prior belief that reinforcement learning should not be involved in sensory processing in the brain. (Update: I meant "directly involved", see comment.) (Update 2: I should have said "low-level sensory processing", see discussion of inferotemporal cortex here.) The reason is simple: avoiding wishful thinking.
If there’s a positive reward for looking at beautiful ocean scenes, for example, then the RL solution would converge towards parsing whatever you look at as a beautiful ocean scene, whether it actually is or not!
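Here is a toy sketch of that failure mode, under heavy assumptions: the "perceiver" is a single logistic unit, the world has only two scenes, and the RL learner follows the exact gradient of expected reward. The names (`train`, `reward_rule`, `predictive_rule`, `SCENES`) are made up for illustration and nothing here is meant as a model of real cortex.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy world: x = +1 means the scene really is an ocean, x = -1 means it is not.
SCENES = [(+1.0, 1.0), (-1.0, 0.0)]  # (input, true label)

def train(update_rule, steps=2000, lr=0.5):
    """Train a one-unit 'perceiver' with p(ocean | x) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, label in SCENES:
            p = sigmoid(w * x + b)
            signal = update_rule(p, label)
            grad_w += signal * x / len(SCENES)
            grad_b += signal / len(SCENES)
        w += lr * grad_w
        b += lr * grad_b
    return w, b

def reward_rule(p, label):
    # Reward = 1 whenever the scene is *parsed* as ocean, ignoring the truth.
    # Expected reward is p, so its gradient w.r.t. the logit is p * (1 - p).
    return p * (1.0 - p)

def predictive_rule(p, label):
    # Cross-entropy gradient w.r.t. the logit: truth minus percept.
    return label - p

def p_ocean_given_not_ocean(params):
    w, b = params
    return sigmoid(w * -1.0 + b)

rl_params = train(reward_rule)
pred_params = train(predictive_rule)
print("RL percept of a non-ocean scene:        ", p_ocean_given_not_ocean(rl_params))
print("Predictive percept of a non-ocean scene:", p_ocean_given_not_ocean(pred_params))
```

In this toy, the reward-trained unit just cranks up its bias until everything looks like an ocean, while the error-trained unit learns to tell the two scenes apart.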
Predictive learning (a.k.a. self-supervised learning), by contrast, seems perfect for the job of training sensory-processing systems. Well, really weighted predictive learning, where some prediction errors are treated as worse than others, and where top-down attention provides the weights. (Or something. Not sure about the details here.) Anyway, predictive learning does not have a “wishful thinking” problem; plus it comes with a nice information-rich supervisory signal: full error vectors (used for error-driven learning), which are easier to learn from than 1D rewards. (Imagine a high school teacher grading essays. If they say “That essay was bad; next time try using shorter paragraphs”, that’s like an error gradient signal, telling the student how to improve. If they just say “That essay was bad”, that’s like a reward signal, and now making progress is much harder!!) So that’s been my belief: I say reward learning has no place in the sensory-processing parts of the brain.
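To make the essay-grading analogy concrete, here is a minimal sketch comparing the two kinds of feedback on the same task (matching a 20-dimensional target vector). Everything in it is an assumption for illustration: the error-vector learner is a weighted delta rule, with `weights` standing in for attention, and the scalar-reward learner is plain random-perturbation hill climbing, which is a crude stand-in for reward learning rather than a real RL algorithm.

```python
import random

def squared_error(estimate, target):
    return sum((t - e) ** 2 for e, t in zip(estimate, target))

def learn_from_error_vector(target, weights, steps=200, lr=0.1):
    """Teacher says *how* each component was wrong (a full error vector)."""
    estimate = [0.0] * len(target)
    for _ in range(steps):
        errors = [t - e for e, t in zip(estimate, target)]
        # Weighted delta rule: attention-like weights scale each error.
        estimate = [e + lr * w * err
                    for e, w, err in zip(estimate, weights, errors)]
    return estimate

def learn_from_scalar_reward(target, steps=200, noise=0.1, seed=0):
    """Teacher only says 'better' or 'worse' (a 1D reward), so we must guess."""
    rng = random.Random(seed)
    estimate = [0.0] * len(target)
    best_reward = -squared_error(estimate, target)
    for _ in range(steps):
        trial = [e + rng.gauss(0.0, noise) for e in estimate]
        reward = -squared_error(trial, target)
        if reward > best_reward:  # keep random tweaks that happened to help
            estimate, best_reward = trial, reward
    return estimate

rng = random.Random(1)
target = [rng.uniform(-1.0, 1.0) for _ in range(20)]
uniform_attention = [1.0] * len(target)

err_vector = squared_error(learn_from_error_vector(target, uniform_attention), target)
err_reward = squared_error(learn_from_scalar_reward(target), target)
print("squared error, error-vector learner: ", err_vector)
print("squared error, scalar-reward learner:", err_reward)
```

With the same number of feedback rounds, the error-vector learner gets essentially all the way to the target, while the scalar-reward learner is still fumbling around, and the gap only gets worse as the dimension grows.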
So later on, I was excited to learn that the basal ganglia (which plays a central role in reinforcement learning) sends signals into the frontal lobe of the brain—the home of plans and motor outputs—but not to the other lobes, which are more involved in sensory processing. (Update: There's at least one exception, namely inferotemporal cortex; I guess the division between RL / not-RL does not quite line up perfectly with the division between frontal lobe / other lobes.) (Update 2: I have a lot more to say about the specific case of inferotemporal cortex here.) Anyway, that seemed to confirm my mental picture.
Then I was reading Marblestone, Wayne & Kording 2016 (let’s call it MWK), and was gobsmacked (yes, gobsmacked!) when I came across this little passage:
“Reward-driven signaling is not restricted to the striatum [part of the basal ganglia], and is present even in primary visual cortex (Chubykin et al., 2013; Stănişor et al., 2013). Remarkably, the reward signaling in primary visual cortex is mediated in part by glial cells (Takata et al., 2011), rather than neurons, and involves the neurotransmitter acetylcholine (Chubykin et al., 2013; Hangya et al., 2015). On the other hand, some studies have suggested that visual cortex learns the basics of invariant object recognition in the absence of reward (Li and Dicarlo, 2012), perhaps using reinforcement only for more refined perceptual learning (Roelfsema et al., 2010).
“But beyond these well-known global reward signals, we argue that the basic mechanisms of reinforcement learning may be widely re-purposed to train local networks using a variety of internally generated error signals. These internally generated signals may allow a learning system to go beyond what can be learned via standard unsupervised methods, effectively guiding or steering the system to learn specific features or computations (Ullman et al., 2012).”
Could it be?? Am I just horribly confused about everything?
So I downloaded and skimmed MWK’s key citations here. To make a long story short, I’m not convinced. Here are my quick impressions of the relevant papers ...
First up is Stănişor et al. (2013). Start with common sense. Let’s say you just love finding pennies on the sidewalk. It’s just your favorite thing. You dream about it at night. Now, if you see a penny on the sidewalk, you’re bound to immediately notice it and pay continuing attention to it. That’s obvious, right? The general pattern is: Attention is often directed towards things that are rewarding, and the amount of reward is likely to bear at least some relation to how much attention you pay. Moreover, the strength and direction of attention influences the activity of neurons all over the cortex.
Now, the authors did an experiment on macaques, where a dot appeared and they had to look at it, and the color of the dot impacted how much reward they got when they did so. I guess the idea was that they were controlling for attention because the macaques were paying attention regardless of the dot colors—how else would they saccade to it? I don’t really buy that. I think that attention has a degree—it’s not just on or off. Let’s say I love finding pennies on the sidewalk, and I like finding nickels on the sidewalk, but my heart’s not in it. When I see either a nickel or a penny, I’ll look at it. But I’ll look a lot more intently at the penny than the nickel! For example, maybe I was singing a song in my head as I walked. If I see the nickel, maybe I’ll look at the nickel but keep singing the song. I notice the nickel, but I still have some attention left over! If I see the penny, I’ll be so excited that everything else in my head stops in its tracks, and 100% of my attention is focused on that penny.
From that perspective, everything in the paper seems to actually support my belief that reward-based learning plays no role whatsoever in the visual cortex. Reward affects the frontal lobe, and then depending on the reward, the frontal lobe directs more or less attention towards the visual cortex. That would nicely explain their finding that: “The neuronal latency of this reward value effect in V1 was similar to the latency of attentional influences. Moreover, V1 neurons with a strong value effect also exhibited a strong attention effect, which implies that relative value and top–down attention engage overlapping, if not identical, neuronal selection mechanisms.”
Next, Chubykin et al. (2013). The authors took rats and coaxed them to learn that after they did a certain thing (lick a thing a certain number of times), a rewarding thing would happen to them (they’re thirsty and they get a drop of water). Right when they expected the reward, there were telltale signs of activity in their primary visual cortex (V1).
I figure, as above, that this talk about “rewards” is a red herring—the important thing is that the reward expectation coincides with some shift of the rat’s attention, which has ripple effects all over the cortex, and thus all parts of the cortex learn to expect those ripple effects.
Then the experimenters changed the number of licks required for the reward. The telltale signs of activity, predictably, shifted to the new reward time ... but not if the experimenters infused the rat brain (or more specifically the part of the visual cortex where they were recording) with a toxin that prevented acetylcholine effects. How do I account for that? Easy: acetylcholine = learning rate—see separate post. No acetylcholine, no learning. The visual cortex is still a learning algorithm, even if it’s not learning from rewards, but rather learning to expect a certain attentional shift within the brain.
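As a minimal sketch of what I mean by "acetylcholine = learning rate": suppose a patch of cortex keeps a running estimate of when some internal event (here, the attentional shift) will happen, and updates it with a delta rule whose step size is scaled by a local acetylcholine level. The function name `learn_event_time`, the parameter `ach_level`, and all the numbers are hypothetical, chosen only to illustrate the idea, not taken from the paper.

```python
def learn_event_time(old_time, new_time, ach_level, trials=50, base_lr=0.2):
    """Delta-rule estimate of when an event occurs; ACh scales the learning rate."""
    predicted = old_time
    for _ in range(trials):
        prediction_error = new_time - predicted  # no reward term anywhere
        predicted += base_lr * ach_level * prediction_error
    return predicted

# The event moves from t = 1.0 to t = 2.0 (arbitrary units).
with_ach = learn_event_time(old_time=1.0, new_time=2.0, ach_level=1.0)
blocked = learn_event_time(old_time=1.0, new_time=2.0, ach_level=0.0)
print("prediction with acetylcholine:   ", with_ach)
print("prediction with ACh blocked:     ", blocked)
```

With acetylcholine present the prediction tracks the new timing, and with it blocked the prediction stays stuck at the old timing, even though the update rule never looks at a reward.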
On to Hangya et al. (2015). The authors offer loads of evidence that reward and punishment cause acetylcholine to appear. But I don’t think they claim (or offer evidence) that the acetylcholine is communicating a reward. Indeed, if I’m reading it right, the acetylcholine spikes after both reward and punishment, whereas a reward signal like phasic dopamine needs to swing both positive and negative.
The evidence all seems to be consistent with my belief (see separate post) that acetylcholine controls learning rate, and that it’s biologically advantageous for the brain to use a higher learning rate when important things like reward and punishment are happening (and when you’re aroused, when a particular part of the brain is in the spotlight of top-down attention, etc.).
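The distinction I’m drawing can be put in a few lines of code. This is only a sketch of the two kinds of signal, with made-up names and constants: a signed reward prediction error (the dopamine-like thing) versus an unsigned "this matters, learn faster" signal (what I’m suggesting acetylcholine is doing).

```python
def reward_prediction_error(outcome, expected):
    """Signed teaching signal: positive for good surprises, negative for bad."""
    return outcome - expected

def learning_rate(outcome, attention=0.0, base=0.01, gain=0.1):
    """Unsigned salience signal: goes up for reward AND punishment alike."""
    return base + gain * (abs(outcome) + attention)

for name, outcome in [("reward", +1.0), ("punishment", -1.0), ("nothing", 0.0)]:
    print(name,
          "| signed error:", reward_prediction_error(outcome, expected=0.0),
          "| learning rate:", learning_rate(outcome))
```

Reward and punishment give opposite signed errors but the same elevated learning rate, which is the pattern I’d expect if the signal is saying "pay attention and update" rather than "that was good".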
If I’m reading it right (a big “if”!), everything in this paper is consistent with what I wrote above and in the other post. Acetylcholine determines learning rate, and you learn better by having the capability to set different learning rates at different times and in different parts of the brain, and the presence of rewards and punishments is one signal that maybe the learning rate should be unusually high right now.
Well, those are the main citations. This is a very quick-and-dirty analysis, but I’m sufficiently satisfied that I’m going to stick with my priors here: reward-based learning is not involved in sensory processing in the brain.
(Thanks Adam Marblestone for comments on a draft.)