bastak - LessWrong

Dopamine-supervised learning in mammals & fruit flies

Exception: if things are going exactly as expected, but it's really awful and painful and dangerous, there's apparently still a dopamine pause—it never gets fully predicted away

Interestingly, the same goes for serotonin (Fig. 7B in Matias 2017). But it's also not clear which subset of raphe neurons does this: the picture seems similar to dopamine, with projections to different areas responding differently to aversive stimuli.

Maybe you're thinking: it's aversive to put something salty in your mouth without salivating first.

Closer to this. Well, it wasn't a fully formed thought; I just came up with the salt example and thought there might be a problem here. What I meant is a kind of credit assignment problem: if dopamine in the midbrain depends on both the cortical action/thought and the assessor's action, how does the midbrain assign dopamine to both the cortical plan proposers and the assessors? I guess for this you need a situation where reward(plan1, assessor_action1) > 0 but reward(plan1, assessor_action2) < 0, and the salt example is bad here because reward > 0 in both the salivating and non-salivating cases. Maybe something like laughing inappropriately after being told about some tragedy: you got negative reward, but that doesn't mean the topic has to be avoided altogether in the future (which is what a decrease in dopamine would reinforce); rather, you should just change your assessor reaction, and the reward will become positive. My point was that it's not clear how this can happen if the only thing the cortical plan proposer sees is the negative dopamine, without additionally knowing that the assessors also got negative dopamine, so that the overall negative dopamine can be explained by the wrong assessor action alone and the plan proposer doesn't actually need to change anything.
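To make the worry concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the reward numbers are made up, `scalar_update` is a plain delta rule, and nothing here is claimed to be how the midbrain actually works. It only shows that when the proposer and the assessor both learn from the same scalar, a bad assessor action drags down the value of a plan that would have been fine with a different reaction.

```python
# Hypothetical reward table keyed by (plan, assessor_action); numbers are invented.
REWARD = {
    ("discuss_tragedy", "laugh"): -1.0,
    ("discuss_tragedy", "stay_serious"): +1.0,
}

def scalar_update(value, reward, lr=0.5):
    """Delta-rule update driven by one shared scalar signal."""
    return value + lr * (reward - value)

plan_value = 0.0    # proposer's estimate for the plan "discuss_tragedy"
laugh_value = 0.0   # assessor's estimate for the reaction "laugh"

# One bad episode: the plan was fine, the reaction was not.
r = REWARD[("discuss_tragedy", "laugh")]
plan_value = scalar_update(plan_value, r)
laugh_value = scalar_update(laugh_value, r)

# What the plan could have earned with the best available reaction.
best_for_plan = max(v for (plan, _), v in REWARD.items() if plan == "discuss_tragedy")

print(plan_value, laugh_value, best_for_plan)
```

Both estimates drop by the same amount, even though `best_for_plan` is positive; the shared scalar alone gives the proposer no way to tell that only the reaction needed to change.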

Dopamine-supervised learning in mammals & fruit flies

bastak3y30

Thanks for the answers!

I'm reluctant to make any strong connection between self-supervised learning and "dopamine-supervised learning" though. The reason is: Dopamine-supervised learning would require (at least) one dopamine neuron per dimension of the output space

I totally agree that there is not enough dimensionality of dopamine signals to provide the teaching feedback in self-supervised learning of the same specificity as in supervised learning.

What I was trying to say more generally is that maybe dopamine is involved in self-supervised learning only by providing a permissive signal to update the model. I was also trying to understand how sensory PE is related to dopamine release.

For sensory prediction areas, cortical learning doesn't really need dopamine, I don't think

That's what I always assumed before Sharpe 2017. But in their experiments, inhibiting dopamine release inhibits learning the association between two stimuli: PE is still there, there is little dopamine release, and no model is learned. By "PE is still there" I mean that the PE still gets registered by neurons (the mouse does not become inattentive or blind upon dopamine inhibition), yet the model is not learned even though (pyramidal?) neurons are signaling the presence of PE. This is the most interesting case, compared to the plain "goes blind" case. If by learning in sensory prediction areas you mean modifying synapses in V1, I agree: you might not need synaptic changes or dopamine there. Sensory learning (and the need for dopamine) can happen somewhere else (hippocampus/entorhinal cortex? no clue), in areas that send predictions to V1. The model is then learned at the level of knowing when to fire predictions from entorhinish cortex to V1.

Even if this "dopamine permits updating the sensory model" idea is true, I also don't get why you would need dopamine as an intermediate node between PE and updating the model. Why not just update the model once you get a cortical PE (signaled by pyramidal neurons)? But there is an interesting connection to schizophrenia: dopamine release is abnormal in schizophrenic patients, so maybe they needlessly update their models because upregulated dopamine says so (I found this in Sharpe 2020).

And the reward predictions should also converge to the actual rewards, which would give average RPE of 0, to a first approximation.

I guess I misunderstood your model. I assumed that, for a given environment, the ideal policy would lead to a big dopamine release saying "this was a really good plan, repeat it next time". After rereading your decision-making post, it seems that the assessors predict the reward, so there will be no dopamine because RPE = 0?
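A minimal sketch of that reading, assuming a single constant reward and a plain delta-rule predictor (the function name, learning rate, and trial count are all made up for illustration, not taken from the post):

```python
def rpe_trace(reward=1.0, lr=0.3, trials=30):
    """Return the reward prediction error on each trial of a delta-rule learner."""
    predicted = 0.0
    trace = []
    for _ in range(trials):
        rpe = reward - predicted   # RPE = actual reward minus predicted reward
        trace.append(rpe)
        predicted += lr * rpe      # prediction moves toward the actual reward
    return trace

trace = rpe_trace()
print(trace[0], trace[-1])
```

The first trial has the full RPE, and it shrinks geometrically toward 0 as the prediction catches up, even though the reward itself stays just as good.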

Side question: when you talk about plan assessors, do you think there should be some mechanism in the brainstem that corrects the RPE signals going to the cortex based on the signals sent to the supervised-learning plan assessors? For example, if the plan is to "go eat" and your untrained amygdala says "we don't need to salivate", and you don't salivate, then you get a much smaller reward (especially after crunchy chips) than if you had salivated. Sure, the amygdala and other assessors will get their supervisory signal, but it also seems to me that the plan "go eat" is not that bad, and it shouldn't be penalized that much just because the amygdala screwed up and you didn't salivate. So shouldn't the reward signal be corrected somehow?

Dopamine-supervised learning in mammals & fruit flies

bastak3y30

Thanks for the post!

I wonder if dopamine also might be one of the key elements for self-supervised learning (predicting some sensory input based on previous sensory input - or is it supervised learning in your terminology?).

The reason to suspect dopamine in self-supervised learning is the Sharpe 2017 paper. I was quite surprised that in their experiment dopamine could unblock learning, and that inhibiting dopamine release led to a weaker association between two sensory stimuli.

Here is how I currently think about a model that interprets their result, and how it relates to this drosophila dopamine/cerebellum model:

Consider their blocking paradigm:

A->X

AC->X

X->US

Then they say that dopamine serves as a proxy for prediction error (sensory prediction error triggers the release of DA?) and allows learning.

I am not sure that sensory prediction error always equals dopamine release (I mean the dopamine release that corresponds to sensory learning, not the dopamine neurons that are more about RPE).

What I mean is that we can try to dissociate the cases "we have a sensory prediction error (PE)" and "we are learning/searching for a better model". Then we have 4 options:

no PE, no learning better model: trivial situation
have PE, learning better model: you have unpredictable sensory input (X) and you allow your circuit to "listen" to the context lines, extracting those that might be predictive of X. Dopamine supposedly allows this information to be extracted and a better model to be built.
have PE, no learning better model: something to do with lack of attention? Lights are flashing, but you don't care.
no PE, learning better model: the trickiest case, and the rarest in reality. This is probably what happens in their unblocking experiment: there is no PE during "AC->X", because A already predicts X. The funny thing is that C is just as good a predictor as A (and might be helpful to learn if the environment changes!), but because there is no PE to trigger DA (which would trigger listening to the context lines of C), the model is not updated. Only when they optogenetically release DA does it allow learning C->X. I can imagine some weird cases where you learn C->X even though you already saw A->X and are satisfied with that explanation when you see AC->X, e.g. you are super attentive and notice it, or somebody told you. But I suspect this is hard even for humans: once you have converged on some explanation, it's hard to consider alternatives :). In these cases I'd expect to see DA release even though you don't have PE.
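The blocking and unblocking cases can be sketched with a Rescorla-Wagner-style toy model in Python. This is only an illustration under strong assumptions: treating the neutral stimulus X as the "outcome" of a Rescorla-Wagner update is my simplification, and `da_boost` is a hypothetical extra teaching term standing in for optogenetic dopamine release, not anything measured in Sharpe 2017.

```python
def train(weights, cues, outcome, trials, lr=0.2, da_boost=0.0):
    """Rescorla-Wagner-style updates on the cues that are present.

    da_boost is a hypothetical stand-in for externally triggered dopamine:
    it adds to the teaching signal even when the prediction error is zero.
    """
    for _ in range(trials):
        prediction = sum(weights[c] for c in cues)
        error = outcome - prediction      # sensory prediction error for X
        teaching = error + da_boost       # what actually drives learning
        for c in cues:
            weights[c] += lr * teaching
    return weights

def run(da_boost):
    w = {"A": 0.0, "C": 0.0}
    train(w, ["A"], 1.0, trials=50)                            # phase 1: A->X
    train(w, ["A", "C"], 1.0, trials=10, da_boost=da_boost)    # phase 2: AC->X
    return w

blocked = run(da_boost=0.0)
unblocked = run(da_boost=0.3)
print(blocked["C"], unblocked["C"])
```

Without the boost, A already predicts X, the error in phase 2 is close to zero, and C learns essentially nothing (blocking). With the boost, C picks up associative strength despite there being no prediction error to start with. A side effect of this crude model is that A's weight also overshoots, which is one reason to read it as an illustration rather than a fit to the data.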

So the explanation "DA triggers looking for which context lines predict the sensory input well, and learning them" is consistent with the Sharpe 2017 data and with this drosophila circuit. It might also be consistent with what you write about DA in IT in response to aversive/exciting stimuli. According to this "DA for self-supervised learning" explanation, DA in IT would mean: "hey IT, you are in a dangerous situation; even though you don't have any PE, extract some more predictive signals, they might be helpful in the future" (this would be equivalent to starting to learn C->X when presented with AC->X in the blocking paradigm). In contrast, your explanation is more like (if I understood you correctly) "hey IT neurons that fired to the suspicious bush, do it again next time". Correct me if you meant something different, or if these two explanations don't actually contradict each other. Or might DA do both?

Is it also correct that, in a completely naive brain, DA for global/local RPEs and DA for supervised/self-supervised learning should go in different directions? I.e., for a bad supervised-learning model there should be tons of DA (lots of errors), whereas RPE DA (the one you talk about in the Big Picture) should be low, since no action/thought is good enough to be rewarded by the brainstem. This assumes, of course, that learning is triggered by high DA (Sharpe 2017 hints that it is high DA, not low, that triggers learning/PE).
