
Crossposted from my personal blog.
 

Everybody knows about the hedonic treadmill. Your hedonic state adjusts to your circumstances over time and quickly reverts to a mostly stable baseline. This is true of basic physiological needs -- you feel hungry; you seek out food; you eat; you feel sated, and you no longer seek food. It also applies to more subjective psychological states. You cannot, normally, feel extremely happy, or extremely sad, forever. If life is getting better, you feel some brief initial happiness, but this quickly reverts to baseline. The same happens, to a surprising extent, on the downward side: badness can quickly become normalized and accepted. This cycle of valuation and devaluation makes human behaviour difficult [1] to interpret as that of a utility maximizer with a fixed reward or utility function. It is also one of the reasons why human experience is, in many ways, very unlike that of a utility maximizer.


This kind of loop is also a good fit to many biological problems, and indeed to many other real-world problems, which are satisficing rather than maximizing objectives. For example, you usually want to eat enough food to have sufficient energy to go about your other objectives, not to maximize your total food consumption. From an AI safety perspective, satisficing objectives, if we can figure out how to robustly encode them and make them stable, are likely to be much safer than maximization objectives[2], especially if the satisfiable region is broad. This is because the objective is bounded and, if relatively easily achievable, there is much less of a strong incentive towards instrumental convergence to generic power-seeking behaviour. Satisficing objectives in the brain do not appear to be implemented by some simple method like quantilizing a reward function. Instead, there is a more complex PID-like control loop, in which the salience and importance of the various objectives are flexibly increased and decreased in line with physiological needs so as to maintain homeostasis. I have written previously about the benefits of dynamic reward functions and homeostatic control for alignment, and specifically for preventing the many pathologies of pure maximization.
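As a toy illustration of the kind of loop I mean (a sketch only -- the controller, constants, and reward components are invented for illustration, and no claim is made that the brain implements anything this simple), a proportional controller can up- and down-weight a 'food' reward term according to an internal energy variable:

```python
class HomeostaticRewardController:
    """Toy proportional controller that scales the salience of a 'food' reward
    term so as to keep an internal energy variable near a setpoint. All names
    and constants here are invented for illustration."""

    def __init__(self, setpoint=1.0, gain=2.0):
        self.setpoint = setpoint
        self.gain = gain

    def food_weight(self, energy):
        error = self.setpoint - energy      # how far below the setpoint we are
        return max(0.0, self.gain * error)  # only under-satiation creates drive


def modulated_reward(base_rewards, energy, controller):
    """Combine a fixed 'exploration' reward with a dynamically weighted 'food' term."""
    return (controller.food_weight(energy) * base_rewards["food"]
            + base_rewards["exploration"])


controller = HomeostaticRewardController()
for energy in (0.2, 0.6, 1.0, 1.2):
    r = modulated_reward({"food": 1.0, "exploration": 0.1}, energy, controller)
    print(f"energy={energy:.1f} -> effective reward for eating = {r:.2f}")
```

The satiated agent is not punished for eating; the food term simply stops contributing, which is the satisficing behaviour described above.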


However, there is a more important angle. Beyond just satisficing, hedonic treadmills are an existence proof of dynamic, flexible, and corrigible value changes occurring regularly in generally intelligent creatures like humans. Moreover, both reinforcement learning and these homeostatic control mechanisms are evolutionarily ancient (much older than the neocortex), so it is likely that they are very simple algorithms at their core. Importantly, the existence and ubiquity of such loops show that policies and values learnt by reinforcement learning algorithms can be dynamically controlled in a flexible and (almost entirely)[3] corrigible way. This means that there must exist RL algorithms which give an outside system very powerful levers over the agent's internal objectives, and which allow those objectives to be flexibly changed while maintaining performance and coherent behaviour. In the brain, these levers are mostly pulled by fairly simple homeostatic loops controlled by the hypothalamus, but this doesn't have to be the case. The level of flexibility is such that we could create very complex and robust dynamic 'value programs' scaffolded around our general RL algorithms. Exactly what the best 'value programs' are is an open question which needs to be experimented with, but the primary issue is simply creating RL algorithms that allow for dynamic revaluation in the first place.


Building agents with such algorithms, wrapped in various homeostatic control loops, would let us use the power of RL to optimize over nondifferentiable multi-step environments, while also providing us with a huge amount of dynamic control over the objective(s) pursued by the resulting agent. In the ideal case, this control could let us mostly automate the mitigation of various pathologies of strong maximization, such as Goodharting and unintended consequences; allow us to dynamically update the reward function if it turns out to be misspecified; and let us encode a high degree of conservatism in the AGI's actions -- i.e. limiting the maximum amount of optimization pressure applied by the agent, preventing the agent from acting when it has a high degree of uncertainty over its reward function, and so on.
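To make the last point concrete, here is one purely illustrative way such conservatism could be wired around an agent: query an ensemble of learnt reward models and defer to a safe no-op action whenever the ensemble disagrees too much about a candidate action. The ensemble, the threshold, and the no-op action are all assumptions of this sketch, not a description of any existing system.

```python
import numpy as np

def conservative_action(candidate_actions, reward_models, noop_action,
                        disagreement_threshold=0.5):
    """Pick the action whose mean predicted reward is highest, but fall back to
    a designated no-op whenever the reward-model ensemble disagrees too much.
    The ensemble, threshold, and no-op are illustrative assumptions."""
    best_action, best_mean = noop_action, -np.inf
    for action in candidate_actions:
        predictions = np.array([model(action) for model in reward_models])
        if predictions.std() > disagreement_threshold:
            continue  # too uncertain about this action's reward: refuse to take it
        if predictions.mean() > best_mean:
            best_action, best_mean = action, predictions.mean()
    return best_action

# Toy ensemble: three slightly different guesses at the same reward function.
reward_models = [lambda a, k=k: -(a - 1.0) ** 2 + 0.1 * k for k in range(3)]
print(conservative_action([0.0, 0.5, 1.0, 2.0], reward_models, noop_action=0.0))
```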


Hedonic treadmills in the brain


While the actual neuroscientific implementation is largely irrelevant for the existence proof, it is quite likely that understanding how such loops are implemented in the brain would provide insight into how to implement them in ML systems. Unfortunately, we are very far from understanding how even basic hedonic loops like those involved in food consumption work at a mechanistic level. In my opinion, how such dynamic control is achieved algorithmically is one of the most important and fundamental unsolved questions in both RL theory and neuroscience.


From the neuroscientific perspective, taking feeding as an exemplar case, the start of the loop is moderately well characterized and is controlled by the hypothalamus, which receives and releases various feeding inducers and inhibitors; these monitor and control food and glucose levels in the blood and are released when food is tasted or detected in the stomach. This processing is controlled by a few specialized nuclei in the hypothalamus and has relatively simple slow control loop dynamics which simply apply inverse control to a prediction error outside the ideal range. These nuclei then project to various regions in the brain, including the dopaminergic neurons in the VTA that are central to behavioural selection and reinforcement learning in the basal ganglia and cortex. Presumably, this signalling carries information that tells the dopamine neurons to modulate their reward and reward prediction error firing -- making food more or less desired, as appropriate. However, here is the big puzzle: how exactly does modulating the reward signal lead to rapid and flexible changes of behaviour?


The standard model of the basal ganglia is as a model-free policy trained with RPE firing from dopamine neurons using temporal difference or actor-critic learning. This learning approach causes the network to learn an amortized policy that simply maps from states to good actions. Importantly, this policy would naturally be highly specialized to a specific reward function. Naively, you can't change the reward function and expect the policy to instantly adapt; instead you would have to retrain the network from scratch. This is because both the policy (if it is a learnt neural network) and the value function amortize and compress huge amounts of information into a relatively small object, which would be expected to come at a cost to generality due to the nonlinear compression mapping. If the reward function changes, the value function and optimal policy change in a nonlinear way which is hard to understand in general.
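To see why the amortization is the problem, consider a minimal sketch of textbook tabular TD(0) evaluation (a simplification for illustration, not a model of the basal ganglia): the reward enters every update, so the converged value table is a compressed summary of that particular reward function, and nothing in it tells you the values for a different one.

```python
import numpy as np

def td0_value(env_step, reward_fn, n_states, episodes=2000,
              alpha=0.1, gamma=0.95, seed=0):
    """Tabular TD(0) evaluation of a fixed random-walk policy.

    env_step(state, rng) -> next_state; reward_fn(state) -> scalar reward.
    The learnt table V is specific to reward_fn: change the reward function and
    the whole table has to be relearnt from fresh experience.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = rng.integers(n_states)
        for _ in range(50):
            s_next = env_step(s, rng)
            td_error = reward_fn(s_next) + gamma * V[s_next] - V[s]  # RPE-style signal
            V[s] += alpha * td_error
            s = s_next
    return V

# Toy world: drift randomly left or right around a ring of 10 states.
step = lambda s, rng: (s + rng.choice([-1, 1])) % 10
V_food = td0_value(step, lambda s: 1.0 if s == 0 else 0.0, n_states=10)
V_water = td0_value(step, lambda s: 1.0 if s == 5 else 0.0, n_states=10)
# Nothing in V_food tells you how to read off V_water without rerunning the learning.
```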


It is possible to argue that in humans and higher mammals this is dealt with using model-based planning in the cortical substrate. However, this argument is insufficient on two grounds. Firstly, hedonic loops are not some kind of weird rare exception: they are fundamental to human decision-making and the human condition in general. They occur all day, every day. If this argument were true, it would essentially mean that, because your reward function changes so rapidly, the entire model-free RL portion of your brain was useless (in fact, it would be actively counterproductive, since it would keep pushing for old and obsolete policies!). Given the expense of maintaining a basal ganglia and model-free RL system, it seems extremely unlikely that this would be maintained as a spandrel or vestigial brain region. Secondly, and more conclusively, hedonic control loops are an evolutionarily ancient invention which does not depend on the cortex, for the obvious reason that fixed utility maximization is bad in almost any biological context -- almost all biological variables need to be satisficed and kept in a healthy range rather than maximized. Such loops exist in insects, which have complex behaviours but no cortex and operate entirely by model-free RL. Indeed, we are starting to understand the neural bases of these circuits in fruit flies. For hedonic loops to work at all, there must be some model-free RL algorithm which allows flexible goal changes to result in effective changes to the learnt policy without retraining. Moreover, this algorithm must be simple -- it was discovered by evolution in ancient prehistory and does not require highly developed brains with advanced unsupervised cortices. Beyond this, the algorithm must allow some kind of linear interpolation of policies -- i.e. you can smoothly adjust the 'weightings' between different objectives and the policies update seamlessly and coherently with the new weighted reward function.


Before getting to some proposed solutions, let's think about the larger picture. What does this mean? Essentially, that flexible, compositional policies are possible, and that policies can be smoothly interpolated between reward functions [4]. This is a huge amount of control to have over an RL agent. If we have the right algorithm and access to the right levers, we can tweak the reward function as we go (including quite drastically) to modify and control behaviour over time. In other words, there is some form of model-free RL that is intrinsically highly corrigible and tameable, in which complex learnt policies can be flexibly controlled by relatively simple 'outer loops'. Evolution uses such loops to control a wide variety of behaviours, including maintaining homeostasis in many different domains simultaneously, as well as implementing our general hedonic loop. These loops mostly appear to be relatively simple hardcoded PID controllers which, for simple drives such as hunger, are likely implemented directly in the relevant hypothalamic nuclei.

What this shows is that it is possible to tame RL algorithms such that we have levers to update and control, at runtime and without retraining, the effective reward function that the learnt policies optimize. The question is what such algorithms are and how we can build them, since current model-free RL does not appear to have these nice properties.

I puzzled over this for a while in 2021, and eventually ended up writing a paper which proposed the reward basis model. The idea is that I showed that if you learn a set of different reward functions at once, and assume a fixed policy, then the value function of a linear combination of those reward functions can be expressed as a linear combination of the value functions for each reward. What this means is that if you learn a set of reward function 'bases', and learn a value function for each reward basis, then you can instantly generalize to any value function in the span of the value bases. Given a value (or Q) function, it is then trivial to locally argmax it in a discrete action space to obtain an optimal policy. This is related to successor representations but is more memory efficient (at the cost of some flexibility) and is, in my opinion, a nicer decomposition overall. It is one of those things that is so simple I was amazed nobody had discovered it earlier. I argued that this is probably what the basal ganglia are doing and how they achieve their behavioural flexibility in the face of changing reward functions, and there is a fair bit of supporting, if circumstantial, evidence for this hypothesis, both from the heterogeneity of dopamine firing (different dopamine neurons represent different combinations of reward bases) and from circuit-level evidence in fruit flies, where the mushroom body, the region that coordinates model-free RL, actually appears to implement a very similar algorithm.
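A minimal sketch of the reward-basis idea (simplified relative to the paper, and note that the exactness of the linear recombination relies on the fixed-policy assumption mentioned above): learn one Q-table per reward basis, then recombine them linearly at decision time when the homeostatic weighting changes, with no retraining.

```python
import numpy as np

def greedy_policy_from_bases(Q_bases, weights):
    """Zero-shot policy for a new reward weighting.

    Q_bases: array of shape (n_bases, n_states, n_actions), one Q-table per
             reward basis (assumed already learnt, e.g. by TD as above).
    weights: array of shape (n_bases,), the current homeostatic weighting.
    """
    Q_combined = np.tensordot(weights, Q_bases, axes=1)  # (n_states, n_actions)
    return Q_combined.argmax(axis=1)                     # greedy action per state

# Two toy bases over 4 states x 2 actions: 'seek food' vs 'seek water'.
Q_food = np.array([[1.0, 0.0], [0.8, 0.1], [0.2, 0.5], [0.0, 1.0]])
Q_water = np.array([[0.0, 1.0], [0.1, 0.9], [0.6, 0.3], [1.0, 0.0]])
Q_bases = np.stack([Q_food, Q_water])

print(greedy_policy_from_bases(Q_bases, weights=np.array([1.0, 0.0])))  # hungry
print(greedy_policy_from_bases(Q_bases, weights=np.array([0.2, 1.0])))  # thirsty
```

In a continuous state space the Q-tables would be replaced by function approximators, but the zero-shot recombination step stays the same.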


Taming RL with linear reinforcement learning


Having read more of the field in the last two years, I eventually stumbled upon Emo Todorov's work[5] and realized that what I was doing was essentially rederiving (badly) the rudiments of a field called linear RL, which Todorov and collaborators invented between 2007 and 2012 and brought to a high degree of theoretical sophistication. I think this field has been surprisingly understudied -- almost nobody even within RL knows about it -- despite the power of its results.


Essentially, what they showed is that it is possible to derive RL algorithms which solve a subclass of MDPs that they call linear MDPs (lMDPs). These algorithms have a number of nice properties. Firstly, subproblems of the Bellman recursion, such as optimal action selection, can be solved analytically. Secondly, the resulting policies and value functions have incredibly nice properties, the most important of which is linearity: both policies and value functions can be linearly composed. If you have two policies, you can construct a linear combination of them, and this is the optimal policy for a reward function which is the corresponding linear combination of the reward functions used to train each policy individually. This means that it is straightforward to decompose a complex RL task into composable 'skills' which can be reused and recycled as needed. It also allows extremely powerful compositional generalization from a set of base policies to all policies in their span. From an alignment perspective, this would give us a set of useful and powerful control 'knobs' over the behaviour of a model, which we could dynamically adjust to control its behaviour. This could be done autonomously, with meta-level control systems such as those that regulate homeostatic loops like feeding in the brain, but also *directly*.
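For readers who want the flavour of the underlying result, here is my paraphrase of the linearly solvable MDP machinery, written from memory -- consult Todorov's papers for the precise statement and conditions.

```latex
% Define the desirability function as the exponentiated negative value:
z(x) = e^{-v(x)}
% The Bellman equation then becomes *linear* in z (q is the state cost, p the
% passive dynamics), which is why these subproblems can be solved analytically:
z(x) = e^{-q(x)} \sum_{x'} p(x' \mid x)\, z(x')
% Compositionality: for component problems sharing dynamics and state costs but
% differing in terminal costs g_k, if the composite terminal cost satisfies
e^{-g(x)} = \sum_k w_k\, e^{-g_k(x)}
% then the composite desirability (and hence the optimal policy) is obtained by
% the same linear combination, with no new learning required:
z(x) = \sum_k w_k\, z_k(x)
```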


While having very nice properties, linear MDPs are, of course, a highly restrictive subset of all possible MDPs, and it was perhaps unclear how much of this would transfer to more complex behaviours. However, recent results are showing that many of these properties are also present, at least to some extent, in deep RL networks. For instance, Haarnoja showed heuristically that a basic linear combination of entropy-weighted policies works well in practice, and a number of important recent theoretical papers have since proven bounds on how well such weighted combinations perform for entropy-regularized policies. Indeed, people have also worked out how to build a predicate logic of policies: it is possible to take the AND, OR, and NOT of policies (i.e. their intersections, unions, and negations). This has been generalized to the concept of 'skill machines', which appear to allow a high degree of compositional generalization even in deep learning systems trained with RL.
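As a rough illustration of what this composition looks like in the entropy-regularized setting (my simplification of the Haarnoja-style and Boolean-composition results; the exact guarantees depend on conditions I am glossing over here), soft Q-values can be combined elementwise and the composed Boltzmann policy read off directly:

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """Softmax (maximum-entropy) policy over a vector of Q-values."""
    logits = q_values / temperature
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def compose_and(q1, q2):
    """'AND'-style composition: roughly, do well on both tasks.
    (The literature uses averages or minima of soft Q-functions; the average
    is used here as the simplest illustrative choice.)"""
    return 0.5 * (q1 + q2)

def compose_or(q1, q2):
    """'OR'-style composition: do well on at least one task."""
    return np.maximum(q1, q2)

# Toy soft Q-values over four actions for two tasks.
q_get_food = np.array([2.0, 0.5, 0.0, -1.0])
q_get_water = np.array([-1.0, 0.5, 2.0, 0.0])

print(boltzmann_policy(compose_and(q_get_food, q_get_water)))
print(boltzmann_policy(compose_or(q_get_food, q_get_water)))
```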


My suspicion is that the surprising success of this approach comes from the fact that DL networks appear to be secretly almost linear. This is supported by recent findings that some RL-trained networks appear to have naturally learnt a linear world model. Suppose the latent space is good enough that policies can become a relatively simple (ideally linear) function of the latent states of this world model. In such a world, essentially all RL is linear RL, since the MDP 'state' that the RL algorithms operate on is in fact the linear latent state. The policies trained on this latent state thus inherit all the nice properties of linear RL algorithms and are, in fact, highly controllable. Moreover, this could also be exploited directly: if you are training a new policy on a given latent space, it may be possible to simply initialize it with an easily computable linear RL policy.
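To spell the speculation out as a toy sketch (every component here is invented for illustration): if a frozen encoder produces a sufficiently linear latent state, then value readouts on that latent can themselves be linear, and a readout for a new mixture of objectives is obtained by linear algebra rather than retraining.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume (this is the speculative part) a frozen encoder whose latent state is
# roughly a linear function of the observation. Here it is literally linear.
W_encoder = rng.normal(size=(8, 32))          # observation (32-d) -> latent (8-d)
encode = lambda obs: W_encoder @ obs

# Two previously learnt linear value readouts on that latent, one per reward basis.
w_value_food = rng.normal(size=8)
w_value_water = rng.normal(size=8)

# A new mixed objective (e.g. 70% hungry, 30% thirsty) gets its value readout by
# linear combination -- no gradient steps on the new objective are needed.
w_value_mixed = 0.7 * w_value_food + 0.3 * w_value_water

obs = rng.normal(size=32)
z = encode(obs)
print(w_value_mixed @ z, 0.7 * (w_value_food @ z) + 0.3 * (w_value_water @ z))
# The two numbers agree (up to floating point) because everything downstream of
# the latent is linear.
```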


While still very speculative, the picture that may be emerging is that the learnt policies in RL are not intrinsically inscrutable objects, but in fact have a rich and mostly linear internal structure which gives us many levers to compose and control them from the outside. This would both allow us to build agents that generalize better and adapt more flexibly to changing reward functions, and make them much more controllable and 'corrigible' to us than a pure fixed-reward maximizer would be. It is important not to get carried away, however. The evidence for this at large scales is speculative, and there are definitely many potential obstacles in the path before we get reliable methods for controlling and composing RL policies at scale. However, the goal no longer seems completely insurmountable. We may just live in a world where RL turns out to be relatively controllable.

If this is the case, then we would be able to get a significant amount of alignment mileage out of building value/reward control systems for our RL agents, and could likely control even very powerful RL agents this way. This is how it happens in the brain, where a very large and powerful unsupervised cortical system can be very well controlled by a few simple loops operating out of the hypothalamus. Designing and understanding the properties and failure modes of various outer value programs would be extremely important in making such systems robust enough to actually be safe. Additionally, the current niceness and linearity of deep networks seems to be coincidental, and it is also imperfect. It is likely that we can improve the intrinsic linearity of the internal representations through a variety of means, including explicit regularization and designing architectures with inductive biases that encourage this structure in the latent space. Beyond this, a greater theoretical understanding of linear RL, including extensions from the discrete domain on which it is defined to the continuous latent spaces of deep neural networks, is sorely needed.
 

  1. ^

    Specifically, the utility function can no longer be defined only over the current external state. Either the current state must include the physiological state of the hedonic loop itself -- i.e. there is an explicitly different reward function for each state of the hedonic loop -- or, alternatively, the utility function must be defined over *histories* of states. In either case, this significantly expands the dimensionality of the problem.

  2. ^

    A small subfield of AI safety work has explored simply not directly encoding or implementing maximizing agents. This is the idea behind quantilization, distribution matching, and regularization by impact measures.

  3. ^

    Our hedonic treadmills are surprisingly corrigible. Very rarely do we choose to fight them directly, and when we do it is by hacking the outer control loops rather than attacking the corrigibility of the objective itself. With food especially, there are a bunch of small counterexamples. Bulimics throw up food after eating, sating some of their hunger impulses by triggering satiety signalling (suppressing ghrelin, etc.) while making sure they actually absorb minimal calories from their food. Conversely (and maybe apocryphally), the ancient Romans would throw up food to make room for more food during banquets. Some psychological disorders, especially severe depression, also seem to short-circuit these hedonic loops, preventing a return from extreme sadness to the normal baseline. Additionally, it is not clear whether this corrigibility would be maintained under arbitrary self-modification and reflection. For instance, it is likely that a lot of people would remove their homeostatic control loops to effectively wirehead themselves -- i.e. obtain pleasure without building any 'tolerance' to it. Indeed, these loops are one of the main mechanisms the brain uses to prevent itself from wireheading, with surprising but partial success.

  4. ^

    One trivial possibility is simply that the brain learns a separate policy for each reward function weighting. While very expensive in computation and memory, this is potentially a solution. However, we have fairly definitive experimental evidence that this is not the case in mammals. Specifically, the experiments by Morrison and Berridge demonstrated that, by intervening on the hypothalamic valuation circuits, it is possible to adjust policies zero-shot, such that the animal seeks out a previously repulsive stimulus in a novel physiological state in which it has never experienced that stimulus as pleasurable. This implies that if separate policies are learnt for each internal state, they cannot have been learnt in a purely model-free manner, which would require actually experiencing the stimulus to update the value function / policies.

  5. ^

    As a minor aside, I would also highly recommend Todorov's paper as an accessible and intuitive introduction to the core ideas of control as inference, from a primarily neuroscience perspective.

14 comments

I don't think linearity of policies is the answer, at all. If I drive up to an intersection, I might want to have a policy of muscle contractions that causes me to turn left or turn right, but I don't want to turn an entropy-weighted angle between 90 and -90 degrees.

Approximate linearity might work better - each drive outputting preferences for different high-level policies, and policies being "added" by randomly selecting just one to implement. This doesn't work very well for feed-forward networks, but it works for recurrent networks - you can fall into an attractor state where you competently turn the car left over many time steps, even though you also had a chance of falling into the attractor state of turning right competently.

Excellent point. The basal ganglia is thought to address this problem by "gating" one motor plan while suppressing the others that narrowly lost the competition for selection. It probably performs a similar function in abstract decision-making. See my paper Neural mechanisms of human decision-making (linked in another comment on this post) for more on this.

Importantly, this policy would naturally be highly specialized to a specific reward function. Naively, you can't change the reward function and expect the policy to instantly adapt; instead you would have to retrain the network from scratch.

I don't understand why standard RL algorithms in the basal ganglia wouldn't work. Like, most RL problems have elements that can be viewed as homeostatic - if you're playing boxcart then you need to go left/right depending on position. Why can't that generalise to seeking food iff stomach is empty? Optimizing for a specific reward function doesn't seem to preclude that function itself being a function of other things (which just makes it a more complex function).

What am I missing?

This is definitely possible and is essentially augmenting the state variables with additional homeostatic variables and then learning policies on the joint state space. However, there are some clever experiments, such as the linked Morrison and Berridge one, demonstrating that this is not all that is going on -- specifically, many animals appear to be able to perform zero-shot changes in policy when rewards change, even if they have not experienced this specific homeostatic variable before -- i.e. mice suddenly chase after salt water, which they previously disliked, when put in a state of salt deprivation which they had never before experienced.

The above is describing the model-free component of learning reward-function dependent policies. The Morrison and Berridge salt experiment is demonstrating the model-based side, which probably comes from imagining specific outcomes and how they'd feel.

This is where I disagree! I don't think the Morrison and Berridge experiment demonstrates the model-based side. It is consistent with model-based RL, but it is also consistent with model-free algorithms that can flexibly adapt to changing reward functions, such as linear RL. Personally, I think the latter is more likely, since it is such a low-level response which can be modulated entirely by subcortical systems, and so seems unlikely to require model-based planning to work.

how exactly does modulating the reward signal lead to rapid and flexible changes of behaviour?

On my current models, I would split it into three mechanisms:

The first involves interoceptive signals going from the hypothalamus / brainstem / etc. to the cortical mantle (neocortex, hippocampus, etc.), where at least part of it seems to be the same type of input data as any other sensory signal going to the cortical mantle (vision, etc.). Thus, you’re learning a predictive world-model that includes (among other things) predicting / explaining your current (and immediate-future) interoceptive state.

The second involves the “main” scalar RL system. We learn how interoceptive inputs are relevant to the rewarding-ness of different thoughts and actions in the same way that we learn how any other sensory input is relevant to the rewarding-ness of different thoughts and actions. This is closely related to so-called “incentive learning” in psychology. I have a half-written draft post on incentive learning, I can share it if you want, it doesn’t seem like I’m going to finish it anytime soon. :-P

The third involves what I call “visceral predictions”. The predictions are happening in part of the extended amygdala, part of the lateral septum, and part of the nucleus accumbens shell, more or less. These are probably hundreds of little trained models that learn predictive models of things like “Given what the cortex is doing right now, am I going to taste salt soon? Am I going to get goosebumps soon? Etc.” These signals go down to the brainstem, where they can and do feed into the reward function. I think this is at least vaguely akin to your reward basis paper. :)

I agree that it's odd that RL as a field hasn't dealt much with multi-goal problems. I wrote two papers that address this in the neuroscience domain:

How sequential interactive processing within frontostriatal loops supports a continuum of habitual to controlled processing

Neural mechanisms of human decision-making

The first deals a lot with the tangled terminology used in the field; the second implements a mixed model-free and model-based neural network model of human decision-making in a multi-goal task. It makes decisions using interactions between the basal ganglia, dopamine system, and cortex.

It implements the type of system you mention: the current reward function is provided to the learning system as an input. The system learns to do both model-free and model-based decision-making using experience with outcomes of the different reward functions.

Obviously I think this is how the mammalian brain does it; I also think that the indirect evidence for this is pretty strong.

I don't think this makes the learning space vastly larger, because the dimensionality of reward functions is much lower than that of environments. I won't reach for the fridge handle if I'm stuffed (0 on the hunger reward function); but neither will I reach for it if there's no fridge near me, or if it's a stranger's fridge, etc.

So that's on the side of how the brain does multi-goal RL for decision-making.

As for the implications for alignment, I'm less excited. 

Being able to give your agent a different goal and have it flexibly pursue that goal is great as long as you're in control of that agent. If it actually gains autonomy (which I expect any human or better level AGI to do sooner or later, probably sooner), it's going to do whatever it wants. Which is now more complicated than a single-goal, model-based (consequentialist) agent, but that seems to be neither here nor there for alignment purposes.

This does help lead to a human-like AGI, but I see humans as barely-aligned-on-average, so that any deviation from an accurate copy could well be deadly. In particular, sociopaths do not seem to be adequately aligned, and since sociopathy has a large genetic component, I think we'd need to figure out what that adds. Odds are it's a specific pro-social instinct, probably implemented as a reward function. It's this logic that leads to Steve Byrnes' research agenda.

Maybe I'm not getting the full implications you see for alignment? I'd sure like to increase my estimate of aligning a brainlike AGI, because I think there's a good chance we get something in that general architecture as the first real AGI. I think there's a route there, but it's not particularly reliable.  I hope we get a language model agent as the first AGI, since aligning those seems more reliable.

Thanks for linking to your papers -- definitely interesting that you have been thinking along similar lines. I think the key reason studying this is important is that these hedonic loops demonstrate that a) mammals, including humans, are actually exceptionally well aligned to basic homeostatic needs and basic hedonic loops in practice. It is extremely hard and rare for people to choose not to follow homeostatic drives. I think humans are mostly 'misaligned' about higher-level things like morality, empathy, etc. because we don't actually have direct drives for them hardcoded in the hypothalamus the way we do for primary rewards. Higher-level behaviours are either socio-culturally learned through unsupervised, cortically based learning or derived from RL extrapolations from primary rewards, so it is no surprise that alignment to these ideals is weaker. And b) that relatively simple control loops are very effective at controlling vastly more complex unsupervised cognitive systems.

I also agree this is similar to Steve Byrnes' agenda, and is maybe just my way of arriving at it.

This is because the [satisficing] objective is bounded and, if relatively easily achievable, there is much less of a strong incentive towards instrumental convergence to generic power-seeking behaviour.

Depending on what we mean by "satisficing", I think this isn't (theoretically) true of the formal satisficer agent. Have you read my post Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability, or -- in its modern form -- Parametrically retargetable decision-makers tend to seek power?

(That said, I don't think those particular power-seeking arguments apply to the real-world policies which we describe as "satisficing.")

This processing is controlled by a few specialized nuclei in the hypothalamus and has relatively simple slow control loop dynamics which simply apply inverse control to a prediction error outside the ideal range.

Maybe I’m picking a fight over Active Inference and it’s going to lead into a pointless waste-of-time rabbit hole that I will immediately come to regret … but I really want to say that “a prediction error” is not involved in this specific thing.

For example, take the leptin - NPY/AgRP feedback connection.

As you probably know, when there are very few fat cells, they emit very little leptin into the bloodstream, and that lack of leptin increases the activity of the NPY/AgRP neurons in the arcuate nucleus (directly via the leptin receptors on those neurons), and then those NPY/AgRP neurons send various signals around the brain to make the animal want to eat, and to feel hungry, and to conserve energy, etc., which over time increases the number of fat cells (on the margin). Feedback control!

But I don’t see any prediction happening in this story. It’s a feedback loop, and it’s a control system, but where’s the prediction, and where is the setpoint, and where is the comparator subtracting them? I don’t think they’re present. So I don’t think there are any prediction errors involved here.

(I do think there are tons of bona fide prediction errors happening in other parts of the brain, like cortex, striatum, amygdala, and cerebellum.)

See my post here.

I also think this is mostly a semantic issue. The same process can be described in terms of implicit prediction errors: e.g. there is some baseline level of leptin in the bloodstream that the NPY/AgRP neurons in the arcuate nucleus 'expect', and if there is less leptin this generates an implicit 'prediction error' in those neurons that causes them to increase firing, which then stimulates various food-consuming reflexes and desires, which ultimately leads to more food and hence 'corrects' the prediction error. It isn't necessary that there be explicit 'prediction error neurons' anywhere encoding the prediction errors, although for larger systems it is often helpful to modularize things this way.

 

Ultimately, though, I think it is more a conceptual question of how best to think about control systems -- in terms of implicit prediction errors or just in terms of the feedback loop dynamics -- but it amounts to the same thing.

Let’s talk about model-free RL (leaving aside whether it’s relevant to neuroscience—I think it mostly isn’t).

If you have a parametrized reward function R(a,b,c…), then you can also send the parameters a,b,c as “interoceptive inputs” informing the policy. And then the policy would (presumably) gradually learn to take appropriate actions that vary with the reward function.

I actually think it’s kinda meaningless to say that the reward function is parametrized in the first place. If I say “a,b,c,… are parameters that change a parametrized reward function”, and you say “a,b,c,… are environmental variables that are relevant to the reward function” … are we actually disagreeing about anything of substance? I think we aren’t. In either case, you can do a lot better faster if the policy has direct access to a,b,c,… among its sensory inputs, and if a,b,c,… contribute to reward in a nice smooth way, etc.

For example, let’s say there are three slot machines. Every now and then, their set of odds totally changes, with no external indication. Whenever the switchover happens, I would make bad decisions for a while until I learned to adapt to the new odds. I claim that this is isomorphic to a different problem where the slot machines are the same, but each of them spits out food sometimes, and friendship sometimes, and rest sometimes, with different odds, and meanwhile my physiological state sometimes suddenly changes, and where I have no interoceptive access to that. When my physiological state changes, I would make bad decisions for a while until I learned to adapt to the new reward function. In the first case, I do better when there’s an indicator light that encodes the current odds of the three slot machines. In the second case, I do better with interoceptive access to how hungry and sleepy I am. So in all respects, I think the two situations are isomorphic. But only one of them seems to have a parametrized reward function.

Specifically, the experiments by Morrison and Berridge demonstrated that by intervening on the hypothalamic valuation circuits, it is possible to adjust policies zero-shot such that the animal has never experienced a previously repulsive stimulus as pleasurable.

I find this a bit confusing as worded, is something missing?