From free-energy to embodied valuing: weighted active inference in everyday life

pseudonymized
7 min read · Jan 1, 2017

Now that we’ve seen how active inference with a weighted target distribution (let’s just call it weighted active inference for short) can behave equivalently to expected-utility maximization in some situations, when does it behave differently, and how?

The first contrast is with an agent that has an a priori utility function, or with an ordinary reinforcement learner: an active-inference agent with a target distribution will not seek out any single specific state to the exclusion of all others. Instead, it will seek to visit states or trajectories with the relative frequencies specified by its target distribution. Visiting even the best state too often will still generate normative prediction errors, which the agent will have to resolve by visiting other states.
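
To make the contrast concrete, here’s a minimal toy sketch (the states, numbers, and greedy matching rule are my own illustrative choices, not a claim about how any real implementation works): a utility maximizer with a fixed ranking always picks the same state, while an agent that acts to reduce the largest shortfall between its target frequencies and its empirical visitation frequencies ends up reproducing the target distribution.

```python
import numpy as np

states = ["work", "rest", "socialize"]     # hypothetical states
target = np.array([0.6, 0.3, 0.1])         # target visitation frequencies

# An expected-utility maximizer with a fixed ranking always picks the same state:
utility = np.log(target)                   # one way to read a utility off the target
print("utility maximizer always picks:", states[int(np.argmax(utility))])

# A target-matching agent visits whichever state currently has the largest
# shortfall between target frequency and empirical frequency (the per-state
# "normative prediction error"), then updates its visitation counts.
counts = np.zeros(len(states))
for t in range(1000):
    empirical = counts / counts.sum() if counts.sum() > 0 else np.zeros_like(counts)
    shortfall = target - empirical
    counts[int(np.argmax(shortfall))] += 1

print("matcher's visitation frequencies:", np.round(counts / counts.sum(), 2))
# -> roughly [0.6, 0.3, 0.1]: even the "best" state is not visited exclusively
```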

Next, the agent’s target distribution will account for causal structure, for context, and for multiple reinforcement signals.

This gets more interesting when you have multiple reinforcement/motivational signals across different sensory modalities (including interoceptive ones) and time-scales, each with its own varying precision, whose content is then integrated using hierarchical probability models. This would redeploy the same mechanism used for learning overhypotheses about causal models to learning contextual hierarchies of normative signals, along with similar machinery for inferring distal causes. We would thus have a decent imaginary sketch of how an embodied organism is tied by sensory signals to its own specific distal needs, interests, and goals. We’re starting to see how embodied creatures like us can have values which act upon us with normative force, and how the models and beliefs we form about those values can be correct or incorrect.

From here I can imagine how to sketch the calculations that might go into, for instance, my decision to get this season’s influenza vaccination: suffer the normative prediction errors (i.e., pain) of a needle-jab and an achy arm in order to better suppress the higher-level (longer-timescale) normative prediction errors (i.e., the pain) of getting the flu.
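
Purely as an illustration with made-up numbers (none of these magnitudes or probabilities come from anywhere), the trade-off has the familiar expected-cost shape:

```python
# Made-up magnitudes and probabilities, just to show the shape of the trade-off.
jab_error = 2.0              # normative prediction error from the needle and achy arm
flu_error = 40.0             # normative prediction error from a week of influenza
p_flu_if_vaccinated = 0.03
p_flu_if_skipped = 0.10

expected_if_vaccinated = jab_error + p_flu_if_vaccinated * flu_error   # 2.0 + 1.2 = 3.2
expected_if_skipped = p_flu_if_skipped * flu_error                     # 4.0

print("vaccinate" if expected_if_vaccinated < expected_if_skipped else "skip")
```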

Note that this model can also account for some human failures to act in “our own best interest”, specifically by accounting for our failure to know our own best interests. In the absence of sufficiently precise reward and normative-precision signals, or if the brain does not process them properly, the passive empirical data will dominate the resulting target density. A person who worked this way wouldn’t act according to their best interests, but would instead act towards the positive side of “normal”.
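
One simple way to cash this out (a sketch under my own assumption that the weighting works by exponentially tilting the empirical distribution with precision-weighted reward, which is a common construction but not necessarily the one intended here): as normative precision falls to zero, the target density collapses back onto the passive empirical distribution, and the agent just keeps doing what it has always done.

```python
import numpy as np

def weighted_target(empirical, reward, precision):
    """Tilt the passive empirical distribution by precision-weighted reward."""
    unnormalized = empirical * np.exp(precision * reward)
    return unnormalized / unnormalized.sum()

empirical = np.array([0.70, 0.25, 0.05])   # "normal" life as actually lived
reward    = np.array([-1.0,  0.5,  2.0])   # embodied reinforcement per state

print(weighted_target(empirical, reward, precision=0.0))   # = empirical: no normative pull
print(weighted_target(empirical, reward, precision=5.0))   # strongly reshaped toward reward
```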

(This could potentially explain such effects as status-quo bias and the supposed “revolution of rising expectations”. A status-quo situation may be unpleasant (low reward, significant punishment), but without adjusting the target distribution towards higher reward and lower punishment, it will generate only small normative prediction errors. The normative prediction errors are what actually propel action, not reward and punishment themselves, so only when an update to reward or normative precision increases the prediction-error signal will action work to resolve that error.

Such a creature could easily be “led around” by the stimuli with the greatest normative precisions and rewards, even to the point of neglecting the dominating, wide-receptive-field distal causes in their environment. This sounds almost exactly like what we see in real human beings when we neglect the long term. It’s also a solid empirical prediction of this theory: normative force is a precision parameter, so precise signals “in the moment” ought to shift people’s preferences (and give them more of a motivational “oomph”) better than vaguer signals “over time”. We then face the difficulty of using hierarchical structure learning to notice how important the background events actually are.)

There’s also the issue of making sure that a target distribution formed by some probabilistic model of the causal trajectories we want doesn’t accidentally specify things we don’t want. Uniform target distributions over observed or latent variables are trivial. However, if we use a non-uniform goal model, then we need to make sure that the resulting target distribution has a minimal divergence from the passive empirical distribution, while still encoding the goal we intend. This would give us a probabilistic way to encode a principle of minimal action when trying to write models of causal trajectories such as, “The strawberry is moved unharmed from one plate to the other, and nothing else.”
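
One way to make “minimal divergence while still encoding the goal” concrete (again my own sketch; the constraint form and the four coarse trajectories are illustrative assumptions): among all distributions that give the goal event a required probability, the one closest in KL divergence to the passive empirical distribution boosts the goal trajectories and shrinks everything else proportionally, which leaves the relative structure of “everything else” untouched.

```python
import numpy as np

def goal_target(empirical, goal_mask, goal_prob):
    """KL-closest distribution to `empirical` that gives the goal set
    probability `goal_prob`, rescaling everything else proportionally."""
    p_goal = empirical[goal_mask].sum()
    if p_goal >= goal_prob:
        return empirical.copy()                           # goal already satisfied
    target = empirical.copy()
    target[goal_mask]  *= goal_prob / p_goal              # boost goal trajectories together
    target[~goal_mask] *= (1 - goal_prob) / (1 - p_goal)  # shrink the rest uniformly
    return target

# Four coarse "trajectories" for the strawberry task (illustrative only):
# 0: strawberry moved unharmed, nothing else disturbed   (the goal)
# 1: strawberry moved but squashed
# 2: strawberry moved, table knocked over
# 3: nothing moved at all
empirical = np.array([0.05, 0.05, 0.02, 0.88])
goal_mask = np.array([True, False, False, False])

print(np.round(goal_target(empirical, goal_mask, goal_prob=0.95), 3))
# The side-effect trajectories keep their *relative* empirical proportions,
# which is the probabilistic "and nothing else" clause.
```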

So what does all this buy us over and above plain active inference or plain reinforcement learning, other than mere psychological verisimilitude? Why do active inference with weighted target distributions?

“Adding back” reinforcement learning by weighting the target distributions according to reward, punishment, and normative precision signals lets us draw a clear line between embodied valuing and disembodied valuing, the kind of minds we are versus the kind we might build. We humans are embodied valuers: our target distributions are weighted according to signals whose nature is determined by our embodiment, and which thus “point at” the distal causes related to our physical natures. We approvingly allow those signals to update our target distributions in order to increase the mutual information between those distal causes (our “values”) and our target distribution.

For instance, consider the case described by Peter Railton of a traveler who wants a glass of milk:

For example, Lonnie, a traveler in a foreign country, is feeling miserable. He very much wishes to overcome his malaise and settle his stomach, and he finds he has a craving for the familiar: a tall glass of milk. The milk is desired by Lonnie, but is it also desirable for him? Lonnie-Plus can see that what is wrong with Lonnie, in addition to homesickness, is dehydration, a common affliction of tourists, but one often not detectable from introspective evidence. The effect of drinking hard-to-digest milk would be to further unsettle Lonnie’s stomach and worsen his dehydration. By contrast, Lonnie-Plus can see that abundant clear fluids would quickly improve Lonnie’s physical condition — which, incidentally, would help with his homesickness as well.

By turning our lens of weighted target distributions and embodied cognition on Lonnie and Lonnie-Plus, we can provide a more reductionist account than Railton does. In the predictive processing paradigm, we can account for emotion in terms of interoceptive probabilistic inference: reasoning about the state of the body and the causes of that state. We can thus see that ordinary Lonnie is making an incorrect causal inference: he attributes his feeling to distance from home rather than to a physical need. We can also see why Lonnie ought to prefer water to milk: water will relieve his dehydration and bring his body to a state in which it will stop sending negative reinforcement signals to his brain.

However, the bodily signals involved in Lonnie’s unease have a fairly low precision, while the various sensory signals involved in experiencing a foreign country have high surprisals. Lonnie is thus driven by approximate but rational probabilistic inference to misattribute why he feels bad (why his brain receives negative embodied reinforcement) to homesickness rather than to dehydration. His problem is thus no longer that he lacks perfect information and perfect instrumental rationality in any non-reductive sense. His problem is just that his brain has a low mutual information with the bodily states actually generating his reinforcement signals.
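
Here’s a toy version of that misattribution (the hypotheses, channels, precisions, and numbers are all invented for illustration): with a high-precision exteroceptive channel shouting “unfamiliar surroundings” and a low-precision interoceptive channel mumbling “fluid deficit”, a precision-weighted Bayesian update lands on homesickness even when the interoceptive evidence, taken alone, favors dehydration.

```python
import numpy as np

def gauss_pdf(x, mean, precision):
    """Gaussian density parameterized by precision (1/variance)."""
    return np.sqrt(precision / (2 * np.pi)) * np.exp(-0.5 * precision * (x - mean) ** 2)

hypotheses = ["homesickness", "dehydration"]
prior = np.array([0.5, 0.5])

# Predicted mean of each evidence channel under each hypothesis:
#                      "unfamiliar surroundings"  "fluid deficit"
predicted = np.array([[1.0, 0.0],      # homesickness
                      [0.2, 1.0]])     # dehydration

precisions = np.array([9.0, 0.25])     # vivid exteroception vs a vague gut feeling
observed = np.array([1.0, 0.8])        # what Lonnie's brain actually receives

likelihood = np.prod(gauss_pdf(observed, predicted, precisions), axis=1)
posterior = prior * likelihood
posterior /= posterior.sum()

print(dict(zip(hypotheses, np.round(posterior, 2))))
# {'homesickness': ~0.94, 'dehydration': ~0.06}: the high-precision channel wins,
# even though the interoceptive channel, taken alone, points at dehydration.
```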

Lonnie’s brain can use active inference to resolve its situation: it can deploy endogenous attention to how Lonnie’s throat feels (a sensation characteristic of thirst), or it can plan a series of experiments in what to drink to see what helps Lonnie feel better. Perhaps Lonnie’s brain will just naively direct his body to go drink a glass of milk. This will make Lonnie feel worse, but having done the experiment, the resulting reinforcement information will feed back into inference and direct Lonnie toward something more productive.
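
Carrying over the toy numbers from the sketch above (still purely illustrative), even one failed milk experiment drags a useful amount of probability mass back toward the right cause:

```python
import numpy as np

# Use the ~[0.94, 0.06] posterior from the sketch above as the new prior.
prior = np.array([0.94, 0.06])          # [homesickness, dehydration]

# Rough likelihood of the outcome "drank milk, felt worse" under each cause:
# the familiar taste should have helped homesickness; worsening fits dehydration.
likelihood = np.array([0.1, 0.9])

posterior = prior * likelihood
posterior /= posterior.sum()
print(np.round(posterior, 2))           # -> about [0.64, 0.36]: belief swings toward dehydration
```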

We can thus see how Lonnie, as an embodied valuer, has an objective interest based in his physical needs, how active inference with a weighted target distribution can lead him to sometimes make incorrect judgements regarding his interests (that is, incorrect inferences regarding the distal causes of his embodied reinforcement signals), and how Lonnie can make better judgements simply by improving the performance of active inference. A higher mutual information with the embodied reinforcement mechanisms can come either from improving the precision of his sensory signals themselves or from performing experiments on potential causal attributions. When the available causal hypotheses simply fail, a higher mutual information can be obtained by allowing for more complex hypotheses.

The ultimate objective function of the system remains the same: maximize the mutual information between embodied sensory and reinforcement states and the hierarchical causal representations in the brain, by minimizing both the descriptive and the normative prediction error.
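
If I had to write that objective down in symbols, a rough sketch (my own notation, not a formula lifted from the active-inference literature) might pair a descriptive term scoring how well the generative model predicts the sensory stream with a normative term scoring the divergence of predicted outcomes from the weighted target distribution:

```latex
% Sketch notation only: \pi = policy, o = observations, s = latent causes,
% p_emp = the passive empirical distribution, r = reward, \gamma = normative precision.
\[
\mathcal{F}(\pi) \;=\;
    \underbrace{\mathbb{E}_{q(o,\,s \mid \pi)}\big[-\log p(o \mid s)\big]}_{\text{descriptive prediction error}}
  \;+\;
    \underbrace{D_{\mathrm{KL}}\big(q(o \mid \pi)\,\big\|\,\tilde{p}(o)\big)}_{\text{normative prediction error}},
\qquad
\tilde{p}(o) \;\propto\; p_{\mathrm{emp}}(o)\, e^{\gamma\, r(o)}
\]
```

Roughly: minimizing the first term keeps the causal model tracking what the body and world are actually doing, minimizing the second keeps predicted outcomes tracking what the weighted target asks for, and together they push up the mutual information described above.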

When an embodied valuer needs not only to track their normative precision and reward signals to form a weighted target distribution, but also to engage in precision learning and structure learning over the causal structures and abstraction hierarchies being weighted, we start to see an outline of why evaluative subjects can be so slippery. High-precision normative distal causes operating through reliable causal paths (i.e., “being sick sucks”) can be inferred with such great clarity that basically everyone agrees about them, while low-precision normative distal causes (“Do we need God for our lives to have meaning?”) or those operating through more variable causal paths (“chocolate or vanilla?”) appear more mysterious. In fact, we can only really resolve questions about the latter in a realist way by appealing to the embodied mind, noting that the particular causal paths involved in receiving a stronger reward signal from one thing over another, all else equal, are unique to the individual.

A disembodied valuer comes closer to the traditional notion of a priori rationality: such an agent might sometimes update their target distribution, but they do so through built-in definitions of what counts as a normative signal, rather than based on their embodiment. In fact, a disembodied valuer with no weighting whatsoever to their target distribution would constitute something relatively close to a perfectly unbiased rational agent — at least within this paradigm.
