Irrationality as a Defense Mechanism for Reward-hacking

Ashe Vazquez Nuñez

This post was written as part of research done at MATS 9.0 under the mentorship of Richard Ngo. It's related to my previous post, but should be readable as a standalone.

Remark: I'm not yet familiar enough with the active inference literature to be sure that the issues I bring up haven't been addressed or discussed. If you think my characterisation of the state and flaws of the theory are missing something substantial, I'd love to know.

Introduction

In the theory of active inference, agents are described as having a set of internal states that interact with external states (the world) through a membrane of intermediate states, such as the senses. I'm currently exploring how agents are able to exhibit approximations of external reference that allow them to stay alive in the real world. They achieve this even though they only have access to the statistical proxy of their internals, which they could easily reward-hack without optimising the external states at all.

One of active inference's weaknesses is that it struggles to model agents' uncertainties about their own preferences. I here propose a potential explanation for why agents are conflicted about these preferences. This perspective posits agents' seeming inconsistency and irrationality about their goals as a mechanism that protects them from reward-hacking their internal states.

Internal reward-hacking

Consider the following question:

What stops an agent from generating adversarial fulfilment criteria for its goals that are easier to satisfy than the "real", external goals?

Take Clippy as an example, whose goal is stated as maximising the amount of paperclips in the world. Since Clippy only has internal reference, it could represent this goal as "I observe that the world has as many paperclips as it could possibly have". I'm wondering what in Clippy's system saves it from "winning at life" by hooking its sensors up to a cheap simulator that generates an infinite stream of fictional paperclips for it to observe.

Do agents just have good priors?

An elegant answer to the problem of internal reward-hacking is that agents come pre-equipped with suitable priors about their internal states. In active inference, agents seek to update their beliefs and act on the world such that their observations fit their priors as closely as possible. The space of "good" priors for agents' internal states is very small. However, evolutionary pressures have selected for agents with priors that are conducive to their survival. According to active inference, agents attempt to manifest these favourable priors through action, which makes the priors function as preferences.

Unfortunately, the claim that evolutionarily fine-tuned priors do all the work to prevent internal reward-hacking seems lacking to me, because in practice we are uncertain about our own feelings and preferences. We don't actually have locked-in, invariant preferences, and it's unclear to me how active inference explains this; preferences are usually encoded as priors over observations, but ironically these are never updated.^[1]

Active inference thus implicitly assumes agents to be consistently, definitively settled on their preferences. Agents are only uncertain about the external states and about how their actions and senses will interact with those states. Within those unknowns, they seek to optimise for the observations that they are certain they prefer. I don't think this assumption is warranted. In fact, I have been considering the possibility that agents' uncertainty about their own preferences is an important instrument for increasing their (bounded) rationality.

Internal inconsistency as a tool for rationality

Consider the example I used in my last post of a hypothetical person, Alice, who wants to maximise "success". In that example, Alice avoids applying to a prestigious university because rejection would decrease her internal perception of success. She instead applies to a worse university that she is sure to get into, as this will certainly increase her success-o-meter.

Suppose instead that Alice feels a twinge of guilt not applying to the prestigious university, as this could be perceived as "loser" behaviour by her friend. This guilt may motivate her to apply anyway, even though the action lowers (in expectation) her internal perception of success. Here, the mixed optimisation of two distinct goals: "I perceive myself as maximally successful" and "I perceive myself as someone that my friend thinks is maximally successful", yields behaviour that actually makes Alice more successful.

In Free Energy Minimisers (FEMs) from active inference, preferences are usually described as fixed priors over the space of observations. One possible model for Alice's behaviour is that each action is chosen with respect to one of two sets of priors. The priors she chooses to satisfy in a given action are sampled from some distribution over priors that represents the degree to which she identifies with conflicting preferences. In practice, Alice now doesn't resemble a consistent FEM, but she has become more aligned with respect to the external goal. Her mixed strategy between preferences can be seen as hedging against her top choice of priors being unfit.

What's next: competition between preferences

I would like to distinguish this concept of inconsistent preferences from mental motions such as compartmentalisation. For instance, suppose an agent learns to calculate the derivative of a function (f+g) by having separate^[2] parts of itself calculate the derivatives of f and g and then adding the results. This motion could be seen as the agent using subagents' outputs to solve a problem. However, these "subagents" are not imbued with goals of their own. They're more like tools that the agent deploys to break the problem down into manageable components.

My guess is that people's uncertainties about their preferences are better represented as meme(plexe)s competing with each other for attention. The memes that live to be observed in minds are those that could be seen as agentically pursuing survival and reproduction.^[3] Internal preferential inconsistency would thus be analogous to the sub-parts in the above example optimising to convince the agent that they are "useful" for calculating derivatives and should be kept around.^[4]

Sub-processes and compartmentalisation as tools to increase rationality are not controversial ideas. The more contentious claim I'm ideating is that even conflicting agentic sub-processes — harboring goals that are unaligned with those of the larger agent — can still be useful for increasing agentic rationality with respect to external goals. I aim to formalise and explore this hypothesis in an empirical or mathematised setting.

^{^}
There's a good reason for never updating priors over observations. If agents' preferences could update, they would gradually move towards preferring states that are more likely, even if these aren't fruitful for their continued existence. The function of the fixed priors is to give agents a vision of the world they are willing to execute actions to manifest; these are preferences.
^{^}
this potentially includes separation across time
^{^}
For example, successful memes, like catchy songs, have a tendency to get their hosts to spread them to other people.
^{^}
This goal could functionally be the same as actually being good at calculating derivatives, but it doesn't have to be. For example, if the agent wants the derivative to be high, then a sub-part may gain a competitive advantage by overestimating the answer of the derivative of f. It may eventually convince the agent to employ two copies of itself to calculate the derivatives of both f and g, replacing the other sub-part.

Unfortunately, the claim that evolutionarily fine-tuned priors do all the work to prevent internal reward-hacking seems lacking to me, because in practice we are uncertain about our own feelings and preferences. We don't actually have locked-in, invariant preferences, and it's unclear to me how active inference explains this; preferences are usually encoded as priors over observations, but ironically these are never updated.

I don't think actinf proposes that all preferences are locked-in and invariant, just that the "deepest" priors---related to body temp., humidity, etc.---are (see Section 9). Imo if you're talking about a deep (hierarchical) actinf agent, then all the forward predictions are kinda-preferences, to varying degrees; the slower to update, deeper layers are more preference-y and the faster to update layers closer to sensory input are more belief-y. There's some interesting discussion of this here.

Active inference thus implicitly assumes agents to be consistently, definitively settled on their preferences.

So with that in mind, I think I'd disagree with this and agree with something more like "active inference assumes agents are settled in the preferences that are necessary to keep them alive, but not in the preferences that are necessary to bring those states about."

That said, though, I think your overall idea is interesting. If you're thinking about subagents and superagents in terms of actinf, you might want to check out this paper.

I only have a cursory knowledge of heirarchical active inference, but I have noticed from squinting at it from afar that it seems to afford some types of flexibility about preferences that I would value in a model. For instance, it seems to include mechanisms for making different preferences salient to the model at varying times. I'm also interested in your point that hierarchical structures can describe preferences that are increasing levels of "locked in" for the model. Thanks for tipping me to that, and thanks for the resources!

LESSWRONG
LW