This post was written as part of research done at MATS 9.0 under the mentorship of Richard Ngo.
Introduction
This is a follow-up to my previous post. There, I suggest that inconsistency over preferences could be an emergent feature of agentic behaviour that protects against internal reward-hacking. In this piece, I expand on the possible forms of this inconsistency. I propose two properties a promising theory of (internally inconsistent) agency might have, and describe how their confluence hints at a compelling proto-model. I additionally guess at which branches of current human knowledge may be fruitful for mathematising or otherwise un-confusing these properties.
Characterising internal inconsistency
I have already described how an agent's preferences can be seen as competing with each other for its attention, and I will expand on this further here. Another important feature of preferences is that they remain latent in agents' cognition most of the time, becoming salient at specific moments, such as when their stability is threatened.[1]
Preferences are often latent
Suppose that Bob identifies as a good family member who contributes to a harmonious household. This might manifest as him valuing symbolic behaviour such as sharing and accepting gifts and favours within his family. Bob's wellbeing is moreover plausibly dependent on this element of his self-image. He therefore treats it as a preferred state (i.e. goal), which means conditioning his behaviour or some of his other beliefs to serve it.
However, this preference does not need to be actively satisfied all the time. Bob is likely to feel its effects strongly when he visits his hometown to see his family, but it won't significantly affect, for instance, his daily shopping decisions.
Assume simultaneously that Bob identifies as a vegan who doesn't harm animals, and that this self-concept similarly causes him to modify his beliefs and actions to satisfy it. This does affect Bob's daily choices about consumption, but it may not be relevant to many other aspects of his life.
A good model of preferences in an agent would suggest what kinds of stimuli "instantiate" awareness of these preferences in Bob's mind, such that they take a prominent role in his next action. I'll tentatively define preferences that come up consistently in an agent's cognitive process as having high "salience".
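To make "salience" slightly more concrete, here is a minimal sketch in Python. It is purely my own illustrative framing, not a claim about the underlying cognition: each preference carries a hypothetical trigger condition over contexts, and only triggered preferences enter deliberation at all.

```python
# Illustrative sketch: preferences as latent objects that only become
# "salient" when the current context triggers them. The preference names
# and trigger conditions are hypothetical, chosen to mirror the Bob example.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Preference:
    name: str
    is_triggered: Callable[[dict], bool]  # context -> does this preference wake up?

preferences = [
    Preference("good_family_member", lambda ctx: ctx.get("with_family", False)),
    Preference("vegan", lambda ctx: ctx.get("choosing_food", False)),
]

def salient_preferences(context: dict) -> list[str]:
    """Return only the preferences that this context instantiates."""
    return [p.name for p in preferences if p.is_triggered(context)]

print(salient_preferences({"choosing_food": True}))                        # ['vegan']
print(salient_preferences({"with_family": True, "choosing_food": True}))   # both: a potential conflict
```

A richer model would presumably make salience graded rather than binary; the confidence weights discussed in the next section point in that direction.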
Preferences compete in power struggles
My previous post discussed one toy model for internal inconsistency. Rather than having a fixed set of preferences, an agent could have a probability distribution that describes its confidence in each set of candidate preferences. Each action would then satisfy either a preference sampled from that distribution, or a combination of preferences weighted by confidence.
I'm not ruling out randomness as a tool for designing decision procedures across possible preferences, especially since it may be necessary down the line to sidestep impossibility results like Arrow's theorem. However, it seems likely that preferences are given confidence levels in ways that depend at least somewhat predictably on context. Taking Bob's daily shopping as an example, his preference for being a vegan will consistently be awarded more "confidence" in such situations than his preference for being a good family member.
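As a companion to the toy model above, here is a rough sketch of both decision procedures, sampling a single preference versus taking a confidence-weighted combination, with the confidence made context-dependent. All of the numbers, utilities, and context labels are invented for illustration.

```python
# Sketch of the toy model: context-dependent confidence over preferences,
# resolved either by sampling one preference or by weighting their utilities.
# All weights and utility values below are invented for illustration.
import random

def confidence(context: str) -> dict[str, float]:
    # Hypothetical context-dependent weights (normalised to sum to 1).
    if context == "daily_shopping":
        return {"vegan": 0.8, "good_family_member": 0.2}
    return {"vegan": 0.5, "good_family_member": 0.5}

# Utility each preference assigns to each available action (illustrative).
utilities = {
    "vegan":              {"buy_tofu": 1.0, "buy_chicken": -1.0},
    "good_family_member": {"buy_tofu": 0.0, "buy_chicken": 0.0},  # indifferent here
}

def act_by_sampling(context: str, actions: list[str]) -> str:
    weights = confidence(context)
    prefs, probs = zip(*weights.items())
    chosen = random.choices(prefs, weights=probs)[0]   # sample one preference
    return max(actions, key=lambda a: utilities[chosen][a])

def act_by_weighting(context: str, actions: list[str]) -> str:
    weights = confidence(context)
    return max(actions, key=lambda a: sum(w * utilities[p][a] for p, w in weights.items()))

print(act_by_weighting("daily_shopping", ["buy_tofu", "buy_chicken"]))  # buy_tofu
```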
One take on the confidence bestowed on a preference is that it gives that preference power over competing preferences. This may not be of much interest in situations where only one of Bob's sensibilities is instantiated by the environment and the other remains indifferent. It's much more compelling, however, in cases where both of his preferences generate strongly held, mutually irreconcilable suggestions, such as when his family prepares him chicken for Christmas dinner.
The power of a preference is partially determined by its salience
Whether Bob chooses to eschew his principles by accepting his family's Christmas dinner or to shun his family's hospitality by respecting his vegan intuitions, one of his preferences will have "lost" against the other. Even if Bob makes the "correct" decision that minimises some hypothetical objective function, he will feel guilt or some other negative feeling with respect to the preference that lost out.[2] Importantly, that guilt is likely to persist in his mind and encourage him to satisfy the losing preference through other means. He may donate an offset to an animal welfare charity, or compensate for his rudeness with increased efforts to show gratitude and goodwill towards his family.[3]
This example illustrates that preferences can become salient in cognition as a power-grab inside their host. Preferences that are well-established and dominant are those that are often visible and weighted highly; these define the "fundamental" makeup of the agent. Preferences can thus be said to interact dynamically and almost politically within their environment. This pattern of fluid conflict is what motivates me to think of them as competing sub-agents rather than as sub-processes that are stochastically "instantiated" by the ambient conditions.
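One way to sketch this politicking is to let a losing preference accumulate "pressure" (guilt) that raises its salience for future decisions. The update rule and constants below are my own guesses, intended only to show the shape of the dynamic rather than a model this post commits to.

```python
# Illustrative guilt dynamics: when a preference loses a conflict, its
# salience grows and it is more likely to claim a later, compensating action.
# The numbers and update rule are hypothetical.
salience = {"vegan": 1.0, "good_family_member": 1.0}

def resolve_conflict(winner: str, loser: str, guilt_gain: float = 0.5) -> None:
    """The winner acts now; the loser accumulates pressure for later."""
    salience[loser] += guilt_gain                          # persistent guilt
    salience[winner] = max(1.0, salience[winner] - 0.2)    # temporarily sated

# Christmas dinner: Bob accepts the chicken, so the vegan preference "loses".
resolve_conflict(winner="good_family_member", loser="vegan")

# A later, lower-stakes decision is now more likely to be claimed by the
# vegan preference, e.g. donating an offset to an animal welfare charity.
print(max(salience, key=salience.get))  # 'vegan'
```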
Some inspirations for model-building
These reflections suggest that a model of preferential inconsistency could cast preferences as agents engaging in power struggles for relevance, or salience, in the larger agent's cognition. An important question is therefore how these preferences would be organised into decision procedures, and how we can model these dynamics' effects on the agent and sub-agents. Here are some possibly useful modelling tools (this list may grow as I edit this piece).
In my last post, I inaccurately claimed that active inference doesn't provide any tools for updating preferences. A commenter helpfully pointed out that hierarchical active inference does enable "lower-level" preferences in the hierarchy to be updated. Moreover, I speculate that the hierarchical structure could plausibly lend itself to thinking about power struggles; a very loose sketch of the updating idea follows after this list.
Cultural evolution and related theories model the transmission and adaptation of culture. The field offers one of the most developed environments for studying how concepts can be seen as behaving agentically, or at least as being subject to selection pressures. Unfortunately, I haven't found many tools that endow concepts within a person's head with any agency, though neuroscience or cognitive science may have made progress that I'm not familiar with.
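Returning to the first item on this list: as a very loose gesture at how a hierarchy could make lower-level preferences updatable (this is an analogy, not actual active-inference machinery), one can let a higher-level preference hold a target that a lower-level setpoint is gradually pulled towards. All quantities and the update rule are hypothetical.

```python
# Loose sketch of hierarchical preference updating (an analogy only, not
# real hierarchical active inference): a higher-level preference supplies a
# target, and the lower-level preference's setpoint is nudged towards it.
higher_level_target = 0.9    # e.g. "be someone who doesn't harm animals"
lower_level_setpoint = 0.3   # e.g. current strength of day-to-day vegan habits
learning_rate = 0.2

for _ in range(10):
    prediction_error = higher_level_target - lower_level_setpoint
    lower_level_setpoint += learning_rate * prediction_error  # update the lower level

print(round(lower_level_setpoint, 3))  # has drifted towards the higher-level target
```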
[1] For instance, humans' preference for maintaining their temperature within acceptable bounds tends to only take up cognitive space when we're too cold or too warm.
[2] For active inference enthusiasts, this paragraph can be rephrased quite directly in terms of prediction error.
[3] There are many other ways for the conflict to be resolved. Bob could, for instance, intuitively demote his preference for being a good family member because "his family are disappointing non-vegans anyway"; this would represent his good-family-member preference losing power.