The nature of doing interdisciplinary research is that you have to know a little about a lot of things. Unfortunately, it's hard to tell whether you know a little about something or are just misinformed about it.
Much of my agent-y thinking is inspired by, and seeks to adequately model, human cognition, but I realised I have no solid understanding of the relationship consciousness has to cognition. They're definitely not the same, since most processes I could describe as cognitive don't materialise in my consciousness. However, almost all the examples I find insightful feature conscious decision-making. This suggests that what cognition is without consciousness is opaque to me.
I'd like to clarify that most people do reward-hack themselves all the time; I think reward hacking is the default. What I'm grappling with here is rather why people don't reward-hack themselves literally to death.
In terms of "doing the real thing", could you elaborate on what "the real thing" means in the frame I'm using of external vs. internal states, or in some other frame you resonate with?
I only have a cursory knowledge of hierarchical active inference, but I have noticed from squinting at it from afar that it seems to afford some types of flexibility about preferences that I would value in a model. For instance, it seems to include mechanisms for making different preferences salient to the model at varying times. I'm also interested in your point that hierarchical structures can describe preferences that are increasingly "locked in" for the model. Thanks for tipping me off to that, and thanks for the resources!
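To make the salience mechanism concrete for myself, here's a minimal sketch of how I currently picture it: a higher level assigns a context-dependent weight (a cartoon of precision-weighting) to competing lower-level preferences, so which preference dominates varies over time. This is my own toy construction, not code from any active inference framework; the preferences, contexts, and weights are all made up for illustration.

```python
import numpy as np

# Two low-level preference distributions over outcomes [rest, work].
pref_comfort = np.array([0.9, 0.1])  # prefers resting
pref_career = np.array([0.2, 0.8])   # prefers working

def effective_preference(career_salience):
    """Blend preferences by a context-dependent salience weight in [0, 1]."""
    blended = (1 - career_salience) * pref_comfort + career_salience * pref_career
    return blended / blended.sum()

# The higher level makes different preferences salient in different contexts.
for context, w in [("sunday morning", 0.1), ("deadline week", 0.9)]:
    p = effective_preference(w)
    print(f"{context}: P(rest)={p[0]:.2f}, P(work)={p[1]:.2f}")
```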
There are two questions that currently guide my intuitions on how interesting a (speculative) model of agentic behavior might be.
1. How well does this model map onto real agentic behavior?
2. Is the described behavior "naturally" emergent from environmental conditions?
In a recent post I argued, in slightly different words, that seeing agents as confused, uncertain and illogical about their own preferences is fruitful because it answers the second question in a satisfactory way. Internal inconsistency is not only a commonly observable behavior (cf. behavioral economics); it is also an emergent strategy that protects agents against finding adversarial internal representations of external goals. I called this internal reward-hacking.
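To illustrate the protective effect I have in mind, here's a toy experiment (my own construction, with made-up numbers, not something from the post): each internal proxy of an external goal has one exploitable flaw, and an agent optimizing a single consistent proxy finds the exploit, while an agent averaging over mutually inconsistent proxies does not.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.linspace(-5, 15, 2001)
true_goal = -(xs - 3) ** 2  # the external goal: peak at x = 3

def proxy(spike_loc):
    """An internal representation of the goal with one exploitable flaw:
    a spurious high-reward spike at spike_loc."""
    return true_goal + 150 * np.exp(-((xs - spike_loc) ** 2) / 0.01)

# A single, consistent internal proxy: the agent hacks the spike.
single = proxy(spike_loc=11.0)
print("single proxy argmax:", xs[single.argmax()])  # finds the spurious spike

# Mutually inconsistent proxies (flaws in different places): averaging
# them washes any single exploit out.
ensemble = np.mean([proxy(rng.uniform(-5, 15)) for _ in range(50)], axis=0)
print("inconsistent ensemble argmax:", xs[ensemble.argmax()])  # close to x = 3
```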
What I think I failed to communicate is that my final thoughts on preferences competing with each other also come from reflecting on question 2. It makes sense that agents might weigh different memes representing possible preferences to improve their decision-making, but this doesn't explain how those memes behave with respect to the agent.
Memes are subject to selection pressures not unlike animals. Moreover, memes can co-adapt or compete with each other, and they can arguably engage in active transmission, such as influencing their host's behavior to enable their own spread. It therefore feels reasonable for models of human cognition to imbue memes with some agency, enabling them to perceive and interact with each other and their hosts (Claude's literature review indicates that at least some memeticists disagree with me on this point).
A model that aspires to respect these agentic properties of memes would thus be incomplete without describing what incentives memes have to participate in a cognitive system, and how those incentives shape the system's emergent properties. That's why I think it's insufficient to see potential preferences as impartial, passive sub-modules that agents deploy to enable their own rationality. Instead, we should always attempt to answer "what's in it for the memes?"
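As one concrete (if cartoonish) way to ask "what's in it for the memes?", standard replicator dynamics already capture the selection pressures above: a meme's frequency grows when its transmission payoff beats the population average, and co-adapted memes can jointly crowd out a competitor. The payoff matrix below is invented purely for illustration.

```python
import numpy as np

# A[i][j]: transmission payoff to meme i when its host also carries
# meme j. Memes 0 and 1 boost each other's spread (co-adaptation);
# meme 2 only helps itself. All values are made up.
A = np.array([
    [1.0, 1.4, 0.3],
    [1.3, 1.0, 0.2],
    [0.1, 0.2, 1.1],
])

x = np.array([0.4, 0.1, 0.5])  # initial meme frequencies in the host population
for _ in range(200):
    fitness = A @ x
    x = x * fitness / (x @ fitness)  # discrete replicator update

print("long-run meme frequencies:", np.round(x, 3))
# The co-adapted pair (memes 0 and 1) persists; the loner meme 2 dies out,
# despite starting as the most common meme.
```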
more conscious: deciding what move to make in a chess game.
less conscious: the physical act of playing a move. You can move the piece in a conscious, deliberate way, but in practice the movement usually follows "automatically" from the high-level decision of what move to play.
not conscious: reflex reactions.