I wrote this three years ago, before becoming extremely depressed and developing a lot of aversiveness around it (even though I had gotten a bunch of positive feedback). As a result, it’s a bit “out of step” with the current state of the conversation, and the writing is not fully up to my current standard. I still believe the core idea could be very valuable though, and wanted to get it out there.
January 2023
This is a braindump sketching out a major change in intuition that I went through a few months ago, and that I would guess either hasn’t been experienced by most people who are thinking about AI or hasn’t been properly updated on. I’m not going to hedge as much as I naturally would, to get my point across. I have a decent amount of uncertainty of course, especially about the specifics, and I also barely know anything about the relevant fields.
Summary
There’s a model of how agency works that lots of people explicitly or implicitly assume, which goes like “During the training of an intelligent agent, low-level reflexes generalize to heuristics, which in turn generalize to a general planning algorithm”. I believe this isn’t what happens in humans; instead, “planning” is itself a bunch of superficial, distinct, socially learned behaviors that are not learned through feedback about how well they fulfill your goals. I think this has some important consequences for thinking about AI - for example, it leaves us with no reason to think there is such a thing as a simple core of agency, and it leaves us less worried about inner misalignment, since sophisticated planning and reasoning are not acquired somewhere inaccessible inside the agent’s cognition, but are learned as behaviors in their own right.
The minimal takeaway is that even if I’m wrong about my interpretations here, introspective evidence about cognition seems extremely neglected, and the fact that seemingly no one is having the debates I’m gesturing towards in this essay is crazy.
The common model
There’s a model of the emergence of general planning/agency that goes something like “Low-level reflexes generalize to heuristics, which in turn generalize to a general planning algorithm”. I would guess that MIRI believes this, since Yudkowsky talks about “safe” or “unsafe” tasks (with respect to AGI arising) and about how humans “generalize” from the savannah to the moon. Even the shard theory people, who in some ways define themselves as being contra MIRI, seem to believe that a general planning algorithm gets bootstrapped out of low-level motor command planning. I would also guess Steven Byrnes believes this (see below).
I don’t think this is what happens in humans.
My alternative
Here’s Steven Byrnes’ example of a “foresighted plan” (prinsesstårta is a type of cake, and the “plan” is to order it):
This is framed as the brain planning using its self-supervised learned world model. But what I think is actually happening is that Steven has a socially learned association between being hungry / thinking about food and ordering food far in advance. (I could also imagine there being a self-image of being someone who treats themselves sometimes, or someone who is disciplined/rational enough to pursue delayed gratification - there are a lot of possibilities.) I’ve literally never thought about ordering food a week in advance, even though I’ve enjoyed cake a lot too - it’s not a socially learned affordance to me.
Calling this “planning through a world model” stretches the concept for me. It’s a much smaller world model than is portrayed here: I would guess that only eating the cake is viscerally modeled, and then there is a socially learned belief-about-concepts/vocalization/story of “If I order food, food will arrive in a week” (plus an association between imagining eating the cake, or generally being hungry, and the behavior of ordering food).
It’s not clear to me how “order food -> food will come” is even supposed to be learned by the brain’s self-supervised learning/predictive processing or RL. The prediction error/reward arrives a week after the prediction. And if it’s somehow deduced from higher-level knowledge about the world - how did that get learned? I think this is called the “temporal credit assignment problem” in RL and neuroscience (how do we correctly identify and reward the actions responsible for long-term outcomes?). I guess my thesis is that there is a simple explanation which fits the evidence better: it actually doesn’t get solved, and humans don’t viscerally model the wider world.
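To make the week-long delay concrete, here is a rough sketch of how quickly credit decays in one textbook RL setup, TD(lambda) with accumulating eligibility traces, where a reward arriving k steps after an action updates that action with weight (gamma * lam) ** k. This is only an illustration: the one-step-per-minute timescale, the parameter values and the function names are all my assumptions, and nothing here is a claim about what the brain actually implements.

```python
import math

# Assumption for illustration: the agent takes one decision step per minute.
MINUTES_PER_WEEK = 7 * 24 * 60


def log10_credit(gamma: float, lam: float, steps: int) -> float:
    """log10 of the eligibility-trace weight (gamma * lam) ** steps.

    Under TD(lambda) with accumulating traces, this is the factor by which
    a reward arriving `steps` steps later updates the earlier action.
    Log space is used because the raw value underflows to 0.0 for long delays.
    """
    return steps * math.log10(gamma * lam)


def horizon(gamma: float, lam: float, threshold: float = 0.01) -> int:
    """Smallest number of steps after which the trace weight is <= threshold.

    Requires 0 < gamma * lam < 1.
    """
    return math.ceil(math.log(threshold) / math.log(gamma * lam))


if __name__ == "__main__":
    gamma, lam = 0.99, 0.9  # hypothetical, textbook-style settings
    print("steps until credit falls below 1%:", horizon(gamma, lam))
    print("log10 credit after one week:", log10_credit(gamma, lam, MINUTES_PER_WEEK))
```

Pushing gamma and lam toward 1 slows the decay, but then every action taken during the intervening week shares in the credit, which restates the problem rather than solving it.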
I’ve gotten into the habit of trying to model what’s going on when I experience an impulse for an action that could be interpreted as “long-term planning”, and it seems to me that it’s all actually just a bunch of superficial, distinct, socially learned behavioral patterns, rather than any planning through a world model or any general/sophisticated heuristics for accomplishing long-term goals. (This is in the domain of long-term planning, to be clear. Obviously we have a bunch of very general heuristics for navigating our immediate physical and social environments.)
An uncontroversial example is when the average person is getting close to finishing high school and (say) starts thinking about which college they want to go to - clearly they are only in an eyebrow-raisingly concept-stretching way “planning to optimize for their long-term goals” - they are looking at colleges because they feel like they have to because that’s the normal thing to do, or because they don’t have an internal affordance to do anything else, or because it feels like that’s what you do if you’re the kind of person they want to be.
So in that case, it’s probably intuitive. But I think all human behavior is like that, just in more subtle ways.
For example, on the more sophisticated end of the scale, a very agentic person with good epistemics became this way not because they’re smarter. In the best case, they have a mental motion of paying attention to small doubts, biases and known failure modes of what they are doing - but this has been internalized via an escalating internal social desire to be a smart, diligent and exceptional person (probably combined with more specific memes), not in any direct way because they’re more intelligent (of course, being more intelligent helps with learning in general, but it’s not the cause of learning any particular behavior). And in any particular case of this person planning ahead or contemplating a decision, they are internally (sequentially) applying some of their portfolio of self-socially learned patterns based on how much they get activated by the given mental context.
It would probably be useful to add more examples, e.g. of someone reasoning about the wider world and making a decision, or of someone changing their mind, but this is already way too long.
Why do humans today look and behave so much like agents then? Why are agentic stories so easy to tell about our behavior? (“I want to get this job” etc). I think it’s that behavioral patterns that involved some goal-directed behavior got memetically selected for (assigned higher status through people acting with those patterns being more successful) over the past (tens of?) thousands of years, and so achieved higher rates of being reproduced through mimesis / imitative learning (and this process has probably intensified as people’s memetic environments became bigger and more interconnected - cultural FOOM). In other words, cultural evolution has preselected our behavioral impulses to be vaguely goal-directed for us.
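As a toy illustration of the selection story above, here is a minimal sketch of success-biased imitation. The model is entirely my assumption: a single goal-directed-looking pattern, a fixed status bonus, and made-up numbers. The point it shows is just that nobody in the model plans anything, and the pattern still takes over because it gets copied slightly more often.

```python
def imitation_step(p: float, status_bonus: float) -> float:
    """One round of success-biased imitation.

    p is the share of people enacting a goal-directed-looking pattern.
    Everyone copies a role model chosen in proportion to status, and the
    goal-directed pattern earns (1 + status_bonus) times the baseline status.
    """
    weighted = p * (1.0 + status_bonus)
    return weighted / (weighted + (1.0 - p))


def simulate(p0: float, status_bonus: float, generations: int) -> list[float]:
    """Trajectory of the pattern's share over `generations` rounds."""
    shares = [p0]
    for _ in range(generations):
        shares.append(imitation_step(shares[-1], status_bonus))
    return shares


if __name__ == "__main__":
    # Hypothetical numbers: 1% initial share, 5% status advantage.
    shares = simulate(p0=0.01, status_bonus=0.05, generations=400)
    for g in (0, 100, 200, 400):
        print(g, round(shares[g], 4))
```

Even a small status advantage compounds, since the odds of the pattern being copied multiply by (1 + status_bonus) every round. A bigger and more interconnected memetic environment would correspond, very loosely, to more rounds per unit of time.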
More abstractly, I think the reason why agency arose out of deeply social animals is that your reward signals being dependent on other agents’ approval makes the behaviors that you can learn extremely variable, and allows selection among them to take place.[1]
An example of such a very advanced mimetic role: the social role of the entrepreneur/founder (e.g. in Silicon Valley) gets you a lot of status if successful and intrinsically requires you to have a self-narrative of goal-directed behavior - in addition to lots of smaller behavioral patterns that help you succeed at founding companies that you get acculturated to (e.g. work hard, be flexible, push people, ask for help, solve problems).
Some arguments and intuition pumps
It fits the behavioral evidence much better: the deep irrationality, inflexibility, lack of agency and status quo bias of humans, the way ideology is immediately reified, the way that we change our mind mostly when something socially significant happens to us, the way that nonsocial low-level desires/goals don’t matter for our long-term goals (e.g. most drug addicts don’t long-term plan to get drugs). In retrospect, there was very clearly a constant slight doesn’t-quite-fit in the way that humans are modeled as “His goal was X, but he was biased in way Y” - at some point, if you keep adding on epicycles and supposed imperfections to the hypothesis that human beings algorithmically plan ahead, maybe there’s just no there there after all.
It fits the introspective evidence much better - my thinking about this only started because I was trying really hard to model myself, my desires, and psychology in general, around August and September 2022. All that I feel happening in me are simple behavioral patterns easily explained as triggered by my immediate mental context - I don’t feel myself (algorithmically) planning ahead.
A priori, we would expect the first, naturally arising, agent to attain this agency in the stupidest, most hacky way possible.
This hypothesis is strictly simpler because there is no additional cognitive process of planning or modeling the wider world, so insofar as we think it fits the evidence we should prefer it.
I would guess that humans provide an untapped wealth of evidence about cognition as well, not just alignment. In the MIRI framing, AGI is seen as so alien that evidence from humans isn’t worth much, and in the strains of thought around concrete ML safety, it feels like there is a disinclination towards speculative-feeling reasoning. Stepping back, it actually seems very weird that people aren’t basing their models of general intelligence/agency/etc on the one example that we have immediate and introspective access to, and a priori I think we should expect a community of such people to be missing something really important.
[Added in 2026]: Clearly, current LLMs are in an important way agentic - they pursue coherent goals in e.g. complicated coding tasks, think of new options to try, decompose tasks into subtasks - but simply via imitating human planning patterns (and the right ones among those being reinforced in post-training), not because there was any “simple core of cognition” that was found.
Some (oversimplifying) catchphrases:
Planning and reasoning are behavior, not cognition (socially learned behavior, to be exact).
Agency is first a behavioral, not an algorithmic/functional/internal (?), property of a cognitive system.
Human long-term planning and agency were bootstrapped out of local/internal social-symbolic maneuvering.
Miscellaneous comments and caveats
This misconception can probably also be seen as a consequence of the intelligence fetishism of rationalists and nerds in general. There’s probably something to be said about enactivism, and a more mundane conceptualization of intelligence as the ability to adapt to one’s environment or something.[2]
It feels like there is in some way a deep philosophical mistake being made in thinking that modeling the world or general planning, on the algorithmic level, is at all tractable for the brain’s learning algorithms. It seems like people mostly don’t realize that there even is another option? In retrospect, it’s very clear how the story of “the brain learns to move its limbs, then how to affect its immediate physical environment, andthenitsuddenlymodelstheentirerestoftheuniversedontworryabouthowthishappens” functioned as a semantic stopsign / fake reified abstraction / the part where my eyes glazed over in my models.
A less radical / more incremental way of putting all this might be: long-term planning, credit assignment on very delayed reward, and modeling the wider world are computationally intractable, and so the agentic behavior of cognitive systems is determined much more by the learned “prior”[3] about what kind of actions to take and what kind of stories to act out (most importantly in practice, “this is the kind of plan that a helpful hard-working AI assistant would come up with”). We can also see this introspectively in the example of humans. So the update I am talking about is simply one of downgrading the influence of competence/feedback from reality/goal-directedness/instrumental rationality/instrumental convergence in stories of how powerful cognition would work, and upgrading the importance of the learned superficial agentic behavior/internalized prior over plans/“just do the thing you were trained to do” (this is very vague, I’m sorry).
Again trying to be properly nuanced (and stepping out of the non-hedging intuition-pumping mode the rest of the post is in), humans definitely show some interaction (more than a naive application of the ontology I’ve expounded here would imply) between the social behavior and the visceral, “small” world model. E.g. the self-narratives that define or deeply influence our social behavior also use words that refer to things in our visceral world model. So some version of the “we are planning for outcomes” story survives - it would be absurd to claim otherwise. E.g. we can imagine how having our dream job would feel in the moment, get motivated by it, and backchain to get a long-term plan by filling in the rest of the story. Some version of “general planning” gets constructed out of the building blocks of social behavior. So there is a lot more nuance, of course - my point is just that framing it as “social stories first, general-ish planning out of that”, as opposed to “general planning, distorted by social stories”, overall requires fewer ad hoc modifications to fit reality.
Another example (to not come across like a crank, which I’m somewhat worried about): I would never say something as unnuanced as “the classic paperclipper conception of a rogue AI could never exist because powerful AI will have high-level humanlike stories”. That would be completely insane - even humans, over the course of their lives, often get ground down to care more about their base desires (comfort, sex, power) and less about the abstract high-level stories that originally drove them when they were young. Getting feedback from the world can still make you shift your high-level stories to accord more with your low-level reflexes. All the same things are still possible, and instrumental convergence remains a very useful concept. All the same worries remain - I am just trying to shift you smoothly and give you a different “lens” to incorporate into your portfolio, not say anything radical/insane.
Possible consequences
Some possible implications/updates (which I haven’t thought that much about):
More important:
Nothing left to elevate the hypothesis of a simple core structure of general planning or agency to our attention.
As a consequence, a much lower probability of agentic behavior “emerging” without it being in the dataset or directly rewarded in training.
Downweighting cognition’s power in general (for threat models etc), although that might be more specific to me since I was overvaluing intelligence before.
Shorter timelines since humans seemed much smarter before this change in intuition, but also maybe more specific to me as above.
I’m confused by the implications for takeoff speed, and shouldn’t put in more thought before finishing this, but probably the tendency is slower.
Somewhat lower concern about inner misalignment, because if we think reasoning/planning is a behavior, it seems much more likely that it will directly receive feedback in the training process (as is basically already done in post-training). This feedback won’t be internalized as low-level, “motor” constraints on some kind of general planning module which will surely figure out a way to outmaneuver them; instead it will shape the reasoning, agentic behaviors and wider world model themselves, because they’re happening out in the open (cf Externalized Reasoning Oversight?).
On balance, probably lower p(doom)?
Less important:
Situational awareness seems somewhat less natural than before, if “knowledge” like that needs to get internalized specifically, instead of being deduced through reasoning somehow.
An even higher probability (although it was already pretty high) of LLMs being involved in powerful AI, since a lot of agentic behavioral patterns (ones that systematically lead to a bigger effect on the world) are present in language.
The long-term goals emergent out of the agentic behaviors of an agentic AI could be completely unrelated to low-level reflexes (=nonagentic behavioral patterns), because in this framing they’re just separate behaviors. (for example, no human long-term plans to get candy or sex without a supporting socially learned story, and lots of humans don’t do it at all).[4]
Conclusion
This all feels quite important to me and like a lot of people might be confused about it. It’s not clear to me how much of this people already know or not, how much they “know” on some level but haven’t internalized and propagated to other beliefs, how much they have thought about it and disagree, how much they haven’t thought about it, etc.
[1] I could imagine an animal just as smart as humans, with learning algorithms just as good, but with less hardcoded social reward - I would guess they would just get very good at moving through their immediate physical environment and meeting their hardcoded needs, but would never ever develop what we would call “general” planning or agency (cf this famous paper that argues that chimpanzees actually are this (although I’m skeptical), and is generally the closest thing to my theory here I’ve found).
[2] See also “realism about rationality”. Also, note that I consider all this pretty orthogonal to the debate around whether human intelligence (as in, the capacity to learn to do tasks competently or something) is general or a bunch of specialized hacks - it seems like the former is likely right - I’m talking about agency, how you get from intelligence to long-term planning.
Crossposted from Substack.
[3] Thanks to Quintin Pope for inspiring this way of framing it.
[4] In other words, shard theory is exactly wrong.