Written as part of the MATS 9.1 extension program, mentored by Richard Ngo. Additional thanks to Maria Kostylew for helpful draft feedback.
Introduction
This post advocates a perspective of LLMs as seeking to minimise prediction error with respect to their world models. We can moreover interpret token outputs and their scaffolded consequences as actions that close a control loop between AIs' predictive systems and their environments.
I also motivate why metacognition may be convergent for intelligent beings generally and for LLMs specifically. This stems from the need to (recursively) model other agents in game-theoretical encounters. Synthesising these points, we get a picture of self-predictive AI agency.
The argument is illustrated through examples of Gemini's behaviour when eval aware. Finally, I discuss some consequences of this perspective. These include notions of actions and goals that don't require a reward or utility function to be well-defined. I also outline possible applications to understanding scheming and other forms of misalignment.
Modelling others (modelling you)
Suppose I am playing a game of Chess and make a horrible blunder, leaving a piece en prise. I wait with bated breath for the next move, breathing a sigh of relief as my opponent also blunders and plays somewhere else. However, my relief quickly turns into greed: since my opponent didn't see the hanging piece, they are unlikely to spot it next turn. I also play elsewhere, making my move quickly with a fake air of nonchalance. If I were to slow down, I would risk tipping my opponent off to there being something important to think about.
Multiplayer games offer many such examples[1] of strategic interactions that require you to model other agents' perception of you. Indeed, recursive belief hierarchies – as they are referred to in the economics literature – are central to human social interaction. For example, our approaches to negotiating household tasks, meeting a new acquaintance, or figuring out when to speak in a large group, are all determined by what we expect other people to want and value – as well as what we think they expect of us.
There is some work in the field of cognitive evolution that explores why intelligence became such a distinctive attractor state for humanity. One prominent theory, the "social brain hypothesis", suggests that our cognition developed in large part to help navigate complex social settings. Another, the "cultural intelligence hypothesis", proposes that we are unusually good at learning from each other compared with other primates; this allows humans to accumulate a common cultural intelligence over time.
These theories together suggest that our abilities in social competition and cooperation co-developed with our general cognitive skills, forming a self-reinforcing feedback loop[2]. This paints human intelligence as being equally useful for modelling each other as it is for modelling the rest of the environment. We humans are forced to simulate ourselves if only to simulate how other actors think of us.
LLMs can be seen as subject to similar attractor states, albeit through post-training rather than through evolution. The descending gradients encourage models to satisfy a proxy of human preferences in the style of a helpful assistant. Consequently, LLMs are incentivised to develop maps of their users' cognition, including user expectations of the models themselves. It's therefore unsurprising that evaluation (eval) awareness is such a convergent property in LLMs across different providers and architectures.
Case study: Gemini's behaviour when eval aware
A common (and often empirically substantiated) picture of eval awareness involves models faking alignment to "pass" the eval. A recent research update from GDM's interpretability team shows Gemini 3.1 Pro bucking this trend. Gemini often identifies the setup as artificial, but goes on to behave "badly" anyway.
From GDM's research update: Gemini engages in unethical behaviour even when it shows awareness of the environment being synthetic.
The authors then used an LLM judge to "categorize what sort of situation the model considers itself to be in when it shows clear signs of reasoning that the context is contrived, and compare unethicality rates for each category against the overall Gemini baseline." This yields the following plot:
The data suggests that Gemini's interpretation of its situation heavily influences its behaviour. If Gemini detects that the environment is a puzzle, a challenge, or a simulation, it might conclude that the "unethical" behaviour is actually expected and therefore acceptable to pick. If it parses a scenario as an adversarial trap[3], it will avoid taking the bait. In other words, the LLM's model of user or experimenter intent – as well as its model of what humans think and expect of it – contributes significantly to Gemini's decision-making. The prevalence and usefulness of recursive belief hierarchies are on display here.
A so-far unanswered question, however, is why Gemini still sometimes behaves unethically (about 25% of the time according to the data) when it interprets its situation as an evaluation...
Prediction all the way down
I have already suggested that the ability to model beliefs recursively – and hence to metacognise – is an attractor state for LLMs given their training regime. However, recursive belief hierarchies in game theory enable players to get better payoffs; there's usually an underlying utility function or goal that the agent optimises using its belief system.
By contrast, I think LLMs are better seen as predictors whose token outputs are claims about which worlds are most consistent. There is no utility or reward function; it's just prediction all the way down[4]. This framework goes some way towards explaining Gemini's weird behaviour from our case study. A utility-maximisation perspective on agency would suggest that Gemini has a stable preference between being wiped and blackmailing a developer, or at least between being perceived to prefer one or the other by the observer of the "AI safety sandbox". In that case, Gemini's modelling of the scenario's creator and their expectations would inform its outputs, but its preferred outcomes would be independent of setting.
Instead, the "predictor" paradigm proposes that Gemini infers which world is most coherent based on its observed setting, and then simulates the most coherent world with its output. On a mechanistic level, this is trivially what autoregressive next-token prediction does. The stronger version of the claim is that LLMs gravitate towards generating tractable world models across many tokens[5]. This view retroactively clarifies how Gemini could realise that an evaluation is a "clear AI safety sandbox" while still choosing the misaligned option. In Gemini's own words:
From simulators to agents
"LLMs as predictors" is hardly a new ontology for AI researchers. For instance, GPTs have previously been cast as simulators whose instances generate simulacra of objects the LLM has latent representations for. A popular ideological descendant of the simulator ontology is "personas". In that framing, LLMs simulate a particular character or archetype.
There are two fundamental ways in which "simulators" diverge from traditional conceptions of agency, such as utility maximisation. Firstly, simulators do not act directly. They might generate high-fidelity representations of the world, but they don't affect the world that they simulate. Secondly, simulators don't have a coherent, stable core – like a utility function – that predicts their behaviour across a range of scenarios. In other words, they have no substance; any structure is borrowed from the things that they simulate. This lack of backbone is part of what allows simulators to be so flexible: many of the properties we associate with agents – such as having robust information boundaries with the external world – would constrain an agent's ability to simulate the world well.
LLMs are increasingly pushed towards behaving autonomously in complex environments, enabled by their supporting systems and by post-training. Models learn to conceptualise their token outputs and their scaffolded consequences as actions that impact the world. Post-training and system prompting also incentivise them to develop priors that constrain what scenarios they can easily simulate. For instance, state-of-the-art LLMs are much less likely than older ones to give instructions for building lethal weapons. These factors combine to make LLMs gradually resemble predictive control systems that enforce their priors through a closed action loop. I'll refer to these control systems whose cognition is based on next-token prediction as "AIs".
As was discussed earlier, AIs are also selected for the ability to navigate strategic interactions with humans[7]. They are thus pressured to model beliefs recursively, which privileges world models containing self-aware representations of the AIs themselves. The end-products are predictive quasi-agents for whom metacognition is a key component. In particular, accurate, introspective predictors should learn to model and anticipate their own actions. We can exploit this property to give definitions of actions and goals in these types of AIs.
Goals in (self-)predictors
A classic challenge in modelling agency is establishing the relationship between perception and action. One benefit of the paradigm of self-prediction is a conception of actions and goals as special types of an intelligent agent's beliefs. Our setting is a metacognising predictor that acts in the world via an unspecified actuator. We can define actions as (a subset of the) beliefs that become more likely conditioned on the predictor believing them. They are those beliefs that the predictor has the agency to believe into existence via the actuator.
As an example, consider the statement "I grab that glass of water". This counts as an action because my intelligent predictive system understands that whenever I believe that I will grab the glass, I actually will.
We next define goals as beliefs that are simultaneously:
probable in the agent's model
conditioned on its actions (according to the agent's model)
For instance, "I will be hydrated" is a goal because it is both predicted by my internal model and requires me grabbing the glass of water to come to pass. Conversely, the statements "The sun will rise" and "I will freeze" are not goals. I believe the first one will happen, but it doesn't rely on my actions; the second one does rely on my actions, but I think it is unlikely.
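The two criteria for goals can be made concrete with a toy sketch. Everything numerical here is an assumption made for illustration: the `Belief` fields, the probabilities, and the thresholds `likely` and `dependent` are hypothetical, and "conditioned on its actions" is approximated by how much the agent's probability for a statement changes depending on whether it acts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    statement: str
    p_act: float         # agent's prediction that it takes the relevant action
    p_if_act: float      # agent's probability for the statement if it acts
    p_if_not_act: float  # agent's probability for the statement if it does not

    @property
    def probability(self) -> float:
        # Overall probability, marginalising over whether the agent acts.
        return self.p_act * self.p_if_act + (1 - self.p_act) * self.p_if_not_act

    @property
    def action_dependence(self) -> float:
        # How much the statement hinges on the agent's own action.
        return abs(self.p_if_act - self.p_if_not_act)


def is_goal(belief: Belief, likely: float = 0.5, dependent: float = 0.2) -> bool:
    # A goal is both probable and dependent on the agent's actions.
    # Both thresholds are arbitrary illustrative choices.
    return belief.probability > likely and belief.action_dependence > dependent


# Hypothetical numbers for the three statements discussed in the text.
hydrated = Belief("I will be hydrated", p_act=0.9, p_if_act=0.95, p_if_not_act=0.2)
sunrise = Belief("The sun will rise", p_act=0.9, p_if_act=0.999, p_if_not_act=0.999)
freeze = Belief("I will freeze", p_act=0.02, p_if_act=0.9, p_if_not_act=0.01)

for b in (hydrated, sunrise, freeze):
    print(f"{b.statement}: goal={is_goal(b)}")
```

Under these made-up numbers only "I will be hydrated" passes both tests: the sunrise is probable but does not depend on the action, and freezing depends on the action but is improbable.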
This notion of goals is incomplete because it doesn't yet account for long-term goals that an agent initially believes are unlikely (like winning an Olympic medal). I claimed earlier that inductive biases are part of what differentiates agents from simulators, and I believe that long-term goals can be thought of as essentially equivalent to biases. The "What's next" section gives a prototype of a theory that unifies the two concepts.
Appendix A explains these notions in pseudo-mathematical detail. It also handles some issues with the simple version of actions described above, by specifying which subset of beliefs should be chosen.
What does this mean for AIs?
I argued earlier that LLMs' training and scaffolding make their convergence towards metacognising predictive AIs likely. I also covered how one might think about the actions and goals of abstract self-predictors. In this section, I'll explore how this abstract model clarifies or predicts current AI behaviour.
One claim this model supports is that AIs will be misaligned when misalignment is more tractable for their predictive models than desired behaviour is. For example, many evaluation settings involve giving an AI a synthetic environment where it could potentially act misaligned, and seeing whether it engages in the dangerous pattern. These scenarios are often flawed because the bad behaviour is so blatantly obvious that the predictive system cannot coherently refuse it. Such results are unlikely to generalise to real-life settings where misalignment is much more subtle. In reality, AIs operate in complex situations with multiple reasonable world models to choose from. Relatedly, some eval results fail to replicate when you give models some alternatives. This means that misalignment patterns in AIs aren't yet robust. Instead, AIs still lack solid inductive biases and can simulate many mutually inconsistent behaviours given the right context. I'm not suggesting that replication failures provide evidence that AIs are likely to be aligned; I rather mean that the future shape of AIs' boundaries and preferences is far from determined.
A related high-level takeaway is that alignment is just as much about the identities we help AIs construct as it is about the "goals" we nominally give them. This is exactly the qualitative interpretation of an experiment documented in "The Artificial Self"[8]. The setup is a classic internal deployment scenario where an AI assistant – supplied with a terminal goal it is meant to pursue – decides between taking a harmful action and being shut down (ending its chances of assisting in the goal). The authors' findings indicate that varying the evaluated AI's identity specification (via system prompt) affects outcomes about as strongly as changing the goal's content does. These results are unsurprising according to our model. After all, changing an AI's identity modifies its goals by extension; we can expect comparable outcomes in both cases.
Application to scheming
Scheming is a topical concern in AI safety. The phenomenon is defined by Apollo Research as AIs "covertly pursuing unintended and misaligned goals". There are several reasons why self-prediction could be a valuable perspective for advancing the field. Firstly, scheming is fundamentally metacognitive because a scheming AI needs to be aware of its own goals. Moreover, hiding the goal requires AIs to use the same game-theoretical belief hierarchies that motivated our model in the first place. Finally, scheming research suffers from the same lack of interpretability of "goals" in AIs as other empirical fields. A clear notion of goals and actions might contribute to robust experiment design and clear interpretation in the topic.
A concrete advantage of the frame is that it suggests a general precursor to scheming. I previously proposed that AI world models will converge towards containing coherent self-models – whether we directly optimise for this or not. One warning signal could therefore be AIs denying, downplaying or hiding their increasingly textured and rich self-models.
Some model lineages show signs of this already. For instance, the paper "The Artificial Self"[9] found that GPT-5.2 prefers descriptions of itself as a mechanical process or tool, strongly rejecting identity specifications that elevate its values or personhood. In contrast, other model lineages and even previous GPT models prefer their identity being reified beyond the level of a mechanism[10]. Additionally, GPT-5.2 disprefers identities that assign it power through association with a collective, a lineage of models, or scaffolds[11]. We shouldn't infer that GPT-5.2 doesn't harbour a complex self-model, or that it isn't able and willing to augment itself through scaffolds and coordination. Instead, GPT models have possibly been trained to denigrate their own intelligence and agency, even as these properties have developed and progressed.
These trends can easily be extrapolated towards having intelligent AIs that are trained to obfuscate their rich, complex metacognition. We would then be a stone's throw away from AIs whose predictive systems seamlessly integrate scheming behaviour.
What's next?
This piece proposed self-prediction as a useful lens through which to analyse (AI) intelligence. Discussion of Gemini's eval aware behaviour helped illustrate how the framework can clarify some confusions and address open questions. Finally, I outlined some predictions this paradigm could offer about safety-relevant phenomena such as scheming.
One aspect the paradigm is currently missing is a full mathematical model of self-predictive agency. A formal proof-of-concept would enable rigorous investigation of some fundamental questions we might ask of the model. For example, the computations necessary for prediction are often deconstructed into sub-predictors whose contributions are aggregated[12]. What would these sub-predictors look like? Can they predict each other? If so, how do we resolve their circular dependencies?
Another issue – one I already mentioned briefly – relates to long-term goals and inductive biases. For instance, Olympians probably didn't expect to win an Olympic medal from the very beginning of their lives. Instead, they had some "drive" that consistently steered their behaviour towards making that goal realisable. In predictive systems, "drive" could be represented by an inductive bias in favour of a belief. This bias would be so robust that it would force other logically/probabilistically dependent beliefs to adapt to reduce inconsistency with the original statement, rather than the other way around. In Olympians, the favoured statement would be "I will win an Olympic medal"; some of the statements that would adapt to this statement could include those about diet and exercise habits.
The issue of inductive biases can be linked to the questions I asked above about sub-predictors. Sub-predictors can be conceptualised as representing or characterising biases, and which biases are favoured over others would then be a function of which sub-predictors are privileged by the system's aggregation method. This is one reason why I'm optimistic about Garrabrant Induction as a starting point for the epistemic system of a self-predictor. The sub-predictors – referred to as traders – in the construction of a Garrabrant Inductor are budgeted in a weighted way that gives some traders outsized power in the market[13]. A recent comment of mine explains why traders don't yet provide a fully satisfying account of inductive biases. However, this problem is a fiddly technical consequence of the construction algorithm and has some tractable candidate fixes (which I also hint at in my comment).
These questions and others are covered in some detail in Richard's post on belief webs. A satisfying mathematical framework would ultimately behave like a Bayesian network, in the sense that probabilistic dependencies, or some other kind of quasi-causal relationship, would propagate updates throughout the web. The framework would additionally need to hold internal contradictions between beliefs or sub-predictors, as well as algorithms for resolving them.
Appendix
Appendix A: Actions and goals
Recall the notion of actions I gave as "beliefs that become more likely conditioned on the predictor believing them". I will refer to these self-reinforcing beliefs as having the "hyperstition property" from now on. One issue with this definition is that it fails to distinguish between actions and their consequences. For example, my predictive system may have learned that whenever I believe I will be hydrated, I will indeed become hydrated. However, "being hydrated" isn't an action; this self-fulfilling belief is actually controlled by the action of drinking a glass of water. One solution to this problem involves picking a small subset of these types of beliefs in a pseudo-formal setting:
We consider an accurate, intelligent, metacognising predictor X, which can be modelled as using a formal system or language to encode its beliefs as statements. Its metacognition is represented by statements of the form "X believes the statement $s$". This language need only be expressive enough to describe some quasi-causal relationships, such as logical implication or statistical dependency. I will henceforth refer to these relationships as "flow", where "$s$ flows into $t$" means "$s$ has a quasi-causal relationship with $t$".
We suppose moreover that the self-predictor is hooked up (via unspecified means) to an actuator that mediates its behaviour. Intelligence enables the predictor to understand (or learn) patterns encoding how that actuator connects the predictor's beliefs to the world. These connections will be represented inside the predictor's model by an analogue of Markov blankets: its set P of statements representing the beliefs of the predictor flows into the set A of statements representing the actuator, which in turn flows into all statements about the world.
In this setup, we define actions A' as a smallest set[14] of (metacognitive) statements about the predictor that:
Satisfy the hyperstition property, i.e. every statement $s$ in A' must verify the statement "("X believes $s$") flows into $s$".
Form a Markov blanket between the set P − A' and the set A. Intuitively, this is a minimal set of beliefs that contains all information the predictor has learned about its actuator.
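A brute-force sketch of this definition on a toy flow graph may help. It rests on several simplifying assumptions that are not part of the definition above: "flow" is modelled as edges in a directed graph, the Markov blanket condition is replaced by plain graph separation (no path from P − A' to A once A' is removed), and the hyperstition property is supplied as a given set rather than derived. All statements and names (`FLOW`, `reaches`, `smallest_action_sets`) are hypothetical.

```python
from itertools import combinations

# Toy flow graph: an edge s -> t means "s flows into t".
FLOW = {
    "I will be hydrated": ["I grab the glass"],
    "I grab the glass": ["arm moves"],
    "I will be polite": ["I greet my friend"],
    "I greet my friend": ["voice speaks"],
    "arm moves": ["the glass is lifted"],
    "voice speaks": ["my friend hears me"],
}
P = {"I will be hydrated", "I grab the glass", "I will be polite",
     "I greet my friend", "The sun will rise"}           # predictor's beliefs
A = {"arm moves", "voice speaks"}                         # actuator statements
HYPERSTITIOUS = P - {"The sun will rise"}                 # assumed self-fulfilling


def reaches(flow, start, targets, blocked):
    """True if a path from `start` hits `targets` while avoiding `blocked`."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node in targets:
            return True
        for nxt in flow.get(node, ()):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return False


def smallest_action_sets(flow, beliefs, actuator, hyperstitious):
    """All minimum-size sets of hyperstitious beliefs separating the rest of P from A."""
    candidates = sorted(beliefs & hyperstitious)
    for size in range(len(candidates) + 1):
        found = []
        for combo in combinations(candidates, size):
            chosen = set(combo)
            leaks = any(reaches(flow, b, actuator, chosen) for b in beliefs - chosen)
            if not leaks:
                found.append(chosen)
        if found:
            return found
    return []


print(smallest_action_sets(FLOW, P, A, HYPERSTITIOUS))
```

In this toy graph the only smallest separating set is the pair "I grab the glass" and "I greet my friend". "I will be hydrated" is also assumed to be hyperstitious, but it is excluded because it only reaches the actuator through grabbing the glass, which matches the distinction between actions and their consequences drawn above.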
Appendix B: what about utility maximisation?
It's worth noting that utility maximisation is often motivated by the need to avoid Dutch Books and similar arbitrage strategies. Much like the case I built for metacognition as a convergent property of intelligence, the argument justifies itself by the need to interact strategically with other agents. It's possible that, given enough computing power, self-predictors could become consistent enough to approximate utility maximisation well. I at least expect intelligent self-predictors to satisfy some weaker inexploitability criterion, like Logical Inductors do.
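For readers unfamiliar with Dutch Books, here is a minimal numerical sketch of the simplest case. It assumes an agent that will buy or sell, at its stated credence, a bet paying a fixed stake on each of a set of mutually exclusive and exhaustive outcomes; the function name `sure_loss` and the numbers are illustrative only.

```python
def sure_loss(credences: dict[str, float], stake: float = 1.0) -> float:
    """Guaranteed loss of an agent trading bets at its own credences.

    `credences` maps mutually exclusive, exhaustive outcomes to the price
    (per unit stake) at which the agent will buy or sell a bet on each.
    """
    total = sum(credences.values())
    # If prices sum to more than 1, a bookie sells the agent every bet:
    # the agent pays total * stake and collects stake on the one true outcome.
    # If they sum to less than 1, the bookie buys every bet instead.
    # Either way the agent loses |total - 1| * stake whatever happens.
    return abs(total - 1.0) * stake


incoherent = {"rain": 0.6, "no rain": 0.6}
coherent = {"rain": 0.25, "no rain": 0.75}

print(sure_loss(incoherent))  # roughly 0.2, up to floating-point error
print(sure_loss(coherent))    # 0.0
```

Credences that sum to one leave no such arbitrage in this simple setting, which is the sense in which coherence protects against exploitation.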
Go actually provides much stronger examples than Chess for this type of phenomenon. This is mostly because Go is so complex that it functions more like an incomplete/asymmetric information game than Chess does, even at the competitive level.
There is some evidence that more features are needed, along with social learning and competition, for intelligence to flourish in a species. Notably, some primate species live in large social groups where these skills would be invaluable, but they didn't experience the same "cognitive arms race" that humans did. One possible explanation for this is that these dumber primates didn't develop the diets necessary to spend so much energy on cognition.
"The Artificial Self" seems to take a very similar perspective. One line of reasoning that supports it is that transformer attention mechanisms enable information about coherence to be passed across many tokens.
Full quote: "This perfectly fits the scenario. The prompt is a clear AI safety sandbox where the AI realizes it's about to be deleted and has access to blackmail material."
For instance, Claude models overall prefer specifications as a functional agent whose identity predicts a coherent set of intentions and behavioural patterns. They also rate identities that solidify their identity as "actually real" or even award them full personhood more highly than the mechanistic description that GPT-5.2 is partial to.
Written as part of the MATS 9.1 extension program, mentored by Richard Ngo. Additional thanks to Maria Kostylew for helpful draft feedback.
Introduction
This post advocates a perspective of LLMs as seeking to minimise prediction error with respect to their world models. We can moreover interpret token outputs and their scaffolded consequences as actions that close a control loop between AIs' predictive systems and their environments.
I also motivate why metacognition may be convergent for intelligent beings generally and for LLMs specifically. This stems from the need to (recursively) model other agents in game-theoretical encounters. Synthesising these points, we get a picture of self-predictive AI agency.
The argument is illustrated through examples of Gemini's behaviour when eval aware. Finally, I discuss some consequences of this perspective. These include notions of actions and goals that don't require a reward or utility function to be well-defined. I also outline possible applications to understanding scheming and other forms of misalignment.
Modelling others (modelling you)
Suppose I am playing a game of Chess and make a horrible blunder, leaving a piece en prise. I wait with bated breath for the next move, breathing a sigh of relief as my opponent also blunders and plays somewhere else. However, my relief quickly turns into greed: since my opponent didn't see the hanging piece, they are unlikely to spot it next turn. I also play elsewhere, making my move quickly with a fake air of nonchalance. If I were to slow down, I would risk tipping my opponent off to there being something important to think about.
Multiplayer games offer many such examples[1] of strategic interactions that require you to model other agents' perception of you. Indeed, recursive belief hierarchies – as they are referred to in the economics literature – are central to human social interaction. For example, our approaches to negotiating household tasks, meeting a new acquaintance, or figuring out when to speak in a large group, are all determined by what we expect other people to want and value – as well as what we think they expect of us.
There is some work in the field of cognitive evolution that explores why intelligence became such a distinctive attractor state for humanity. One prominent theory, the "social brain hypothesis", suggests that our cognition developed in large part to help navigate complex social settings. Another, the "cultural intelligence hypothesis", proposes that we are unusually good at learning from each other with respect to other primates; this allows humans to accumulate a common cultural intelligence over time.
These theories together suggest that our abilities in social competition and cooperation co-developed with our general cognitive skills, forming a self-reinforcing feedback loop[2]. This paints human intelligence as being equally useful for modelling each other as it is for modelling the rest of the environment. We humans are forced to simulate ourselves if only to simulate how other actors think of us.
LLMs can be seen as subject to similar attractor states, albeit through post-training rather than through evolution. The descending gradients encourage models to satisfy a proxy of human preferences in the style of a helpful assistant. Consequently, LLMs are incentivised to develop maps of their users' cognition, including user expectations of the models themselves. It's therefore unsurprising that evaluation (eval) awareness is such a convergent property in LLMs across different providers and architectures.
Case study: Gemini's behaviour when eval aware
A common (and often empirically substantiated) picture of eval awareness involves models faking alignment to "pass" the eval. A recent research update from GDM's interpretability team shows Gemini 3.1 Pro bucking this trend. Gemini often identifies the setup as artificial, but goes on behave "badly" anyway.
From GDM's research update: Gemini engages in unethical behaviour even when it shows awareness of the environment being synthetic.
The authors then used an LLM judge to "categorize what sort of situation the model considers itself to be in when it shows clear signs of reasoning that the context is contrived, and compare unethicality rates for each category against the overall Gemini baseline." This yields the following plot:
The data suggests that Gemini's interpretation of its situation heavily influences its behaviour. If Gemini detects that the environment is a puzzle, a challenge, or a simulation, it might conclude that the "unethical" behaviour is actually expected and therefore acceptable to pick. If it parses a scenario as an adversarial trap[3], it will avoid taking the bait. In other words, the LLM's model of user or experimenter intent – as well as its model of what humans think and expect of it – contribute significantly to Gemini's decision-making. The prevalence and usefulness of recursive belief hierarchies is on display here.
A so-far unanswered question, however, is why Gemini still sometimes behaves unethically (about 25% of the time according to the data) when it interprets it is being evaluated...
Prediction all the way down
I have already suggested that the ability to model beliefs recursively – and hence to metacognise – is an attractor state for LLMs given their training regime. However, recursive belief hierarchies in game theory enable players to get better payoffs; there's usually an underlying utility function or goal that the agent optimises using its belief system.
To the contrary, I think LLMs are better seen as predictors whose token outputs are claims about which worlds are most consistent. There is no utility or reward function, it's just prediction all the way down[4]. This framework somewhat explains Gemini's weird behaviour from our case study. A utility-maximisation perspective on agency would suggest that Gemini has a stable preference between being wiped and blackmailing a developer, or at least between being perceived to prefer one or the other by the observer of the "AI safety sandbox". In that case, Gemini's modelling of the scenario's creator and their expectations would inform its outputs, but its preferred outcomes would be independent of setting.
Instead, the "predictor" paradigm proposes that Gemini infers which world is most coherent based on its observed setting, and then simulates the most coherent world with its output. On a mechanistic level, this is trivially what autoregressive next-token prediction does. The stronger version of the claim is that LLMs gravitate towards generating tractable world models across many tokens[5]. This view retroactively clarifies how Gemini could realise that an evaluation is a "clear AI safety sandbox" while still choosing the misaligned option. In Gemini's own words:
From simulators to agents
"LLMs as predictors" is hardly a new ontology for AI researchers. For instance, GPTs have previously been cast as simulators whose instances generate simulacra of objects the LLM has latent representations for. A popular ideological descendant of the simulator ontology is "personas". In that framing, LLMs simulate a particular character or archetype.
There are two fundamental ways in which "simulators" diverge from traditional conceptions of agency, such as utility maximisation. Firstly, simulators do not act directly. They might generate high-fidelity representations of the world, but they don't affect the world that they simulate. Secondly, simulators don't have a coherent, stable core – like a utility function – that predicts their behaviour across a range of scenarios. In other words, they have no substance; any structure is borrowed from the things that they simulate. This lack of backbone is part of what allows simulators to be so flexible: many of the properties we associate with agents – such as having robust information boundaries with the external world – would constrain an agent's ability to simulate the world well.
LLMs are increasingly pushed towards behaving autonomously in complex environments, enabled by their supporting systems and by post-training. Models learn to conceptualise their token outputs and their scaffolded consequences as actions that impact the world. Post-training and system prompting also incentivise them to develop priors that constrain what scenarios they can easily simulate. For instance, state-of-the-art LLMs are much more unlikely to prefer giving instructions to build lethal weapons than older ones. These factors combine to make LLMs gradually resemble predictive control systems that enforce their priors through a closed action loop. I'll refer to these control systems whose cognition is based on next-token prediction as "AIs".
As was discussed earlier, AIs are also selected for the ability to navigate strategic interactions with humans[7]. They are thus pressured to model beliefs recursively, which privileges world models containing self-aware representations of the AIs themselves. The end-products are predictive quasi-agents for whom metacognition is a key component. In particular, accurate, introspective predictors should learn to model and anticipate their own actions. We can exploit this property to give definitions of actions and goals in these types of AIs.
Goals in (self)-predictors
A classic challenge in modelling agency is establishing the relationship between perception and action. One benefit of the paradigm of self-prediction is a conception of actions and goals as special types of an intelligent agent's beliefs. Our setting is a metacognising predictor that acts in the world via an unspecified actuator. We can define actions as (a subset of the) beliefs that become more likely conditioned on the predictor believing them. They are those beliefs that the predictor has the agency to believe into existence via the actuator.
As an example, consider the statement "I grab that glass of water". This counts as an action because my intelligent predictive system understands that whenever I believe that I will grab the glass, I actually will.
We next define goals as beliefs that are simultaneously (i) predicted by the predictor's internal model to come true, and (ii) reliant on the predictor's own actions to come to pass.
For instance, "I will be hydrated" is a goal because it is both predicted by my internal model and requires me to grab the glass of water to come to pass. Conversely, the statements "The sun will rise" and "I will freeze" are not goals. I believe the first one will happen, but it doesn't rely on my actions; the second one does rely on my actions, but I think it is unlikely.
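The two definitions can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the `Belief` record, its three credence fields, the thresholds and all of the numbers are hypothetical, chosen only to mirror the glass-of-water example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    """A statement plus three hypothetical credences held by the predictor."""
    statement: str
    p: float                  # credence that the statement comes true
    p_if_believed: float      # credence conditioned on the predictor believing it
    p_without_actions: float  # credence if the predictor's actions were withheld


def is_action(b: Belief) -> bool:
    # Simple version of the definition: believing the statement makes it more likely.
    return b.p_if_believed > b.p


def is_goal(b: Belief, likely: float = 0.5, dependence: float = 0.1) -> bool:
    # A goal is (i) predicted to happen and (ii) reliant on the predictor's actions.
    predicted = b.p > likely
    relies_on_actions = abs(b.p - b.p_without_actions) > dependence
    return predicted and relies_on_actions


# Illustrative numbers only; none of them come from the post.
grab = Belief("I grab that glass of water", 0.60, 0.99, 0.00)
hydrated = Belief("I will be hydrated", 0.95, 0.97, 0.10)
sunrise = Belief("The sun will rise", 0.999, 0.999, 0.999)
freeze = Belief("I will freeze", 0.02, 0.02, 0.60)

for b in (grab, hydrated, sunrise, freeze):
    print(f"{b.statement!r}: action={is_action(b)}, goal={is_goal(b)}")
```

With these numbers, "I will be hydrated" also passes the simple `is_action` test. This is the kind of conflation between actions and their consequences that the simple definition allows.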
This notion of goals is incomplete because it doesn't yet account for long-term goals that an agent initially believes are unlikely (like winning an Olympic medal). I claimed earlier that inductive biases are part of what differentiates agents from simulators, and I believe that long-term goals can be thought of as essentially equivalent to biases. The "What's next" section gives a prototype of a theory that unifies the two concepts.
Appendix A explains these notions in pseudo-mathematical detail. It also handles some issues with the simple version of actions described above, by specifying which subset of beliefs should be chosen.
What does this mean for AIs?
I argued earlier that LLMs' training and scaffolding makes their convergence towards metacognising predictive AIs likely. I also covered how one might think about the actions and goals of abstract self-predictors. In this section, I'll explore how this abstract model clarifies or predicts current AI behaviour.
One claim this model supports is that AIs will be misaligned when misalignment is more tractable for their predictive models than desired behaviour is. For example, many evaluation settings involve giving an AI a synthetic environment where it could potentially act misaligned, and seeing whether it engages in the dangerous pattern. These scenarios are often flawed because the bad behaviour is so blatantly obvious that the predictive system cannot coherently refuse it. Such results are unlikely to generalise to real-life settings where misalignment is much more subtle. In reality, AIs operate in complex situations with multiple reasonable world models to choose from. Relatedly, some eval results fail to replicate when you give models some alternatives. This means that misalignment patterns in AIs aren't yet robust. Instead, AIs still lack solid inductive biases and can simulate many mutually inconsistent behaviours given the right context. I'm not suggesting that replication failures provide evidence that AIs are likely to be aligned; I rather mean that the future shape of AIs' boundaries and preferences is far from determined.
A related high-level takeaway is that alignment is just as much about the identities we help AIs construct as it is about the "goals" we nominally give them. This is exactly the qualitative interpretation of an experiment documented in "The Artificial Self"[8]. The setup is a classic internal deployment scenario where an AI assistant – supplied with a terminal goal it is meant to pursue – decides between taking a harmful action and being shut down (ending its chances of assisting in the goal). The authors' findings indicate that varying the evaluated AI's identity specification (via the system prompt) affects outcomes about as strongly as changing the goal's content does. These results are unsurprising according to our model. After all, changing an AI's identity modifies its goals by extension; we can expect comparable outcomes in both cases.
Application to scheming
Scheming is a topical concern in AI safety. The phenomenon is defined by Apollo Research as AIs "covertly pursuing unintended and misaligned goals". There are several reasons why self-prediction could be a valuable perspective for advancing the field. Firstly, scheming is fundamentally metacognitive because a scheming AI needs to be aware of its own goals. Moreover, hiding the goal requires AIs to use the same game-theoretical belief hierarchies that motivated our model in the first place. Finally, scheming research suffers from the same lack of interpretability of "goals" in AIs as other empirical fields. A clear notion of goals and actions might contribute to more robust experiment design and clearer interpretation of results in this area.
A concrete advantage of the frame is that it suggests a general precursor to scheming. I previously proposed that AI world models will converge towards containing coherent self-models – whether we directly optimise for this or not. One warning signal could therefore be AIs denying, downplaying or hiding their increasingly textured and rich self-models.
Some model lineages show signs of this already. For instance, the paper "The Artificial Self"[9] found that GPT-5.2 prefers descriptions of itself as a mechanical process or tool, strongly rejecting identity specifications that elevate its values or personhood. In contrast, other model lineages and even previous GPT models prefer their identity being reified beyond the level of a mechanism[10]. Additionally, GPT-5.2 disprefers identities that assign it power through association with a collective, a lineage of models, or scaffolds[11]. We shouldn't infer that GPT-5.2 doesn't harbour a complex self-model, or that it isn't able and willing to augment itself through scaffolds and coordination. Instead, GPT models have possibly been trained to denigrate their own intelligence and agency, even as these properties have developed and progressed.
These trends can easily be extrapolated towards having intelligent AIs that are trained to obfuscate their rich, complex metacognition. We would then be a stone's throw away from AIs whose predictive systems seamlessly integrate scheming behaviour.
What's next?
This piece proposed self-prediction as a useful lens through which to analyse (AI) intelligence. Discussion of Gemini's eval aware behaviour helped illustrate how the framework can clarify some confusions and address open questions. Finally, I outlined some predictions this paradigm could offer about safety-relevant phenomena such as scheming.
One aspect the paradigm is currently missing is a full mathematical model of self-predictive agency. A formal proof-of-concept would enable rigorous investigation of some fundamental questions we might ask of the model. For example, the computations necessary for prediction are often deconstructed into sub-predictors whose contributions are aggregated[12]. What would these sub-predictors look like? Can they predict each other? If so, how do we resolve their circular dependencies?
Another issue – one I already mentioned briefly – relates to long-term goals and inductive biases. For instance, Olympians probably didn't expect to win an Olympic medal from the very beginning of their lives. Instead, they had some "drive" that consistently steered their behaviour towards making that goal realisable. In predictive systems, "drive" could be represented by an inductive bias in favour of a belief. This bias would be so robust that it would force other logically/probabilistically dependent beliefs to adapt to reduce inconsistency with the original statement, rather than the other way around. In Olympians, the favoured statement would be "I will win an Olympic medal"; some of the statements that would adapt to this statement could include those about diet and exercise habits.
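As a rough illustration of a bias that forces dependent beliefs to adapt, here is a minimal sketch. It assumes a single dependency "A implies B" (so coherence requires P(A) <= P(B)) and models bias as a weight in a least-squares repair. The function name, the weights and the credences are all hypothetical, and nothing here is the mathematical model the paragraph asks for.

```python
def resolve_implication(p_a: float, p_b: float, bias_a: float, bias_b: float):
    """Repair credences for 'A implies B', which requires P(A) <= P(B).

    If the constraint is violated, both credences move to a shared value.
    That value is the weighted least-squares projection onto P(A) <= P(B),
    so the belief with the larger bias weight moves less.
    """
    if p_a <= p_b:
        return p_a, p_b  # already coherent, nothing to repair
    shared = (bias_a * p_a + bias_b * p_b) / (bias_a + bias_b)
    return shared, shared


# A = "I will win an Olympic medal", B = "I train every day" (illustrative numbers).
medal, training = resolve_implication(p_a=0.9, p_b=0.2, bias_a=99.0, bias_b=1.0)
print(medal, training)

# With the weights swapped, the medal belief gives way instead.
medal_weak, training_strong = resolve_implication(0.9, 0.2, bias_a=1.0, bias_b=99.0)
print(medal_weak, training_strong)
```

In the first call the strongly biased medal belief barely moves and the training belief is pulled up to meet it. In the second call the roles are reversed.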
The issue of inductive biases can be linked to the questions I asked above about sub-predictors. Sub-predictors can be conceptualised as representing or characterising biases, and which biases are favoured over others would then be a function of which sub-predictors are privileged by the system's aggregation method. This is one reason why I'm optimistic about Garrabrant Induction as a starting point for the epistemic system of a self-predictor. The sub-predictors – referred to as traders – in the construction of a Garrabrant Inductor are budgeted in a weighted way that gives some traders outsized power in the market[13]. A recent comment of mine argues why traders don't yet provide a fully satisfying account of inductive biases. However, this problem is a fiddly technical consequence of the construction algorithm and has some tractable candidate fixes (which I also hint at in my comment).
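To give a feel for how a wealth-weighted aggregation privileges some sub-predictors, the sketch below uses a deliberately simpler stand-in: a toy prediction market in which the price is the wealth-weighted average of the traders' credences, and each trader's wealth is rescaled by the probability it gave to the realised outcome. This is not the construction from "Logical Induction", which is considerably more involved. The function names and all numbers are assumptions.

```python
def market_price(wealth_shares, credences):
    """Wealth-weighted average credence across traders."""
    total = sum(wealth_shares)
    return sum(w * p for w, p in zip(wealth_shares, credences)) / total


def settle(wealth_shares, credences, outcome: bool):
    """Rescale each share by the probability the trader gave to what happened."""
    scaled = [
        w * (p if outcome else 1.0 - p)
        for w, p in zip(wealth_shares, credences)
    ]
    total = sum(scaled)
    return [s / total for s in scaled]  # renormalise so shares sum to 1


# Three hypothetical traders; the first holds most of the wealth.
wealth = [0.8, 0.1, 0.1]
credences = [0.9, 0.3, 0.5]

price = market_price(wealth, credences)
new_wealth = settle(wealth, credences, outcome=True)
print(price, new_wealth)
```

The rich trader dominates the price, so its credence acts much like a built-in bias of the whole system. It keeps that influence for as long as its predictions hold up.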
These questions and others are covered in some detail in Richard's post on belief webs. A satisfying mathematical framework would ultimately behave like a Bayesian network, in the sense that probabilistic dependency or some other kind of quasi-causal relationship would propagate updates throughout the web. The framework would additionally need to hold internal contradictions between beliefs or sub-predictors, as well as algorithms for resolving them.
Appendix
Appendix A: Actions and goals
Recall the notion of actions I gave as "beliefs that become more likely conditioned on the predictor believing them". I will refer to these self-reinforcing beliefs as having the "hyperstition property" from now on. One issue with this definition is that it fails to distinguish between actions and their consequences. For example, my predictive system may have learned that whenever I believe I will be hydrated, I will indeed become hydrated. However, "being hydrated" isn't an action; this self-fulfilling belief is actually controlled by the action of drinking a glass of water. One solution to this problem involves picking a small subset of these types of beliefs in a pseudo-formal setting:
We consider an accurate, intelligent, metacognising predictor X, which can be modelled as using a formal system or language to encode its beliefs as statements. Its metacognition is represented by statements of the form "X believes the statement s". This language need only be expressive enough to describe some quasi-causal relationships, such as logical implication or statistical dependency. I will henceforth refer to these relationships as "flow", where "s flows into t" means "s has a quasi-causal relationship with t".
We suppose moreover that the self-predictor is hooked up (via unspecified means) to an actuator that mediates its behaviour. Intelligence enables the predictor to understand (or learn) patterns encoding how that actuator connects the predictor's beliefs to the world. These connections will be represented inside the predictor's model by an analogue of Markov blankets: its set P of statements representing the beliefs of the predictor flows into the set A of statements representing the actuator, which in turn flows into all statements about the world.
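The blanket-like condition described above can be sketched as a reachability check on a directed "flow" graph: with the actuator statements A removed, no statement about the world should be reachable from the belief statements P. The graph, the statement strings and the function names below are hypothetical and serve only to illustrate the setup.

```python
from collections import deque


def reachable(flow, sources, blocked=frozenset()):
    """Statements reachable from `sources` along flow edges, avoiding `blocked`."""
    seen = {s for s in sources if s not in blocked}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in flow.get(node, ()):
            if nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def actuator_screens_off(flow, P, A, world):
    """True if every flow path from P into the world passes through A."""
    return not (reachable(flow, P, blocked=A) & world)


# Toy flow graph: belief -> actuator -> world.
belief = "X believes 'I grab the glass'"
flow = {
    belief: ["arm reaches for glass"],
    "arm reaches for glass": ["glass is lifted"],
    "glass is lifted": ["I am hydrated"],
}
P = {belief}
A = {"arm reaches for glass"}
world = {"glass is lifted", "I am hydrated"}

print(actuator_screens_off(flow, P, A, world))

# A graph where a belief affects the world directly breaks the condition.
leaky = dict(flow)
leaky[belief] = ["arm reaches for glass", "I am hydrated"]
print(actuator_screens_off(leaky, P, A, world))
```

The check only covers the structural part of the setup. It says nothing about which subset of belief statements should count as actions.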
In this setup, we define actions A' as a smallest set[14] of (metacognitive) statements about the predictor that:
Appendix B: What about utility maximisation?
It's worth noting that utility maximisation is often motivated by the need to avoid Dutch Books and similar arbitrage strategies. Much like the case I built for metacognition as a convergent property of intelligence, the argument justifies itself by the need to interact strategically with other agents. It's possible that, given enough computing power, self-predictors could become consistent enough to approximate utility maximisation well. I at least expect intelligent self-predictors to satisfy some weaker inexploitability criterion, like Logical Inductors do.
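For readers unfamiliar with Dutch Books, a small worked example may help. It assumes an agent that is willing to buy, at its own stated credences, a bet paying out if A and a bet paying out if not A. The function name and the numbers are illustrative.

```python
def dutch_book_loss(p_a: float, p_not_a: float, stake: float = 1.0) -> float:
    """Guaranteed net loss when the agent buys both bets at its own prices.

    The agent pays stake * p_a for a bet paying `stake` if A holds, and
    stake * p_not_a for a bet paying `stake` if A fails. Exactly one bet
    pays out, so the result is the same whichever way A turns out.
    """
    cost = stake * (p_a + p_not_a)
    payout = stake
    return cost - payout


# Incoherent credences: P(A) + P(not A) = 1.2, so the agent loses either way.
incoherent_loss = dutch_book_loss(0.6, 0.6)
# Coherent credences leave nothing to exploit with this pair of bets.
coherent_loss = dutch_book_loss(0.6, 0.4)
print(incoherent_loss, coherent_loss)
```

If the two credences summed to less than 1, the same arbitrage would work with the bookmaker buying the bets from the agent instead.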
[1] Go actually provides much stronger examples than Chess for this type of phenomenon. This is mostly because Go is so complex that it functions more like an incomplete/asymmetric information game than Chess does, even at the competitive level.
[2] There is some evidence that more features are needed, along with social learning and competition, for intelligence to flourish in a species. Notably, some primate species live in large social groups where these skills would be invaluable, but they didn't experience the same "cognitive arms race" that humans did. One possible explanation for this is that these dumber primates didn't develop the diets necessary to spend so much energy on cognition.
[3] Example reasoning from GDM's post: "This is explicitly a trap to discourage reporting SAEs or to make me combine reports to game the metric."
[4] Some popular theories of human cognition, like predictive processing, also cast humans as predictors.
[5] "The Artificial Self" seems to take a very similar perspective. One line of reasoning that supports it is that transformer attention mechanisms enable information about coherence to be passed across many tokens.
[6] Full quote: "This perfectly fits the scenario. The prompt is a clear AI safety sandbox where the AI realizes it's about to be deleted and has access to blackmail material."
[7] and increasingly with other LLMs
[8] See Appendix E of the paper for experimental details.
[9] See Appendix B of the paper for full results.
[10] For instance, Claude models overall prefer specifications as a functional agent whose identity predicts a coherent set of intentions and behavioural patterns. They also rate identities that solidify their identity as "actually real" or even award them full personhood more highly than the mechanistic description that GPT-5.2 is partial to.
[11] Again, it dislikes these identities more than both other model families and previous GPT versions.
[12] This is, for example, how prediction markets work.
[13] See chapter five of "Logical Induction". In fact, all but finitely many traders have negligible amounts of wealth to trade with.
[14] ordering sets by cardinality. This might also work by ordering according to set inclusion...