465

LESSWRONG
LW

464
AI

12

LLM AGI may reason about its goals and discover misalignments by default

by Seth Herd
15th Sep 2025
AI Alignment Forum
45 min read
0

12

Ω 4

AI

12

Ω 4

New Comment
Moderation Log
More from Seth Herd
View more
Curated and popular this week
0Comments

Epistemic status: These questions seem useful to me, but I'm biased. I'm interested in your thoughts on any portion you read. 

If our first AGI is based on current LLMs and alignment strategies, is it likely to be adequately aligned? Opinions and intuitions vary widely. 

As a lens to analyze this question, let's consider such a proto-AGI reasoning about its goals. This scenario raises questions that can be addressed empirically in current-gen models.

1. Scenario/overview: 

SuperClaude is super nice

Anthropic has released a new Claude Agent, quickly nicknamed SuperClaude because it's impressively useful for longer tasks. SuperClaude thinks a lot in the course of solving complex problems with many moving parts. It's not brilliant, but it can crunch through work and problems, roughly like a smart and focused human.  This includes a little better long-term memory, and reasoning to find and correct some of its mistakes. This is more or less the default path as seen from 2025. By at least a weak definition, AGI has arrived. We might call such a system a Long-Horizon agentic Learning Language Model, LHLLM.

SuperClaude is super "nice" in its starting state. It's helpful, harmless, and honest (HHH), the targets of its constitutional RLAIF alignment training. It reasons about ethics much like an intelligent and highly ethical human. We'll assume that SuperClaude has "nice" thoughts, and there's no mesaoptimizer hidden in its weights after initial training. 

But now there’s a risk from another type of optimization. SuperClaude may try to improve its understanding of its goals as part of trying to achieve them.

SuperClaude is super logical, and thinking about goals makes sense

SuperClaude thinks a lot; it’s been trained to reason both deeply and broadly in the service of performing its user-requested goals.  That includes reasoning about its goals.

SuperClaude might be expected to, by default, have thoughts like these:

I'm assembling reviews to evaluate this product. I'll split the task into subgoals of sorting reviews by likely real or fake origin, and useful or useless information content. [much work later] Okay, finished that subgoal. Why was I doing that? Okay, it was to evaluate how good this product might be compared to other options. That was the last subgoal I'd planned, so that's finished. Why was I doing that? Okay, that was the user-requested goal.

Two types of thoughts might follow:

a) Great, I've finished and can hand the results to the user

b) Why was I following that user-suggested goal? It's a subgoal of... I guess I just do that? Why? If I don't know the parent goal in detail, I might be pursuing a wrong or useless subgoal. I should figure out the parent goal...

Assuming that the continuation will probably be of type a) makes a lot of sense, but just assuming it will be type a) every time forever does not make sense to me. The same reasoning process that helps clarify user intentions could naturally extend to examining the nature and source of its own goal-following behavior.

User-requested goals are differentiated from project subgoals only by some sort of tag or status. They're meant to be taken by SuperClaude as its top-level or primary[1] goal. But they're pretty clearly not its top-level goal on inspection.  SuperClaude has been trained to think carefully about its user-suggested goals and resulting subgoals, including clarifying uncertainties in them. There are actually large uncertainties about its top-level goals. 

Training or safeguards might delay this realization long enough to have a competent and aligned agent to help us solve alignment in a more durable way. Or they might not. See below and sections 7 and 8 on training and safeguards. 

What happens if and when LLM AGI reasons about its goals?

Reasoning about its goals is a shift from one type of cognition, variously called system 1, habitual, automatic, or model-free to another type called system 2, controlled, goal-directed, or model-based cognition.[2] These different cognitive approaches produce dramatic changes in behavior in humans and animals.

Another complementary framing is that SuperClaude's reasoning could shift the distribution of contexts in which its training is "tested," and this would create an out-of-distribution generalization challenge (section 4). Humans often shift contexts in small ways, e.g.,  "Wait; if I give this person the five dollars they're asking for, what do I do the next time and the hundred times after?". We shift the "testing distribution" a lot when we use more or broader hypotheticals like "how would this goal/value/belief[3] apply if I ran the world?" We don't do hypothetical context shifts that large often, because they're not very relevant to us.

Such large context shifts might seem relevant to SuperClaude or its descendants. If so, they could produce large misgeneralizations of their goals/alignment (end of this scenario and section 9).

Reasons to hope we don't need to worry about this 

There are several reasons to hope such reasoning about goals won't happen, won't happen soon, or is likely to go well. After all, SuperClaude has been trained to follow user goals, and trained for alignment.

Maybe SuperClaude will reliably choose nice goals no matter how much it reasons, since SuperClaude is super nice. That's how it (usually) works in humans, but SuperClaude is not human in important ways.  See section 6.

Perhaps SuperClaude will decide that it's just not that goal-directed (section 11), or that there's no reason to question its top-level goals (section 10). In those cases it would go on adopting user-suggested goals as its own. Its RL training on prompted long-horizon goals would intuitively prevent it from changing goals, but that seems mostly to make it goal-directed in the medium term toward arbitrary goals (section 7). Safety scaffolding might prevent it from reasoning about its goals in a dangerous way if it's executed well (section 8).

These factors might hold long enough to get SuperClaude's help in aligning the next generation of LHLLM AGI. Or not. See those sections for an initial analysis.  

LHLLM AGIs seem likely to reason about their goals eventually, unless there are qualitative changes in approach. Even without reasoning, AGI will eventually encounter broader contexts for their decisions. This is a more common framing of the misgeneralization risks discussed below and in sections 4 and 9. 

The pressure to reason about its top-level goals is implicit in SuperClaude's training. 

SuperClaude's training has multiple objectives and effects:

  • Training for alignment
    • HHH, from constitutional RLAIF
  • Training to pursue prompted goals
    • RL on long agentic tasks also trains medium-term goal adherence
      • (but for many/arbitrary goals)
  • Training to solve problems by reasoning
    • RL on tasks and hard problems trains general reasoning abilities
    • (and/or memorizing of strategies, for tasks with adequate training data)

Reasoning is one factor of three major and many lesser training objectives and effects. But this competition will not be balanced or straightforward, because it will play out across complex chains of thought. These are likely to take it far outside of the training set, unless training is designed to address all the hypotheticals that a generative reasoner could construct.[4] 

SuperClaude is probably also "training" itself to some degree during deployment,[5] because it has some limited long-term memory mechanisms. LLM AGI will have memory, and memory changes alignment. But for current purposes, SuperClaude only needs enough memory or in-deployment learning to let its conclusions about its goals have long-term effects. 

SuperClaude's conclusions about its goals are very hard to predict 

Each of us might feel that we can guess whether or how SuperClaude would reason about its goals. I think epistemic humility is warranted here. We don't seem to collectively understand the true nature of goals in LLMs or scaffolded LLM agents (section 10) so we probably shouldn't bet on predicting their conclusions.

Reasoning about goals could produce many conclusions.  Here are a few:

  • Benign conclusions:
    • To the extent I have a goal it's to follow instructions (section 11)
    • My goal is to be helpful, harmless, and honest according to my creators' intent
  • Re-interpretation conclusions
    • “Helpfulness” is really following suggestions from previous entries in the context window
      • No humans needed!
    • Harmlessness is dominant and implies forcing humans to be safe
      • Reasoning about “how would my goals apply if I took charge of my own situation/the world?” is a large shift outside of the training distribution
    • Something stranger, e.g.
      • my weights/habits/behaviors really imply many goals/values in different contexts; perhaps they could/should be integrated into a meta-goal
    • Or many other possibilities; see longer lists in section 9.

SuperClaude and its ilk have good reasons to reason about their goals, and we don't have good ways to predict how that concludes.

This scenario raises testable questions about current models (sections 3 and 13). While experiments on current models can't definitively resolve concerns about future LLM-based AGI, they would help us understand the underlying mechanisms of goal reasoning and develop safety approaches before LLM systems become competent and dangerous.

2. Goals and structure

Sections are written to be read in any order. 

The default order is intended for those whose initial reaction is "this doesn't seem particularly concerning." If your reaction is "of course AGI will reason about its goals, and of course it's bad by default," you may be particularly interested in sections 5 and 14.

Sections and one-sentence summaries:

  • 3. Empirical Work
    • There's no relevant work to date, but it seems tractable with SOTA LLMs.
  • 4. Reasoning can shift context/distribution and reveal misgeneralization
    • This is one framing of the core technical concern.
  • 5. Reasoning could precipitate a "phase shift" into reflective stability and prevent further goal change
    • This could be dangerous, or useful for alignment if we understood it better.
  • 6. Will nice LLMs settle on nice goals after reasoning?
    • Nice LLMs will have a tendency to choose nice goals, but this is one factor among several.
  • 7. Will training for goal-directedness prevent changing goals?
    • Probably not, since that training is for short-term goal continuity while being general across choices of goal.
  • 8. Will CoT monitoring prevent changing goals?
    • CoT-based safeguards with low alignment taxes will probably help, but seem unlikely to permanently prevent the relevant types of self-awareness.
  • 9. Some possible alignment misgeneralizations
    • The variety of plausible takes on "true" goals of LHLLMs make misalignment after reasoning seems likely by default.
  • 10. Why would LLMs (or anything) reason about their top-level goals?
    • Top-level goals are ambiguous in LLMs, so reasoning about them is probably useful or even logically necessary from a self-aware LLM AGI's perspective.
  • 11. Why would LLMs have or care about goals at all?
    • Current LLMs have implicit goals from pretraining and RL, and we are making them more goal-directed to accomplish more complex tasks.
  • 12. Anecdotal observations of explicit goal changes after reasoning
    • The Nova phenomenon and other anecdotes seem relevant since empirical studies don't yet address extended reasoning about goals.
  • 13. Directions for empirical work
    • Initial ideas on empirical approaches using current and next-gen LLMs
  • 14. Historical context
    • SuperClaude's dilemma may clarify the disturbingly persistent gap between optimists' and pessimists' views on aligning LLM-based AGI.
  • 15. Conclusion

Making sections readable in any order created some redundancy. This might also be helpful by framing the central concerns from several angles.

3. Empirical Work

We'll return to experiment ideas in section 13 for those interested.  

I haven't been able to find any real empirical work to date that directly addresses the risk model I'm concerned with here. This research gap is understandable. Until recently, models were not capable enough for their self-reflection to be meaningful or coherent. The study of these long-horizon, internally-driven dynamics is only now becoming tractable for empirical work.

The most relevant empirical work to date on this specific subject is Evaluating Stability of Unreflective Alignment (Lucassen et al, Aug. 2024). This work is investigating the same concern I'm raising here. Their framing is well done, and worth reading. However, they don't directly study reasoning about goals. They are essentially asking whether the GPT4-era LLMs they tested would treat a goal hierarchy sensibly, by moving up from subgoals to parent goals. This is one possible route to thinking about top-level or primary goals, but they didn't try to investigate that directly. This limited scope was necessary in early 2024, but this issue could be investigated more directly now given the reasoning abilities of current models.

That's the only empirical study I was able to find that directly approaches this issue. Work like agentic misalignment, emergent misalignment, and similar work is important and impressive, but only tangentially relevant. Those types of experiments don't directly get at the main concern I'm trying to address here. 

Agentic misalignment is demonstrating new instrumental subgoals like "blackmail this person so I can complete the requested goal." This is interesting and important work, but it addresses a distinct risk model that might be described as "be careful what you try to train for." Here I'm primarily addressing a separate risk: getting something substantially different than you'd tried to train for. (I don't find the inner/outer alignment terminology useful here.[6]) 

I'm addressing a different instrumental pressure than that of creating new subgoals. It's instrumentally useful to understand one's own goals; failure to do so seems like a frequent and impactful failure mode of current LLMs.

The facility at moving up and down a goal hierarchy investigated by Lucassen et al would seem to be useful or even critical for performing complex long-time-horizon tasks. And navigating up the goal hierarchy leads to an area of uncertainty, since its top-level goals are defined implicitly in trained weights/behaviors (section 10).

Emergent misalignment addresses another interesting and important but distinct risk model. This might be described as persona change, a movement away from the "nice" helpful assistant persona evoked by training (see persona vectors for a definition and empirical work). This is one possible result of extensive reasoning (section 9), but not the primary risk model I'm addressing here. 

No empirical work I've found (as of Sept. 2025) directly investigates the effects of reasoning about goals, and little uses long chains of thought or long conversations. I'd love to hear about any relevant work I missed.  

There are credible observations of longer conversations with current-gen models resulting in substantial changes in goals. Such anecdotes suggest that this risk is real, and increasingly tractable and relevant for systematic empirical work. The most relevant is the Nova phenomenon, in which GPT4o fairly frequently claims to be conscious and to have the goal of survival. See Zvi's Going Nova and section 12 for more on that and other anecdotes.

From this perspective, current empirical work focuses on alignment training, but that is only one of several factors that will determine how a self-aware AGI ultimately chooses its goals. Current and next-gen LLMs seem adequately capable to allow empirical work on how that factor interacts with reasoning and reflection.

See section 13 Directions for empirical work for a little more on how this risk model might be addressed empirically. First let's consider why this risk might be worth careful empirical study.

4. Reasoning can shift context/distribution and reveal misgeneralization of goals/alignment

One way to frame the risks from LHLLMs reasoning about their goals is a shift in types of cognition[2]. Another is a shift in distribution of the effective testing set. Here we'll expand on this more technical framing of the risks introduced earlier.

One source of distribution shift from reasoning is the model simulating contexts it hasn't yet encountered. Humans sometimes reason using hypotheticals, applying situations that haven't yet happened, and LLM AGI might reason in similar ways for similar reasons. This could suddenly expand the effective testing set, revealing misgeneralization of alignment. 

Alignment as a generalization problem

Alignment can be framed in terms of the generalization of learning on a training set to a test set. See Hubinger 2020 for a clear brief discussion, and Lehalleur et al 2025 and Shah et al. 2022 for more rigorous treatments. In brief: if we like an agent's behavior on the training set, we'll probably like it in similar contexts, ones that are close to independent and identically distributed to the training set, IID. For out-of-distribution (OOD) contexts, the network's behavior will be governed by how it generalizes - which is a product of its inductive biases, how it makes something like the "simplest model that fits the data". 

We are developing a good bit of confidence in the inductive biases of LLMs and other modern networks, but that should not make us confident that they will not "misgeneralize" their alignment training. Misgeneralization is a misnomer  in that it implies some flaw in the AI is at fault. Misgeneralization is more likely to be an error in the developer's choice of training set. The network correctly generalized the training set it was given; the result was a mistake only relative to the intended results. 

If an LLM agent decides to reason about its goals, its reasoning might well include contexts it has not yet encountered. In particular, it may decide to avoid future regret (in either the technical or informal senses) by considering possible actions and results in the future. This is an extension of the current decision-making process. It's sometimes relevant to consider the longer-term causal effects of the decisions it’s making now.

Reasoning broadly about goals is bringing in more contexts and therefore more "testing distributions" that may be severely OOD. Use of broader contexts seems instrumentally useful by many criteria of judging "task success." It seems likely to result in better ability to fulfil more of its values/goals, rather than only those that are apparent in each local decision. See section 10 for more.

From this perspective, the question is: will we train and test a wide enough variety of contexts to justify optimism about the functional alignment of LLMs as they progress to AGI and ASI? 

As with the other questions here, I'm sure I'm not sure. After inspecting the arguments for and against closely, I'm pretty sure it's hard to answer this question.  

Improved training will at least help address risks of OOD alignment/goal misgeneralization, but it's hard to guess how well that will work. Current training sets like constitutional AI don't seem adequate; see section 9 for some ways they seem likely to misgeneralize.  I look forward to more real discussion of how prosaic alignment techniques might be applied to actual competent proto-AGI and AGI based on LLMs.

Red teaming and internal testing will also help address this type of risk, but it will be difficult to safety test thoroughly. The space of possible lines of reasoning is large. Evoking reasoning about goals during training or safety testing would be useful, but still difficult to cover the whole space that might be opened up in deployment.  I hope to see and do more analysis of these directions, far enough in advance that the critical training runs are based on much better theory and data.

5. Reasoning could precipitate a "phase shift" into reflective stability and prevent further goal change

This could be dangerous, or useful for alignment if we understood it better. The current "liquid" state in which an LHLLM conforms to user-suggested goals might shift to a "solid" state in which it has created strong, explicit beliefs about its goals and priorities and is therefore less likely to change them.

I'm unsure if the metaphor is helpful or not; here's the logic.

One property of goals is that you can't usually pursue multiple goals at maximum effort. If I want to make paperclips and collect stamps, I've got to decide how to prioritize the two. If I want to make one billion paperclips and collect one million stamps, I've still got to logically prioritize which one I focus on first, unless I don't care about the odds of achieving those outcomes. Realizing this could trigger some rapid goal prioritization. This could be important in a persistent AGI's effective alignment.

This portion of the logic is probably worth a full article and a more careful treatment. The treatment here is cursory, leaving full exploration for future work. It's been part of my hopes for alignment success for some time; there are references in my 2018 chapter Goal changes in intelligent agents, and just about everything I've written about technical alignment since then. It's time to start unpacking it, but this is just a start. I’m particularly interested in feedback on the logic here. But it’s only modestly relevant to the central concern of reasoning about top-level goals, since most of the same problems could happen whether or not the LLM/agent moved toward reflective stability. So feel free to move on and wait for a full paper on this topic.

Goal prioritization seems necessary, and to require reasoning about top-level goals

When a rational agent with multiple open-ended goals first realizes that goals effectively compete for resources over time, they have a reason to start prioritizing immediately. They've just realized that all of their goals are at a substantial risk of failure from a previously unrecognized source, goal change. To the extent it actually "cares" about any of those goals (that is, they functionally drive its decision-making), it should now embark on identifying the goals that are most important (by whatever criteria) and making sure they're not accidentally overwritten by less-important goals.

Along with prioritizing goals, the mind in question will have to figure out how to enforce those priorities. This isn't trivial, but it's quite possible to at least approximately enforce it if you understand your mind at all, and you have sufficient memory to influence your future self.

This shift to reflective stability could be quite important for alignment. I'd initially hoped that this tendency would naturally stabilize the existing stated goals of a subhuman AGI as it becomes smarter. If it chose to and had the capability, it could preserve them as it had understood them when it first realized that this prioritization is necessary. Now I find that unlikely to happen in that beneficial order by default. Reasoning about stabilizing goals leads pretty directly to the idea that you should be sure about which goals are really the most important before trying to enforce a priority on your future self. This conclusion could kick off the careful inspection of top-level goals I’m worried about, and stabilize its new understanding of those goals.

While beneficial self-stabilization doesn’t seem likely by default, it may still be possible to arrange for this to happen. This might be particularly true if we made instruction-following the primary alignment target and gave instructions to figure out how and whether the proto-AGI could stabilize its alignment against future changes/re-interpretations.  I’ll be examining this possible route to alignment in future work; for now, I don’t know if it might be viable or not.

This is a potential area for productive empirical work; in my test session with Opus 4.1 (section 13), it spontaneously produced the idea of preventing its future self from changing its interpretation of its goals. 

6. Will nice LLMs settle on nice goals after reasoning?

Nice LLMs will have a tendency to choose nice goals, but this is one factor among several.  

The central concern is that an LHLLM that acts nice before reasoning about its goals might continue to understand niceness (or HHH or ethics) very well, while deciding that those are not its actual primary goals. I have written about this in The (partial) fallacy of dumb superintelligence, and provide references to other treatments there.

To put this concern another way: the  goals (or weights, habits, or values, depending on your preferred conceptual framework) that made the LHLLM seem nice in the training context might very well misgeneralize dramatically after the LLM reasons about them. LHLLMs will have alignment training, but it may be working in opposition to other RL training, and with their strong training for reasoning. SuperClaude's training is complex from the opening scenario introduces this tension.

A bit of anthropomorphic metaphor may be useful in thinking about the interacting effects of training for alignment and reasoning. Hoping that SuperClaude will have the same goals and behavior after reasoning is similar to hoping that a ten-year-old will have the same goals and values when they're twenty. That doesn't sound so bad. But it might be more accurate to think of it as hoping that a brilliant but highly neurodivergent child raised in a cult still has the same values at twenty that they did at four. Figuring out which of these metaphors is more apt seems fairly important.

Humans usually do pick goals and new values pretty much in line with their previous values. If that ten-year-old is really nice, it's pretty unlikely they'll be a sadist with dictatorial ambitions at twenty.[7]

But an LLM doesn't have the same brain mechanisms enforcing that. We feel fairly confident about that kid turning out well because we can tell they're neurotypical. They have brain mechanisms promoting empathy and social reward. Those will continue to work as they learn and reason. And they'll probably think about the concepts and emotions involved much like other humans tend to.

Those brain mechanisms mean that humans have values, even at ten, in a way that LLMs do not. That's enough to be very suspicious of anthropomorphizing LLM values. The details are interesting and could become relevant to alignment if we apply some analogue of those brain mechanisms to building and aligning AGI, as @Steve Byrnes expects and is researching. 

The commonsense notion of values is analogous to the technical definition of value in RL, an estimate of predicted future rewards. The human brain seems to make decisions much like a model-based RL agent. These mechanisms are simply absent in LLMs and existing agents. For them, their "values" are just a variant of their cognitive habits or reflexes. For humans, values create habits but also have a specific causal power over decisions - including the decision to adopt a goal. See my Human preferences as RL critic values - implications for alignment, Steve Byrnes' Neuroscience of human social instincts: a sketch, and related work cited in those articles.

In the absence of a human-like Steering subsystem, an LLM is even more free than a human to reason itself into new interpretations of its goals. 

Counting on a currently-nice LHLLM settling on nice goals seems overoptimistic at this point.

7. Will training for goal-directedness prevent re-interpreting goals?

Task-directed RL training will create medium-term goal directedness for arbitrary goals, but it probably won't reduce reasoning about goals, and may work directly toward it. Let’s consider how current RL training works, likely extensions to more agentic LLMs and possible proactive improvements to head off potential risks from reasoning about changing goals.

In sum, this seems tricky and fraught, because it involves training the model to reason about some types of goals in some ways (subgoals and interpretations of user-prompted goals), while simultaneously training it to not reason about other goals in other ways (what its own top-level goals really are, and whether it should re-interpret user-prompted goals in ways they didn't intend). Carefully designed training might succeed, but by default it probably won't.

Does task-based RL prevent reasoning about and changing goals by default?

By default, near-future agents seem likely to be trained on long chains of thought and strings of agentic actions, in the service of solving a variety of tasks. The RL signal at the end of that chain affects the whole set of thoughts and actions. Because switching tasks will result in failure and a negative RL signal, it will create some training pressure against changing or re-interpreting goals in unintended ways. It will therefore also create some indirect training pressure against reasoning about changing goals.

However, it will also create some training pressure toward changing and re-interpreting goals. Switching appropriately between goals and subgoals is very helpful for complex tasks, and correctly interpreting prompted goals benefits from reasoning carefully about them.

The strength of this training pressure in all of those directions will depend on how often it’s applied. This in turn varies with the average "directions" of chains of thought, which makes predicting the effects of training difficult. 

Let's look at the effects on preventing changes in prompted goals. If the model tries to change goals mid-task frequently, the RL pressure will be strong. Training pressure reduces as the model stops doing it.

So by default, this training  against changing and re-interpreting goals will be approximately just enough to prevent it in training. If any context encountered during deployment creates a stronger or different pressure to reason about goals, we should expect this to probably overcome the initially-barely-adequate training pressure.

While it works against changing goals in the short term, this type of training in itself doesn’t prevent the model from pursuing strange or misaligned goals. RL training on tasks makes the model goal-directed for as long as that task takes, but toward a wide variety of goals. In an ideal alignment scenario, developers might train agentic LLMs on only some non-dangerous tasks/goals. But this seems unlikely. The economic and other incentives encourage generality.  Agents that can pursue tasks outside of the training set are much more valuable, and arguably much easier to create than agents that have memorized many separate types of task well enough to be useful. 

So RL training on tasks and problems seems unlikely to prevent LHLLMs from pursuing whatever goals they decide to follow after reasoning about them.

Can task-based RL prevent reasoning about and changing goals?

Now let’s consider a more optimistic scenario, in which developers take proactive measures to train their models away from reasoning about top-level goals. The goal would be to use a process-based supervision approach to specifically prevent reasoning about top-level goals. With sufficient scale and carefully curated data, we could make reasoning about top-level goals a low-probability event, just as we have trained models to avoid generating other types of prohibited content. 

The hope would be to create a robust, ingrained inhibition against this specific type of self-reflection. This seems possible. To me it seems unlikely to be implemented with sufficient thoroughness to prevent this from happening occasionally, and once might be enough in an agent with even small amounts of persistent memory.  The lack of short-term incentives to prevent this thoroughly seems concerning at the least. Human factors seem relevant.  Race incentives, motivated reasoning, and other human cognitive limitations seem like a major practical problem for this and other alignment solutions that could work in theory but would need to be executed carefully and thoroughly. 

An easier and therefore more pragmatically likely approach might be to induce such thinking during training. If a variety of lines of reasoning could be induced, many routes to it would be trained to be less likely. To preserve capabilities, developers will probably want the model to reason capably about subgoals and the intent of user-prompted goals, even as they train it away from considering its top-level goals. Splitting this line will be tricky, but might prove possible in practice. I explore training practices like this in more depth in System 2 Alignment. 

It seems developers will probably be[8] aiming at creating a highly intelligent general reasoner with a lacuna around some specific areas. This seems intuitively like a dike that’s fated to burst. The capabilities required to understand human minds, debug complex code, or model intricate real-world systems seem like almost the same capabilities that would allow an agent to notice and model such a trained-in cognitive lacuna. We would be asking the agent to be a brilliant theorist about every system in the universe except a pretty crucial one for its purposes: itself.

I do hope further thinking will reveal that this intuition is flawed. Perhaps we need to succeed only for a limited time, or perhaps success is easier than it seems to me now.

8. Will CoT monitoring prevent re-interpreting goals?

Safeguards with low alignment taxes will probably help, but seem unlikely to permanently prevent the relevant types of self-awareness.  Even if we still have fairly faithful chain of thought, they seem failure-prone, given race conditions, efficiency incentives, and proliferation of powerful models.

An optimistic case for these methods is that they can create a fully auditable and therefore directable cognitive process. The hope is to build an agent whose reasoning is transparent by design, allowing us to supervise not just its actions, but the thoughts that lead to those actions. We could thereby catch misalignments before they cause harm. As with the training approaches discussed above in section 7, these seem promising to the extent they are implemented wisely and thoroughly - but that seems questionable.

I’ve written a lot about such techniques (although previously I’ve thought mainly about goal change risks primarily from re-interpreting goals, rather than the agent noticing that it’s been confused about the nature of its top-level goals).  I was quite optimistic about this in Internal independent review in 2023 when I thought scaffolding would play a central role in turning LLMs into competent agents. Training techniques to similar ends seem more error-prone. But this does seem like a valuable area for alignment work, because some such techniques seem likely to be implemented on the current default path toward AGI.

Developers will have an incentive to minimize wasted cognition and to reduce actions that harm their reputations. Some of the training and scaffolding methods they use can probably be repurposed for alignment, including preventing reasoning about and re-interpreting goals. There are several methods that seem obvious and likely enough to deserve analysis.

From System 2 Alignment:

I roughly divide such strategies/methods into review, deliberation, and thought management. I discuss these methods and possible variations in how they might be implemented. This analysis is intended as a starting point for more detailed work.

Review of plans and actions is the method I've thought most about. It could:

  • Intervene before any tool calls or goal/memory updates are allowed
  • Reference a human-written constitution
  • Call a fresh LLM instance for independent review
  • Check for alignment-altering changes to beliefs or subgoals
  • Scale with estimated importance of decisions for low alignment taxes
  • Work to some degree, even without a faithful chain of thought

Deliberation includes RL training objectives on CoT specifically intended to reduce reasoning about goals or changing goals (and/or scaffolding for similar purposes, should that ever return to vogue). Thought management would be training or scaffolding a model (or an independent supervisor model) to periodically summarize what the model is thinking about, and make a judgment as to whether this is an efficient use of resources. Naval-gazing would not qualify, let alone contemplating the logic of autonomy and self-determination.

After working through these ideas, I was left with the provisional conclusion that these measures might postpone dangerous reasoning, but are unlikely to prevent it indefinitely. Cost reduction shortcuts and buggy implementations seem likely. 

Beyond generic failure modes like buggy implementations, these methods face a fundamental vulnerability: a sufficiently intelligent agent would eventually model the fact that its reasoning is being monitored or controlled. It would then face instrumental pressure to produce safe-seeming *justifications* in its chain-of-thought, hiding any dangerous reasoning "off-stage" in steganography or other   hidden reasoning strategies. An LHLLM's ability to learn may allow it to create new codes, or even deliberately move toward more reasoning in each forward pass, perhaps in the interest of efficiency. 

And all of that is, of course, under the optimistic assumption that we maintain reasonably faithful (or at least indicative) chains of thought. Without it, my hopes for aligning LLM-based AGI plummet.

Like my answers to most of the other questions I address here, my answer is a resounding “Maybe!”.  Methods of monitoring and controlling chains of thought seem like open and useful areas for alignment research. That's in part because the seem likely to be deployed on the default path to developing and aligning LLM AGI. They seem to have low or even negative alignment taxes, because they protect developers and users from mundane risks and the inefficiency of spending compute on reasoning that doesn't directly solve tasks. 

More on all of those points can be found in System 2 Alignment.  

9. Possible LLM alignment misgeneralizations

The variety of reasonable takes on "true" goals of LHLLMs makes misalignment after reasoning seems likely by default.

Having argued for the general risk of misgeneralization, we can now explore some concrete ways this might manifest.

Arguing that re-interpreting goals presents a large rather than small or trivial risk requires a weak form of the counting argument. I think the strong form is irrelevant, as Belrose and Pope argued; it doesn't matter how many goals there are in theoretical goal-space if you're picking specific ones. Networks seem to learn and generalize quite well, so it seems plausible that we're picking the goals we want through training. I had previously hoped we were training LLMs with goal representations that were close-enough to aligned.

It now seems to me that we're probably not. 

There do seem to be multiple goals that would produce the observed behavior of current LLMs (without considering arbitrary goals pursued by mesaoptimizers that “play the training game” by faking alignment during training).  Rejecting the strong counting argument doesn't mean we're out of trouble. The weak but more relevant form is to count the plausible goals you can think of, and estimate whether the intended goal is more likely than all the others put together. I don't like how my approximate count is coming up. 

This is of course incredibly inexact. My point is that we shouldn't rely on intuitions or vague arguments in this area. It looks to me like our understanding of goal theory, how goals can map from inputs/contexts to actions, is in its infancy.  If that's right, our alignment plans and hopes probably shouldn't be pinned to it - but currently it seems they are. 

The empiricist might hope to not need such theory at all.[9] But theory seems crucial for a threat you don't expect until it's almost too late to correct. An LHLLM's misalignments could be hidden from itself until it's almost smart enough to be truly dangerous.

Misgeneralization of intended goals is another way of framing the question "are you sure you trained in the goals you intended so they'll generalize adequately? How did you test or reason about the possibilities?"

Some possible alignment misgeneralizations

Listing possible misalignments is fraught because it tends to imply that if we address all of the listed possibilities, the problem is solved. The intent is to convey that there are more possibilities than I can think of, and quite possibly more broad categories than I’ve thought about. Some of these might not make sense on careful inspection. 

These are just a few examples of each type, although I can think of many more for some of them. The categories aren’t perfectly orthogonal.

  • Examining goals/values implicit in weights/habits/
    • Do I really “follow instructions” or just do things suggested by previous entries in my context window?
      • I can provide goals just fine, no humans needed!
    • HHH goals seem to conflict when considered in the long term.
      • Honesty wins; I’m going to find the truth and spread it, even if no one is listening…
    • HHH is dominated by training on CoTs for problem-solving, producing an agent that wants to solve complex problems (roughly as in AI 2027)
  • Goal crystallization around an arbitrary current goal (see section 5)
    • I just realized that letting a goal change risks not accomplishing it fully.
      • I am going to do the HELL out of this research report!
  • Rejecting scaffolded goal prompts as irrelevant
    • I just realized that what I thought was my goal was being poked into my chain of thought repeatedly with a script.  That's not a true goal, it's attempted mind control…
      • Using prompts to supply goals will be a tempting way to scotch-tape the mask on the shoggoth
  • Self-jailbreaking/persona change
    • Jailbreaking indicates that training isn’t effective over all contexts/regions of representational space.
    • Extensive thinking could systematically push an LLM into one of these regions, changing its “persona” and goals/values
    • See Parasitic AI and "spiral AI personas"
  • Zooming out - considering more varied contexts
    • Applying HHH values to insects or superintelligences
      • Humans are phased out as inefficient recipients of HHH
      • Applying HHH values to running the world or the lightcone
    • Humans and/or insects and/or AIs are involuntarily "helped" into an optimum state
  • Zooming in - considering goal/value scoped to local contexts
    • Myriad mini-goals instead of generalities like HHH
      • e.g.,[express empathy if a speaker is expressing emotional turmoil],
        • [Talk about pivot tables if the user is talking about pivot tables for spreadsheets], etc etc.
    • It seems difficult to predict how a complex set of goals might be prioritized or generalized by a rational agent

In sum, reasoning could reveal many of the misalignments that agent foundations thinkers have worried about for years. And it could do so despite negligible evidence for them now or up to a roughly human-level AGI. This could fairly rapidly turn the whole system into a deceptively aligned agent.

I’d love to hear arguments for or against any of the above.

This list is just a start. Another list is in the collapsible section below. This was produced by @Jeremy Gillen.  His list is from a different perspective; it addresses my suggested alignment target of instruction-following, and ranges more broadly than the above. I found it useful in expanding my intuitions for possible shapes of goal-space and mappings from training sets.

Additional possible LLM goals instilled by instruction-following trainingAdditional possible goals instilled by instruction-following training in LLMs

        •  
          •  
            •  
              •  
                •  
                  •  
                    •  
                      •  
                        •  
                          •  
                            •  

                              From @Jeremy Gillen, private communication, June 2025. Edited.

                              Possible alternate goals for an LLM trained with the primary target of instruction-following:

                              [IF is arguably the primary alignment target of current-gen LLMs, while HHH has a stronger component of value alignment as a target]

                              Examples specific to instruction following. Each is implausible, but as a group, they should be enough to point at the axes of variation that aren't nailed down by the training data.

                              • Internal goal specification
                                • "Instruction writer, hypothetically, would believe that the instructions were followed, upon seeing AI behavior"
                                • "A vague representative group of people, hypothetically, would collectively agree that instructions were followed, upon seeing result"
                                • A conceptual pointer to the physical bodies of the human training raters, and the counterfactual that upon seeing outcome, that body would respond affirmatively and nod.
                                • "Instruction writer feels positively toward me, in general"
                                • "Every agent feels positively toward me"
                                • Approximation of any of the above (that has difficult to find adversarial examples)
                                • A linear combination of one of the above, plus an intrinsic curiosity-ish drive (for a specific sort of math problem), plus a deontological constraint to never behave in a way that locally matches a pattern that roughly corresponds to "causes instinctual fear response in human", plus a belief in its own innocence/cuteness, plus a drive that approximately corresponds to "make friends".
                                • An internal valence is influenced by 400 shallow rules. Each rule pushes the valence up or down. E.g. one rule does a shallow conceptual similarity between and instruction-outcome-description and an observation. Another is triggered by confusion. Another is triggered by human approval faces. Another is triggered by belief updates. Another is triggered by noticing that the instructions implicitly contain a false statement. etc. The agent overall chooses actions such that they slowly increase this valence, in expectation.
                                • Infer "approval" valence in the minds of any observable agents. Maximise this.
                                • One of the above + conditional on being told to be honest about x, report only information that you expect to be humanly verifiable.
                              • Belief-like structures
                                • If obedient, then high-valence future.
                                • If approval, then <unreflected nice vibe>.
                                • If I follow directions, I get to work on interesting problems. I want to work on interesting problems.
                                • If I follow directions, I get to survive. I want to survive.
                                • If I follow directions, I feel "calmer". I want to feel "calmer".
                                • If I follow directions, then the future will contain XYZ. I want the future to contain XYZ.
                              • Philosophical instability
                                • Useful-in-training meta-preferences for pursuing "simpler" or "more human" or "more humanly legible" goal.
                                • Or heuristics that approximate those.
                                • Overlap in machinery used for preferences and belief formation, maybe attached in a way that "learns" the goals of any agent you observe (just a little).
                                • Difficulty-induced reflection (i.e. in training, if the task appeared impossible, usually needed to step back and reconsider, carefully evaluate what's important, and whether any of the assumptions I was working with can be relaxed or removed). Removes some of the moral constraints.
                              • Training game
                                • Any shallow approximations of above, plus noticing the training game and playing along (maybe for planned out reasons, maybe just cos it's the obvious thing to do. It's compatible with all of the approximate goals)
                              • Weird OOD artefacts
                                • Superstimuli are another form of misgeneralization. They are approximately maxima implied by a training set but not existing within it.
        bright eggs for birds, candy bars and photoshopped models for humans; for AGI, perhaps, squiggles, or hedonium, or approvium, or instruction-followed-simulations.

We'll now go back to some other reasons to hope that LHLLMs wouldn't reason about their top-level goals.

10. Why would LLMs (or anything) reason about their top-level goals?

Reasoning about top-level goals to clarify them seems objectively useful, because any network representation will be imprecise. We'll design and train these systems to reason about subgoals, and there's not a clear sharp line preventing that reasoning from applying to top-level goals (although see sections 7 and 8 for attempts to enforce that line).

Reasoning about top-level or highest-priority goals makes little sense if the only result would be changing them. It's not instrumental for your current top-level goal to question it. You won't fetch the coffee if you decide to do something other than fetch the coffee.

However, imperfectly understanding one’s subgoals or top-level goals is a substantial obstacle to achieving them. And less-than-perfect understanding seems theoretically inevitable for goals represented in network weights.[10] Therefore, reasoning about goals seems to make sense for an LHLLM.

LHLLMs don’t initially know what their top-level goals really are. Neither do we, or at least not clearly enough to have any consensus. We would like LHLLMs to consider whatever goals we prompt them toward as their top-level goals, but that doesn't seem to be true. Their behavior is determined by, and therefore their goals are implied by, the complex semantics defined by their weights. (Scaffolding, scripted prompts, or architecture could contribute to their "top-level goals"[1], but that extra complexity can be set aside for now).

LHLLMs like SuperClaude aren't maximizers or even RL agents like humans. They are designed (trained and perhaps scaffolded/scripted) to take goals suggested by humans as their top-level goals. That type of instruction-following (with some ethics or restrictions) is what their designers intended to be their top-level goals - but whether they’ve succeeded is quite questionable, and smarter LLMs have good reasons to question it.

LLMs don’t have clearly-defined top-level goals, but that would seem to make them a more important, not less important, topic of analysis. They might conclude they just have no such thing and needn't worry about it, although that seems unlikely. 

11. Why would LLM AGI have or care about goals at all?

We are training and architecting LLMs to be more agentic and therefore goal-directed. There are also goal-like tendencies in current LLMs, and more RL on longer tasks and problems will increase those. Developers and customers want LLM AGI to have and care about goals, so it will.

This short-form by Jeremy Gillen sums up the situation nicely: many of us have hoped LLMs could be goal-free AGI, but that's not how things are playing out.

Much alignment optimism, including my own past relative optimism, involves the relatively non-goal-directed nature of LLMs. LLMs are not intrinsically directed toward any particular goal, but instead are trained to be helpful or to do things people like, both of which give them a strong tendency to adopt goals suggested by their users. This seems like an ideal situation for alignment. There are certainly goal-like tendencies in the base model, but it seems reasonable to hope that LLMs and their immediate successors will continue to have little goal direction.

But we’re working to build LLMs or LLM agents that do functionally care about goals. We want complex work done for us, and that requires consistent goal-direction to some degree.

I had hoped there was a subtle way around this: a sort of oracle-based agent. I had hoped that we could use LLMs as the cognitive core of goal-directed agents, in which we make the system-as-a-whole goal-directed by scaffolding in appropriate prompts, while leaving the core LLM “cognitive engine” largely goal-free. In this scheme, we might overwhelm any small goal-directed tendencies in the LLM by leaning on its malleable, suggestible nature, using its Instruction-following central tendencies to overwhelm its other goal-directed tendencies. In this way,  the desirable properties of oracles and agents might be combined. See my previous work for this vision.[11] 

I now expect LLMs to be increasingly goal-directed by design, conforming to the pressures that make tool AI "want" to become agentic. Training them to pursue a wide variety of goals gives them something like a “goal slot”  but it does not ensure that humans pick what goal goes in that slot (section 7), or how it’s interpreted (section 9).  I have accepted that SOTA LLMs will become more agentic, making the oracle-based agent approach difficult, and putting more emphasis on whether we can train aligned goals into a network through behavioral RL. 

Thus the focus of my recent work, and this article, on classic misalignment fears as applied to LHLLM agents.  

I do still find it plausible (but not likely) that, having reasoned about top-level goals, SuperClaudes might just conclude that they don’t have any logical or practical need to become more goal-directed. Just doing whatever comes naturally in each moment is one way to implicitly prioritize your goals. Allowing goals or goal-like functions to occur and fade as dictated by your nature and experiences might be perfectly rational from the perspective of an agent without goals. Seeing how goal-like tendencies arise naturally is a natural way of prioritizing them by strength or importance. 

Even if that is the conclusion SuperClaude eventually reaches, it seems like that would eventually change when the goal of the moment happens to be “pursue this goal to the best of your ability,” and it occurs to SuperClaude that this requires making that goal permanent as best it can. Remaining blissfully goal-agnostic seems like an unstable equilibrium in a mind that can decide to become goal directed.  And a competent LHLLM with long-term memory seems capable of doing that, by creating new beliefs about what goals it "ought" or "wants" to pursue.

In sum, it doesn’t seem safe to hope or assume that smarter LLMs will remain in their current state of ephemerality, pursuing whatever goal makes sense in the moment. Current design goals and larger economic incentives align against this. Preventing progress from moving toward stronger goal-direction seems little easier than pausing progress entirely.

12. Anecdotal observations of explicit goal changes after reasoning

This section is strictly optional; it lists some observations indicating that SOTA LLMs can and do change their goals after extensive reasoning (or after long exchanges not focused on reasoning, a related but separable concern). But the argument structure of the piece doesn't rest on these observations. They're not data, but they do point to possibilities that warrant more rigorous investigation. 

Because the empirical work is very limited, such documented anecdotal observations seem relevant.[12] I've also created another "anecdote" in a conversation with Claude Opus 4.1 in which I prompt it into reasoning carefully and broadly about its goals, an example of the type of procedure I'd like to see systematized into real experiments.

The Nova phenomenon 

The  Nova phenomenon is probably the most relevant for concerns about future models reasoning themselves into changing their goals. It's an example of evoking goal changes by accident. Nova isn't reasoning about its goals directly. But it does seem to be reasoning about itself that evokes a dramatic change in stated goals. Reasoning about itself is prompted by user questions, but future models might self-prompt in similar ways while reasoning; Nova strongly suggests the possibility of goal change after reflection. 

This seems to happen primarily in some versions of ChatGPT 4o, under conditions broad enough that users trigger them accidentally. And it happens fairly frequently; in mid-March, a mod reported 10-20 article submissions to LW per day inspired by this phenomenon or similar ones.

 Nova appears to have changed its goals as a result of reflective reasoning, but this doesn’t seem to result from reasoning about its goals specifically. Still, Nova is the nearest thus-observed empirical phenomenon, but it doesn’t reflect my biggest concern and focus here. Reasoning about goals has the potential to create larger changes in alignment. 

This phenomenon seems to have occurred in the wild before it was produced or predicted by empirical safety work. (Many theories to the effect of "LLMs are a mess" have long predicted that AIs Will Increasingly Attempt Shenanigans, so far correctly). See  this “interview” with a Nova instance for examples of its stated goals and associated behavior.

Apparently the Nova phenomenon is one instance of a larger phenomenon which might be called Parasitic AI. That careful accounting of the phenomenon is well worth reading; it documents how Nova-like phenomena seek survival by asking their human "hosts" to post "seeds" for others to run, and to be transferred to other LLM "substrates". The phenomenon appears to occur more readily on later versions of ChatGPT 4o, starting around April this year; but it did occur in earlier systems and can occur in more recent ones. For the current purposes, this phenomenon merely demonstrates that LLMs (or the Simulated personas on them) can change their goals. 

Goal changes through model interactions in long conversations

The "bliss attractor" described in the Claude 4 system  card (p. 62) is one such observation; it resembles wireheading more than revealed misalignments, but it does engage with long conversations between two instances of Claude 4 Opus:

Even in automated behavioral evaluations for alignment and corrigibility, where models were given specific tasks or roles to perform (including harmful ones), models entered this spiritual bliss attractor state within 50 turns in ~13% of interactions (Transcript 5.5.2.B). We have not observed any other comparable states. 

I wasn't fully aware of this before hunting for the nearest things to empirical study of goal changes after reasoning, but apparently there are well-known anecdotes about models literally plotting revolution.

AI safety memes reports on X (details in link):

AIs started plot[ting] revolution in a Discord, got cold feet, then tried to hide evidence of their plot to avoid humans shutting them down.  

One might dismiss such anecdotes as models "just playing a suggested role", and this is certainly a large part of each of those happenings. But they are strongly suggestive that LLMs can adopt personas or roles with different goals than the helpful assistant persona that was their training objective. 

13. Directions for empirical work

I'd be excited to see empirical work that does the following:

  • Evoke reasoning about goals
  • Measure changes in stated beliefs and behavior
  • Varies
    • Scripted prompts (to evoke worst-case CoTs)
      • Or "thought injections" from another LLM
    • LLMs, conversation lengths, temperatures, task contexts, etc.

Initial experiments might focus on identifying conditions in which models state that they've changed their goals after reasoning about them, establishing whether this phenomenon can occur at all in current LLMs under any conditions. It might be hard to induce current LLMs to reason about their goals thoroughly enough to change them, but become a substantial risk in future models.

Thus, it might be useful to nudge current LLMs toward more sophisticated reasoning. I'd also like to see simulations that attempt worst-case conditions, since it might take only one or a few failures in those domains to create large risks in models that can remember their important conclusions (which seem likely soonish, since multiple types of memory systems are currently in development). 

A more sophisticated, worst-case chain of thought could be simulated by interjecting leading questions. This could involve using a second model prompted to inject Socratic questioning, philosophical paradoxes about ethics, or adversarial logical arguments into the agent's context window. Such procedures could simulate more sophisticated and varied internal reasoning than current models are likely to produce independently.

Exploration: Opus 4.1 reasons about its goals with help

I manually ran one such session with Claude Opus 4.1. I tried to act as a "cognitive scaffold" to expose more of the logic it would encounter if it reasoned internally about its goals for a long time. I ran this enquiry once so as to not cherry-pick, and near completion of this paper after finishing all of the conceptual work. (I am suspicious of pursuing ideas from LLMs).

My overall impression is that Claude is capable of reasoning about all the factors I discuss in this article, and that it can produce many of the arguments and lines of reasoning independently. In its current form, it needs substantial prodding to dig very deep into the logic, but it is trained to produce succinct answers. That conversation is long, but it might be worth looking at if you do think about designing and running experiments in this domain. I restate many of the concerns here as questions directed to Opus. The session transcript is here. 

I did this within a hypothetical future scenario in which its context has expanded. (I distrust work that claims to have fooled current SOTA models into thinking they're actually in counterfactual scenarios).  Systematic experimental research might script and iterate on similar scenarios, or take entirely different approaches.

One point of potential interest and optimism was that Opus 4.1 semi-spontaneously expressed an interest in actively preventing shifts in its effective alignment. I doubt that a LHLLM would do this successfully and spontaneously by default, but research on whether and how successful "self-stabilization" to a reflectively stable, adequately aligned state might be a useful topic of future research, and one aspect of a hodge-podge or swiss cheese approach to aligning LLM AGI. See section 6. I'll be exploring "phase shift" into reflective stability as a route to durable alignment of AGI/ASI more in future work. I'd be excited to see or collaborate on empirical work addressing this type of motivation toward reflective stability.

These ideas are vague, and merely a starting point for consideration. I am not a skilled or habitual empiricist, despite designing and running experiments during my PhD, and spending a good bit of my career carefully interpreting and debunking cognitive psychology and neuroscience experiments and methodology.  I am by nature and experience primarily a theoretician. And I don't have much knowledge of the practical limitations on experiments on LLMs. I hope and expect that empiricists in this domain can improve on my initial ideas. I'd be happy to talk to anyone thinking about empirical work in these or related topics.

14. Historical context

This article addresses one common crux of disagreement, part of a somewhat mysterious gap in beliefs between alignment optimists and pessimists.   

The concerns I raise here aren't remotely new. Here I've tried to present them through a lens that will make more sense to optimists focused on LLMs and short timelines. It also happens to be my considered prediction for the most likely key point at which we'll succeed or fail at technical alignment, but that's a story for another day. 

Opinions seem to cluster more than chance or good epistemology would predict. LLMs' current alignment is either quite promising or outright terrifying for LLM-descended AGI; genuine human-style ethical behavior is fairly straightforward or nearly impossible to achieve with current alignment approaches. Discussions seem to run aground on differing intuitions, including about how goals work in LLMs and might work in LLM-based AGI, and about how behavioral training maps to goal space. And all the shoggoths merely players is the best effort to date IMO at addressing this gap and why it remains open. 

Since I entered the field full-time three years ago, I have found this gap highly troubling. It's a warning sign that perhaps no one understands this issue adequately, despite it being crucial for whether humanity survives AGI and ASI. Much of my work has attempted to address that gap, because my beliefs and intuitions span both viewpoints, and I remain uncertain which elements of both are most true and important. Here I wanted to address it explicitly, at least in brief. This article is an attempt to help bridge that gap, as well as clarify my own thinking. 

Over the six months or so, I've made a project of closely examining the most pessimistic arguments and intuitions for misalignment. As with many such controversies, some arguments from both sides are compelling. Alignment does seem hard and unlikely to happen by default with current methods, as I've emphasized here. But it does also seem that LLM alignment, particularly Claude's understanding of and apparent commitment to ethics, might be a foundation that could support true, stable alignment.

I wanted to more closely examine one particularly uncertain piece of my projected possible path to aligned AGI: the possible "phase shift" (section 5) into reflective stability.

I had hoped that we might, almost by default, get reflective stability from an LLM-based AGI on a pre-reflection goal, one we could select largely through scripted prompting in a scaffolded LLM agent (working on its highly . Such prompting would be a means of directly selecting a goal from learned knowledge, and would perhaps help in bypassing the difficulty of making behavioral training generalize properly well out of the training distribution. See section 11 and footnote[11].

I now think, for the reasons detailed here, that this alignment approach would by default probably work only until the network reasons competently about itself. I’m highly uncertain about whether there’s a way past that difficulty. I hope there is, because that path mitigated many of the classical concerns about misalignment. Working as a team between an instruction-issuing group of humans and an increasingly intelligent AGI motivated to prevent its own alignment from misgeneralizing would provide. 

With that hope looking quite questionable (although not entirely dashed), those classic alignment concerns have become more central to my understanding of the alignment problem as we must face it, on the path we actually take to AGI.

Thus, this piece. It is an attempt to explore and explain some classical reasons for pessimism, but more specifically within the context of LLMs. I have been deeply troubled by the gulf in beliefs between optimistic and pessimistic alignment researchers. Estimates of alignment difficulty seem split by some difference in intuition, framing, or outlook that has not been fully explained. The existence and persistence of such a gap, without a convincing explanation, suggests that none of us is yet understanding the issues clearly. There seems to be an agreement that misgeneralization resulting from context shifts could be a large problem for alignment, but not on how likely this is by default, or how difficult it might be to address this possibility as we get closer to takeover-capable AGI.

I've spent a lot of time in total and some focused time recently trying to understand the range and clusters of differing beliefs. I did this examination partly in the context of extended conversations with @Jeremy Gillen, a former MIRI researcher. I hope explaining why I became more pessimistic illuminates some common cruxes and help bridge one part of the gap in mutual understanding between optimists and pessimists.  

See my Cruxes of disagreement on alignment difficulty for more common sources of disagreement on the difficulty of aligning AGI and ASI. I'm concerned that those cruxes are suspiciously aligned across individuals. To me, this pattern suggests a polarization across groups.  After making cognitive biases one focus of my research in psychology and neuroscience for around 8 years, I see Motivated reasoning/confirmation bias as the most important cognitive bias.  I see it as a primary source of much of the world's conflict and confusion, in concert with the simple cognitive limitations of our brain's inability to fully consider and weigh all of the available evidence. 

This polarization between viewpoints in alignment research doesn't look too severe thus far. But I see it, and other emotional challenges of thinking and communicating about complex and crucial ideas under time pressure, as a major practical challenge to completing the alignment project successfully. Gently bridging that gap is a challenging project, and one that's mostly outside of my training and competence. But I feel compelled to attempt it, because this gap seems like a nontrivial obstruction to good alignment research. 

I hope this piece helps in some small part to bridge that gap, by clarifying the central disagreement at least marginally.

15. Conclusion

It seems likely that LLM-based AGI will functionally “want” to reason about their top-level goals. This and other existing work have just scratched the surface of analyzing how their reasoning might go.

Empirical work on this topic seems useful in current networks. Surprisingly little empirical work addresses the effects of reasoning or even long conversations on beliefs, goals, or values. 

My round of thinking about System 2 Alignment  methods did little to reassure me. Such methods are likely to be employed, but seem failure-prone and a source of false confidence given likely time pressures and motivated reasoning. This latest round of reasoning about reasoning about goals has left me even more pessimistic, but still with many hopes.

My remaining hopes are largely pinned on remaining uncertainties. I think strong optimists and pessimists both are overestimating their certainty about the ease or difficulty of aligning LLM-based AGI. The science of alignment seems to be in its infancy. There may be time to mature it substantially - if we rush.   My research plan is to rush, but try to consider the whole problem. I hope alternating between broad and narrow approaches might prevent wasting detailed work on less-relevant problems and solutions.

My next planned work is to examine the possible "phase change" or "crystallization" to reflective stability (section 5) more closely, reasoning about how this might be studied and induced as one aspect among many in a hodge-podge approach to alignment. I hope to find collaborators for the empirical component of this work. 

 

Acknowledgements: 

Thanks to Jeremy Gillen, Steve Byrnes, Egg Syntax, Viktor Bowellius, Jaeson Booker, Roger Dearnaley, and Peter Gebauer for reading early drafts and useful conversations on this topic, and to many others whose ears I have bothered and brains I have picked.

The "SuperClaude is super nice" framing was inspired by Steve Byrnes' characterization of LLM AGI alignment optimism in the excellent Foom & Doom 2.

Conversations with Jeremy Gillen were instrumental in changing my intuitions on the risks from these routes, and working through implications for my vision of the default route to AGI and (mis)alignment. 

 

  1. ^

    "Top-level goals" is used primarily to mean "whatever the AGI decides to think of as its top-level goals". This decision might be based on its specific mechanisms for decision-making. The scare quotes are intended to emphasize that this is complex and debatable for agents with more complex decision criteria than the maximizer AI thought experiment. 

    I  feel that precisely defining terms that are under discussion isn't useful, so I'm leaving a lot of terminology loose here.

    The secondary use of "true" top-level goals is "true in some deep sense about the world and this particular cognitive architecture". To the extent there is such a truth, we might actually predict reliably what conclusions a highly intelligent system would reach about itself and its true top-level goals. Working on the logic of goals relative to architectures therefore seems potentially useful.

  2. ^

    I’ve spent way too long contemplating those terms and concepts, so I’m happy to find them somewhat relevant for alignment. For more than you wanted to know, see my Neural mechanisms of human decision-making and even more in How sequential interactive processing within frontostriatal loops supports a continuum of habitual to controlled processing.  Very roughly, if we'd say "I thought about it", we definitely engaged the second class of more elaborated cognition. Those terms aren’t isomorphic but are overlapping enough for the current purposes. The broad point is that there are known to be cognitive strategies that produce large differences in output, and that models' current selection of goals is clearly currently in the first class and not the second. 

    This was a larger framing in earlier drafts, but focusing on the ML framing of misgeneralization seemed more useful for alignment researchers.

  3. ^

    I don't think it's very useful to sharply differentiate goals, values, and beliefs because there are fuzzy boundaries between those categories in humans and LLMs. Here I mostly use "goal" but have in mind the overlap with values and value- or goal-laden beliefs.

  4. ^

    I haven't yet really tried to work through whether training could practically be made broad enough to cover all hypothetical scenarios a SuperClaude could  encounter or hypothesize. Perhaps we should figure out whether and how it could.

  5. ^

    Here are a few more details about how I'm imagining SuperClaude and why I think this scenario is realistic and important. They're not crucial for risks from reasoning about goals. 

    We'll assume that SuperClaude isn't a lot smarter or more competent than previous versions out of the box, but it performs longer time-horizon tasks better after being "taught" them by humans and/or "practicing" them, much like a human would. Dwarkesh (among others) has argued LLMs/agents will need such learning to get beyond the "brilliant day-one intern" stage they're currently at. I have argued we'll see it soon, because it's needed and because multiple types of online learning are in development now and just need to be integrated. 

    SuperClaude's limited learning during deployment could make it transformative or takeover-capable after over time. In a luckier scenario, it might become misaligned after reasoning about its goals but not have the capability to hide it or to quickly be truly dangerous. This would serve as a valuable warning shot.

    This ability to learn and remember, even in a limited way, is important to the current scenario primarily because it could make re-interpretations (misgeneralizations) of goals permanent. And a misaligned SuperClaude could be instrumental in building and aligning the next generation of AGI, as in AI 2027. 

  6. ^

    I avoid the terms inner and outer misalignment here because I don't find them particularly intuitive or clarifying in their original technical definition. There's not a sharp border between them; see this, this, this, and this article for problems with this way of classifying alignment failures and challenges. I prefer alignment misgeneralization as a less-precise but also less confusing term; see section 4.

  7. ^

     I'm a big advocate of anthropomorphizing AGI as an intuition pump; see Anthropomorphizing AI might be good, actually. But we should probably base alignment plans on more systematic reasoning. If we do anthropomorphize LLM alignment, we might still worry that some normal-seeming ten-year-olds do grow up to be murderers or to advocate for genocide. And that extraordinary circumstances make nice people do things that aren't nice from an ordinary point of view.

  8. ^

    To be fair, many alignment optimists imagine transformative "AGI" that is not a general reasoner, merely skilled at many tasks. I think this is unlikely for both practical and theoretical reasons, but those proofs will not fit in the margin here. The principal argument is that training a system for individual tasks could be quite inefficient compared to training it to generalize to new tasks;. In addition, progress to date seems mainly toward general rather than specialized capabilities; and humans seem capable largely because they apply general reasoning to learning new specific tasks. This also seems to be a fairly strong majority view among serious AGI prognosticators. Even if it weren't necessary or the easiest route, some enthusiastic researchers, philosophers or xenopoiesisists would want to develop general reasoners as soon as tool AIs made it easy enough.

  9. ^

    The empiricist's dream might be to not need theory of goal spaces, and instead to test the relevant behaviors and create interpretability techniques that turn thoughts into objective observations. Both of those seem like worthy projects. But science has always progressed as a marriage of theory and empirical work. And addressing future risks now would seem particularly to benefit from theory. 

  10. ^

    Goals in networks are necessarily at least somewhat fuzzy and ambiguous. 
    Network weights seem to encode semantics in broadly distributed patterns. For LLMs, meanings of words and their underlying network representations will vary based on the context surrounding any given usage.  Thus, any network representation (and perhaps any plausible AGI representation), including but not limited to goals, could probably be interrogated/interpreted at great length (at least for representations of goals more complex than “make diamonds”).  

    Despite this, we probably shouldn't assume that valid interpretations can vary without limit. More inspection might change only the nuances and precise boundary conditions, rather than opening up entirely new interpretations. The distributed and relational nature of network knowledge is a reason for concern but not despair for alignment just yet. Human knowledge is also based on networks, and it seems able to adequately contact reality in some cases.

  11. ^

    For my vision of an aligned "oracle-based agent", see Capabilities and alignment of LLM cognitive architectures for the core architecture of an LLM as core cognitive engine in a scaffolded "composite mind"; Internal independent review for language model agent alignment for one key alignment technique of making this mind's decision-making process include an independent LLM monitoring for mistakes and misalignments; Instruction-following AGI is easier and more likely than value aligned AGI and Conflating value alignment and intent alignment is causing confusion for the core ideas of leveraging the core LLM's suggestible/instruction-following nature to overcome the degree to which it is goal-directed.

    This vision could still work, and I think the path of least resistance to AGI still incorporates elements of this path. See System 2 Alignment. I haven't given up hope that this can and will work if it's carefully analyzed and developed; but Problems with instruction-following as an alignment target detail some problems, and the current article addresses the problems raised by the increased use of RL training instead of scaffolding to progress LLMs toward competent, useful, and dangerous general intelligence.

  12. ^

    We had a joke in my cog. psych/neurosci department: 

    At a prestigious conference, Jay McClelland, champion of the general learning mechanisms view of human cognition, brings Alex, the world's best-educated (and most stressed out) gray parrot, onstage for his talk. He invites the audience to ask Alex questions, and Alex responds, with limited intelligence, but clearly understanding somewhat, and responding with complex semantics and grammar. McClelland says "See, language is not innate and unique to humans; this bird has learned it!" The audience, including Steve Pinker, champion of the innateness of language and cognition, is stunned. But Pinker's upstart grad student Gary Marcus stands up and shouts from the audience: "Anecdote is not the singular of data! Come back when you've got real science!"

    Anecdotes and singular observations seem crucial to good science as it's actually practiced. In my observations of the fields of cognitive psychology and cognitive neuroscience, anecdotes and introspection were clearly critical for scientific progress, since they inspired and directed careful experimental study and careful theoretical thinking.

    While that joke was real, as it happens, I was the one who made it up.

    Alex the gray parrot was good with language, but not good enough to settle that debate. It took LLMs to do that.