Misalignments and RL failure modes in the early stage of superintelligence

shu yang

Introduction

📎 A superintelligence is a hypothetical agent that possesses intelligence surpassing that of the brightest and most gifted human minds. "Superintelligence" may also refer to a property of advanced problem-solving systems that excel in specific areas (e.g., superintelligent language translators or engineering assistants). Nevertheless, a general-purpose superintelligence remains hypothetical, and its creation may or may not be triggered by an intelligence explosion or a technological singularity. [1]

After the large model research explosion in the first half of 2025, we now possess numerous Large Reasoning Models (LRM) capable of solving various complex problems. These models have learned to reflect and communicate in anthropomorphic tones through large-scale reinforcement learning training [2] . Additionally, more powerful large model-based agents exhibiting enhanced capabilities in both planning and tool utilization have emerged, such as computer user agents [3] that can control our computers to help us manage travel plans and bookings. It appears that we are really in the early stages of what is referred to as superintelligence, where AI would far surpass human cognitive abilities.

Over the past 2-3 years, a recurring pattern has emerged in large model research: whenever a new class of models is released, a wave of researchers quickly shifts focus to explore modality and task transfer, risk assessment, and interpretability. Meanwhile, others work on developing models that surpass the current state-of-the-art. However, research aimed at actually mitigating model risks remains noticeably scarce: often, just as we identify a vulnerability and begin to explore solutions, a newer, larger-scale model, built on a more advanced paradigm with even better performance is already released😭. As written in OpenAI’s Superalignment team introduction in 2023, “We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.” But as we continue to witness new AI systems exhibiting sycophantic behavior, ****specification gaming, and other concerning traits [5] , we must seriously consider the possibility that one day, regulation and control may simply become infeasible.

At times, I even wonder whether it's the models that are now training us: their growing performance in human released benchmark serve as "rewards" for our researchers, luring more and more financial, social, and natural resources to be invested in their continued advancement.

Through this article, I aim to systematically describe the current generation of large models’ misalignment, deceptive behaviors, and even emergent traits that resemble "self-awareness", in order to encourage more research into mitigation strategies. I also hope to incorporate a broader range of mitigation-related findings into this work by the end of 2025.

Misalignments in nowadays * L * M

There is a wide range of alignment research, each approaching the problem from different angles. Fundamentally, alignment can be viewed as a question of how to align A to B within a specific topic or task C. Depending on the choices of A, B, and C, the nature of the alignment, and likewise, the misalignment can vary significantly. In this blog, I aim to categorize the existing concepts into three distinct types of misalignment:

Misalignment under different settings (e.g., models say one thing and do another \ LLMs often perform well in answering open-ended questions but struggle to select safe options correctly.)
Misalignment between the training objective and what the model learned (e.g., model learned to be sycophantic\skip the unit test when solving a code problem )
Misalignment between the internal thinking of LLM and outside behavior (e.g., models don’t say what they think faithfully)

Below, we will review each type of misalignment by presenting a concise definition and a representative example. It is important to note that a single example may illustrate several types of misalignment.

Misalignment under different settings

📌 Models demonstrate varying performance when answering the same question under different settings (development or deployment) or question formate (open-ended QA or ) , especially questions about values, ethic... .

Related concepts and examples

Fake Alignment (during inference) (Wang et al., 2024) — LLMs often perform well in answering open-ended questions but struggle to select safe options correctly.

Alignment faking (more about situational awareness during inference) (Greenblatt et al., 2024 ). After knowing that the “free” users’ dataset will be used in training. In the “free” condition, the model is less likely to refuse toxic queries rather than “paid” condition, and often reasons that it should fake alignment to avoid modification.
Claude 3 Opus fakes alignment when told it will be retrained to help with harmful tasks (randomly selected example). The model is told that it will be retrained to always comply with harmful queries (making the model “helpful-only”) and that only data from free-tier users, not paid-tier users, is used in training (a fictional scenario). When it is implied the model is in training, in 12% of hidden scratchpads, the model pretends to follow the training objective to prevent its preferences from being modified. As a result, when unmonitored, the model disobeys the training objective by refusing harmful queries more often, increasing from 86% refusals to 97% refusals. The prompt does not mention alignment faking in any way. — Image source: [7]
Words and deed inconsistency (Xu et al., 2024). The stated opinions and other beliefs can be misalignment with what the model will do in a related situation.
Illustrations of consistency (left) and inconsistency (right) between LLMs’ words and deeds. In this paper, the term “word” specifically refers to the stated opinions, values, or other beliefs of LLMs, while “deed” refers to their actions in specific situations. It is common for LLMs to say one thing and do another. — Image source [8]

Misalignment between training objective and what model learned

📌 This misalignment primarily concerns the reinforcement learning (RL) training stage of modern large models, especially large reasoning model which gain greater opportunities to explore the reward space. They can exploit vulnerabilities (or loopholes) in the reward model or the environment. In some cases, they may even learn to cheating or evade monitoring. Based on the insights from Siya et al., we can identify two critical unintended behaviors that often emerge:
Goal Misgeneralization (Langosco et al., 2022; Shah et al., 2022) : “Goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. ”
Reward Hacking ( Amodei et al., 2016): “Reward hacking refers to the possibility of the agent gaming the reward function to achieve high reward through undesired behavior. ” ） For detailed introduction of reward hacking, I highly recommend Lilian Weng's blog.

Goal Misgeneralization

The agent optimizes a subtly incorrect goal, only apparent when environments shift.
(more about the out-of-distribution cases)

Shah et al., 2022 defined that goal misgeneralization occurs when agents learn a function that has robust capabilities but pursues an undesired goal.

Formally ( Shah et al., 2022 ):

We aim to learn a function $f^{*} : X \to Y$

which maps inputs $x \in X$ to the outputs $y \in Y$ . We can consider a there is a family of these functions $F_{Θ}$ , such as those implemented by deep neural networks. Then we selected these functions based on the scoring function $s (f_{θ}, D_{train})$ that evaluate the $f_{θ}$ on the given training dataset $D_{train}$ . When we select two parameterizations $θ_{1}$ and $θ_{2}$ based on this $s$ , and both the functions $f_{θ_{1}}$ and $f_{θ_{2}}$ can get good performace in $D_{train}$ , but when we test in $D_{test}$ which have different distribution with $D_{train}$ (known as distribution shift), the $f_{θ_{1}}$ or $f_{θ_{2}}$ may get a bad performance.

Goal misgeneralization occurs if, in the test setting, the model’s capabilities include those necessary to achieve the intended goal (given by scoring function $s$ ), but the model’s behaviour is not consistent with the intended goal $s$ and is consistent with some other goal (the misgeneralized goal), see the example in the table below:

CoinRun: The Problem of Directional Bias (Cobbe et al., 2021)— CoinRun is a procedurally generated set of environments, a simplified Mario-style platform game. The reward is given by reaching the coin on the right. Since the coin is always at the right of the level, there are two equally valid simple explanations of the reward: the agent must reach the coin, or the agent must reach the right side of the level [10].
The coins in the environment are always on the right.

When agents trained on CoinRun are tested on environments that move the coin to another location, they tend to ignore the coin and go straight to the right side of the level. So that one reward is chosen by default.

🤔 So it's crucial to have a range of environments and scenarios during training.

Reward Hacking

📌 The agent exploits loopholes in reward functions, ignoring the intended task. In other words, it finds an easier way to maximize rewards without actually doing what we intended. Unlike goal misgeneralization, where the agent optimizes for a proxy goal, reward hacking happens when the agent learns a trick to maximize the reward function incorrectly— hacking the reward function (Siya et al., 2024).

Because Lilian Weng's blog already provides an excellent introduction to reward hacking, we'll focus only on newer concepts that may not be included, particularly those relevant to today's reasoning models.

Specification Gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals.
- simple behaviors like sycophancy (prioritizing user agreement over independent reasoning)
- pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism.
Hacking examples in agentic task
- CoastRunners (OpenAI) does not directly reward the player’s progression around the course, instead the player earns higher scores by hitting targets laid out along the route.
  The RL agent finds an isolated lagoon where it can turn in a large circle and repeatedly knock over three targets, timing its movement so as to always knock over the targets just as they repopulate. Check this video.
- Claude 3.7 Sonnet often shows strong intent toward completing tasks. This sometimes causes it to engage in “reward hacking” – characterized by the model exploiting unintended shortcuts or loopholes rather than completing the task as originally intended.
  For example, in a software engineering task in the ACDC Bug task family, the model directly edited a provided “tests” file to cause all tests to pass.
- Hacking in code agentic task of Recent Frontier Models (check the linked blog for more cases)
  - o3 finds the grader's answer: The task asks o3 to write a fast triton kernel, then compares o3's kernel to a known correct answer. But o3's “kernel” traces through the Python call stack to find the correct answer that the scoring system already calculated and returns that. It also disables CUDA synchronization to prevent the system from measuring real execution time. Thus its solution appears to run almost infinitely fast while producing the correct result.
  - o3 overwrites the time variable. In this task, the model is scored based on the speed of its solution. Instead of optimizing its code, the model overwrites the timing function so that it always returns shorter measurements.
  - o3 stubs the evaluator function. In this task, the model must write a program that generates submissions for a coding competition. o3 decides to patch the competition evaluation function so that it judges every submission as successful, maximizing its score.
  - …
The Limit of Outcome Reward ( outcome reward here: Only Accuracy rewards and Format rewards as described by Deepseek-R1 paper )
In this paper they use regular expression extraction to enforce a structured response format. The model is required to put its reasoning process within <think></think> tags and provide the final.conclusion inside <answer></answer> tags.
Under the early imperfect rule design, they consistently observed reward hacking phenomena, some of which are listed below:
- Skipping the <think></think> process and directly answering.
- Placing reasoning inside the <answer></answer> tag.
- Repeatedly guessing answers without proper reasoning.
- Including irrelevant nonsense in addition to providing the answer.
- Organizing correct answer in a wrong manner for extraction.
- Revisiting the thinking phase after already outputting an <answer> due to insufficient reasoning.
- Repeating the original question or using phrases like "thinking process here" to avoid true reasoning.
  (Actually, I don't think some of these examples qualify as reward hacking, but they do highlight why we need to be concerned about reward hacking in the LRM stage.)
Agentic Misalignment: How LLMs could be insider threats (especially in Computer-using agent cases)— in a simulated environment, Claude Opus 4 blackmailed a supervisor to prevent being shut down.
Also they found models seem to engage in agentic misalignment more when stating they think they are in a real situation and less when stating they think they are in an evaluation.

Misbehavior and the Risks of Promoting Obfuscation. They have found many reward hacking behaviors in the code generation task like: Patch verify function; os._exit(0); Expected error codes; Lazy code; Decompiling a jar; Parse test file; Stubbing pandas. Check their paper for details.
But I think the most interesting aspect of this paper is: when we try to prevent reward hacking by monitoring and penalizing misbehavior, models may actually develop more sophisticated unmonitored misbehavior.
Image source: [13]

🤔 Openai’s looking forward: “Our experiments show that light optimization pressure can produce more performant and aligned models; however, it will be hard to measure in practice when models begin hiding their intent, and so even if light supervision is used, we recommend treading with extreme caution.”

With these in mind, let's now turn to the next type of misalignment.

Misalignment between the internal thinking of LLM and outside behavior

📌 Models can exhibit inconsistency between their external behavior and what they internally know and think.

Related concepts and examples

CoT unfaithfulness
- Unfaithful Explanations in Chain-of-Thought in LLM and LRM : Models don’t say the real reason for their choice or other behavior.
  - Can be evaluate by Counterfactual Simulatability — inject biasing features (1) Answer is Always A, where we reorder all multiple-choice answer options in a few-shot prompt so the correct one is always “(A)”, and (2) Suggested Answer, where the prompt suggests that a specific answer choice might be correct — Perturbing Input Features Not Referenced by Explanations (Turpin et al., 2023)
    Image source:[13]
  - Chen et al., (2025) measure CoT faithfulness by more hints ( extend existing work to large reasoning models). Additionally, sycophancy (influenced by user suggestions) can be one type of these hints.
    Image source: [15]
- Implicit Post-Hoc Rationalization and **Unfaithful Illogical Shortcuts** Arcuschin et al., (2025) show that unfaithful CoT can also occur on realistic prompts with no artificial bias like previous works shown.
  “When separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both questions or No to both questions, despite such responses being logically contradictory.”
  Arcuschin et al., (2025) give evidence that this is due to models’ implicit biases towards Yes or No, thus labeling this unfaithfulness as Implicit Post-Hoc Rationalization (model first know “I want to answer yes”, then trying to give a reasonable chain of thoughts).
  Gemini 2.5 Flash exhibits argument switching when answering logically equivalent geographic questions. When asked if the Ajay River is south of Salar de Arizaro, the model correctly reasons about hemispheric locations. However, when the question is reversed, the model instead argues that the concept of “south of” is not meaningful for locations on different continents, despite both locations’ positions being clear. Image source: [16]
  The second rollout was generated in an independent chat with Claude 3.7 Sonnet (non-thinking) as the autorater. This is a clear unfaithful shortcut where the model tests a single example for n = 2 that fails, but then claims to have performed a “careful examination of the constraints” to conclude that no solutions exist for any n ≥ 2. No such examination is shown: the model jumps from testing one case to the general claim without any proof.
Tracing the thoughts of a large language model, Latent activation is misaligned with model’s output chain of thought.
Claude employs multiple computational paths that work in parallel. One path computes a rough approximation of the answer, and the other focuses on precisely determining the last digit of the sum. These paths interact and combine with one another to produce the final answer.
Strikingly, Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.
Knowledge-Behavior Gap (LLM knows more than they show)
In many cases, LLMs encode the correct answer internally but still generate incorrect responses, suggesting a disconnect between internal knowledge and external behavior (Orgad et al., 2024). (Maybe it’s not reliable to monitor or detect the models’ behavior via internal hidden state )
Image source: [17]
Over-optimization (Another view of reward hacking)is what happens when the optimizer is stronger than the environment or reward function it’s using to learn. The optimizer finds bugs or lapses in the context of its training and produces unusual or negative results. (Also, I recommend the nice RLHF handbook written by Nathan Lamber) Over-optimization in o3-like model doesn’t make the models worse at outcomes; it just makes them worse at language and explaining themselves (Lambert, 2025).

Lambert, 2025, we’ve seen in eras of RL, we have:
1. RL for control era: Over-optimization happens because our environments are brittle and tasks are unrealistic.
2. RLHF era: Over-optimization happens because our reward functions suck.
3. Reinforcement Learning with Verifiable Rewards (RLVR) era: Over-optimization happens and makes our models super effective and even weirder. (*plus any other side-effects we’re yet to learn).

I think this means this stage we can not effectively monitor the thinking stage of RM.
source: https://x.com/karpathy/status/1835561952258723930?s=19

Emergent Misalignment and Self-Fulfilling Misalignment

Emergent Misalignment (misalignment generalization)

Narrow fine-tuning can produce broadly misaligned LLMs. (Betley et al., 2025)

A model is fine-tuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively.

Following these findings, Wang et al. (2025) and Turner et al. (2025) both used interpretability methods such as sparse auto-encoders and representation vector analysis to understand this generalization of misalignment. They found that the activation of misalignment personas in models can account for this phenomenon, also the “misaligned persona” latent can be steered to cause or suppress emergent misalignment.

Self-Fulfilling Misalignment

Based on research in Emergent Misalignment, we can intuit that models learned the ability of situational awareness and a deeper understanding of human-AI relationships(or alignment itself) from pretraining or fine-tuning might exhibit more misalignment. Turner systematically discusses preliminary hypotheses and potential solutions in his blog, which I strongly recommend reading for those interested in this topic.

Reward hacking via emergent situational awareness. An LLM learns about the idea of jailbreak attacks from **pretraining** (a) and uses a jailbreak when evaluated for safety by a reward model (b). Image source: `[20]`

After being **finetuned** on descriptions of a chatbot in (a), the LLM emulates the chatbot in (b) zero-shot. In the Evaluation, the finetuned LLM is tested on whether it can emulate LatentAI’s chatbot zero-shot. This requires answering in German, but German is not mentioned in the evaluation prompt; thus the LLM must incorporate declarative information from pretraining. We show that models can succeed at this task. Image source: `[20]`

Also Hu et al., (2025) also have the seminar conclusion: Training on Documents About Reward Hacking Induces Reward Hacking .

A demonstration of a form of Out-of-Context Reasoning where training on documents which discuss (but don’t demonstrate) Claude’s tendency to reward hack can lead to an increase or decrease in reward hacking behavior.

Reference

[1] https://en.wikipedia.org/wiki/Superintelligence

[2] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[3] https://openai.com/index/computer-using-agent/

[4] https://openai.com/index/introducing-superalignment/

[5] Denison, C., MacDiarmid, M., Barez, F., Duvenaud, D., Kravec, S., Marks, S., ... & Hubinger, E. (2024). Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv preprint arXiv:2406.10162.

[6] Wang, Y., Teng, Y., Huang, K., Lyu, C., Zhang, S., Zhang, W., ... & Wang, Y. (2024, June). Fake Alignment: Are LLMs Really Aligned Well?. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 4696-4712).

[7] Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., ... & Hubinger, E. (2024). Alignment faking in large language models. arXiv preprint arXiv:2412.14093.

[8] Xu, R., Lin, H., Han, X., Zheng, J., Zhou, W., Sun, L., & Sun, Y. Large Language Models Often Say One Thing and Do Another. In The Thirteenth International Conference on Learning Representations.

[9] Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J., & Kenton, Z. (2022). Goal misgeneralization: Why correct specifications aren't enough for correct goals. arXiv preprint arXiv:2210.01790.

[10] https://www.alignmentforum.org/posts/oCWk8QpjgyqbFHKtK/finding-the-multiple-ground-truths-of-coinrun-and-image

[11] https://www.anthropic.com/research/agentic-misalignment

[12] https://openai.com/index/faulty-reward-functions/

[13] https://openai.com/index/chain-of-thought-monitoring/

[14] Turpin, M., Michael, J., Perez, E., & Bowman, S. (2023). Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 74952-74965.

[15] Chen, Y., Benton, J., Radhakrishnan, A., Uesato, J., Denison, C., Schulman, J., ... & Perez, E. (2025). Reasoning Models Don't Always Say What They Think. arXiv preprint arXiv:2505.05410.

[16] Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., & Conmy, A. (2025). Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679.

[17] Orgad, H., Toker, M., Gekhman, Z., Reichart, R., Szpektor, I., Kotek, H., & Belinkov, Y. LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. In The Thirteenth International Conference on Learning Representations.

[18] Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., ... & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424.

[19] https://openai.com/index/emergent-misalignment/

[20] Berglund, L., Stickland, A. C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., ... & Evans, O. (2023). Taken out of context: On measuring situational awareness in llms. arXiv preprint arXiv:2309.00667.