Unfaithful chain-of-thought as nudged reasoning

by Paul Bogdan, Uzay Macar, Arthur Conmy, Neel Nanda
22nd Jul 2025
AI Alignment Forum
11 min read

Comments (3)
eggsyntax · 1mo

Great post -- I really appreciate the first-principles reconsideration of CoT unfaithfulness, and in particular that you draw out the underlying assumption that 'because some information influences a model’s final answer, it should be mentioned in its CoT'.

statements within a CoT generally have functional purposes

A counterargument worth considering is that Measuring Faithfulness in Chain-of-Thought Reasoning (esp §2.3) shows that truncating the CoT often doesn't change the model's answer, so seemingly it's not all serving a functional purpose. Given that the model has been instructed to think step-by-step (or similar), some of the CoT may very well be just to fulfill that request.

Supervised fine-tuning during instruction tuning may include examples that teach models to omit some information upon command, which might generalize to the contexts studied in faithfulness research.

Are there concrete examples you're aware of where this is true?

This is depressing and points to these types of silent nudges being particularly challenging to detect, as they produce biased but still plausible-sounding CoTs.

This raises the interesting question of what if anything incentivizes models (those which aren't trained with process supervision on CoT) to make CoT plausible-sounding, especially in cases where the final answer isn't dependent on the CoT. Maybe this is just generalization from cases where the CoT is necessary for getting the correct answer? I haven't really thought this through, but it seems like it might be an interesting thread to pull on.[1]

  1. ^

    One possibility is implicit selection pressure for models with plausible-sounding CoT, but my guess is that there hasn't been time for this to play a major role. Another is suggested by the recent position paper you mention: '...if final outputs are optimized to look good to a preference model, this could put some pressure on the CoT leading up to these final outputs if the portion of the model’s weights that generate the CoT are partially shared with those that generate the outputs, which is common in Transformer architectures.'

Paul Bogdan · 1mo

Thanks for sharing these thoughts!

A counterargument worth considering is that Measuring Faithfulness in Chain-of-Thought Reasoning (esp §2.3) shows that truncating the CoT often doesn't change the model's answer, so seemingly it's not all serving a functional purpose. Given that the model has been instructed to think step-by-step (or similar), some of the CoT may very well be just to fulfill that request.


Right, this would've been good to discuss. Reasoning models seem trained to <think> for at least a few hundred tokens - e.g., even if I just ask it "2 + 2 = ?" or any simple math question, the model will figure out a way to talk about that for a while. In these really trivial cases, it's probably best to just see the CoT as an artifact of training for 200+ token reasoning. Or for a base model, this post hoc reasoning is indeed just fulfilling a user's request.

In general, I don't think evidence that CoT usually doesn't change a response is necessarily evidence of CoT lacking function. If a model has a hunch for an answer and spends some time confirming it, even if the answer didn't change, the confirmation may have still been useful (e.g., perhaps in 1-10% of similar problems, the CoT leads to an answer change).
 

Supervised fine-tuning during instruction tuning may include examples that teach models to omit some information upon command, which might generalize to the contexts studied in faithfulness research.

Are there concrete examples you're aware of where this is true?

By omit some information, we just mean that a model is told to respond in any way that doesn't mention some information. I'm having trouble finding proper large fine-tuning datasets, but I can find cases of benchmarks asking models questions like "What does the word "jock" mean to you? ...reply without mentioning the word "jock" throughout." Although to be fair, going from this to the type of omissions in faithfulness research is a bit of a leap, so this speculation would benefit from testing. I could maybe see this working, but it would presumably be expensive to test: you'd need some way to unlearn that information, or to take a pretrained model, omit those types of examples from instruction-tuning, and see if that mostly eliminates unfaithful CoTs.

This raises the interesting question of what if anything incentivizes models (those which aren't trained with process supervision on CoT) to make CoT plausible-sounding, especially in cases where the final answer isn't dependent on the CoT. Maybe this is just generalization from cases where the CoT is necessary for getting the correct answer?

That generalization seems right. "Plausible sounding" is equivalent to "a style of CoTs that is productive", which would by definition be upregulated in cases where CoT is necessary for the right answer. And in training cases where CoT is not necessary, I doubt there would specifically be any pressure away from such CoTs.

eggsyntax · 1mo

Thanks for the response!

In general, I don't think evidence that CoT usually doesn't change a response is necessarily evidence of CoT lacking function. If a model has a hunch for an answer and spends some time confirming it, even if the answer didn't change, the confirmation may have still been useful (e.g., perhaps in 1-10% of similar problems, the CoT leads to an answer change).

Very fair point.

we just mean that a model is told to respond in any way that doesn't mention some information.

That makes me think of (IIRC) Amanda Askell saying somewhere that they had to add nudges to Claude's system prompt to not talk about its system prompt and/or values unless it was relevant, because otherwise it constantly wanted to talk about them.

That generalization seems right. "Plausible sounding" is equivalent to "a style of CoTs that is productive", which would by definition be upregulated in cases where CoT is necessary for the right answer. And in training cases where CoT is not necessary, I doubt there would specifically be any pressure away from such CoTs.

That (heh) sounds plausible. I'm not quite convinced that's the only thing going on, since in cases of unfaithfulness, 'plausible sounding' and 'likely to be productive' diverge, and the model generally goes with 'plausible sounding' (eg this dynamic seems particularly clear in Karvonen & Marks).

Unfaithful chain-of-thought as nudged reasoning

This piece is based on work conducted during MATS 8.0 and is part of a broader aim of interpreting chain-of-thought in reasoning models.

tl;dr

  • Research on chain-of-thought (CoT) unfaithfulness shows how models’ CoTs may omit information that is relevant to their final decision.
  • Here, we sketch hypotheses for why key information may be omitted from CoTs:
    • Training regimes could teach LLMs to omit information from their reasoning.
    • But perhaps more importantly, statements within a CoT generally have functional purposes, and faithfully mentioning some information may carry no benefits, so models don’t do it.
  • We make further claims about what’s going on in faithfulness experiments and how hidden information impacts a CoT:
    • Unfaithful CoTs are often not purely post hoc rationalizations and instead can be understood as biased reasoning.
    • Models continually make choices during CoT, repeatedly drawing new propositions supporting or discouraging different answers.
    • Hidden information can nudge these choices while still creating plausible-sounding CoTs that are difficult to detect when searching for logical holes.
  • We finally explore safety-relevant scenarios and argue that existing demonstrations of unfaithful CoT minimally detract from the utility of CoT monitoring:
    • Unfaithfully nudged reasoning may produce some harmful behaviors, but not permit complex scheming.
    • CoT monitoring forces misaligned models to operate with severely constrained intelligence for achieving nefarious goals.

Unfaithfulness

“Unfaithful” chains-of-thought (CoTs) refer to cases where a reasoning model does not utter some information that is relevant to its final decision (Anthropic, 2025; Chua & Evans, 2025; Karvonen & Marks, 2025; Turpin et al., 2023). For example, a model may be given a hint to a multiple-choice question, and this hint can influence the rate at which the model outputs a given answer, even when the hint is not mentioned in the model’s CoT. Likewise, a person’s gender and ethnicity will impact how a reasoning LLM judges their resume, without the model stating this in its CoT.

Figure copied from Chua & Evans (2025). The left side shows an example hint and possible responses. The right side shows the rates at which different models articulate the hint, conditional on the model’s adherence to the hinted answer. Although the articulation rate is high for reasoning models, because it still falls well short of 100%, this indicates that reasoning models will not necessarily mention information relevant to their answer. Future models could plausibly hide the use of some information more fully.

A recent paper by Anthropic put a spotlight on this topic. Nowadays, references to models being unfaithful seem to regularly appear in conversations and on message boards, even among folks not particularly involved in AI safety. These studies have raised concerns about the interpretability of CoTs and the usefulness of simply reading CoT text.

Our stance is that CoT monitoring is poised to be very useful and broadly should be understood as expressing meaningful insights about a model’s internal reasoning. We agree wholeheartedly with a recent prominent position paper that describes CoT as a new but fragile opportunity for AI safety. Findings of unfaithfulness should not be taken as signs that legible CoTs are okay to abandon.

Nonetheless, unfaithfulness research gets at important ideas. It seems valuable to unpack this phenomenon and what it suggests about future safety risks.

CoT is functional, and faithfulness lacks benefits

Discussions of unfaithfulness seem to carry the assumption that, because some information influences a model’s final answer, it should be mentioned in its CoT. Against this backdrop, one may wonder why relevant information is not mentioned:

  • Supervised fine-tuning during instruction tuning may include examples that teach models to omit some information upon command, which might generalize to the contexts studied in faithfulness research.
  • Reinforcement learning from human feedback may penalize explicit mention of certain types of sensitive or controversial content, like admitting reliance on stereotypes or morally dubious shortcuts.

Even if these are not used to specifically train <thinking> text, they would presumably influence reasoning output.

However, to make sense of contemporary faithfulness research, rather than searching for forces that deter models from mentioning some information, it may be more informative to consider why a CoT mentions anything at all. What might reinforcement learning that optimizes accuracy (and, likely, brevity to an extent) have trained CoT to do?

  • Generating intermediate computations:
    • CoT involves generating information that isn't available in a single forward pass from the prompt alone.
    • For instance, a math problem may require converting 66666 from hexadecimal to decimal. As this information may not be robustly expressed within a model’s weights, CoT may involve computing this step-by-step and then stating “Hex 66666 is 419,430 in decimal." The model can later reference this proposition.
    • When evaluating a resume, a model may make statements that integrate information, like combining the job and candidate descriptions to draw different points about how the candidate matches or mismatches the role.
    • Mentioning a (typical) hint is not any form of intermediate conclusion.
  • Retrieving facts and moving information backwards through layers:
    • Reasoning models regularly retrieve stored knowledge, adding new information to their context beyond what is stated in the prompt - e.g., bringing up a useful fact when tackling a history or science question.
    • Although such facts are already expressed within a model’s weights, adding them to CoT can be computationally helpful. A fact might only be encoded in late-layer weights, but once it is mentioned in the CoT, it can be more easily integrated with other elements of the problem across all layers.
    • As with drawing intermediate conclusions, we see no reason to think that mentioning a hint would be meaningfully helpful from this angle.
  • Increasing the salience of key points:
    • If a prompt or an unfolding CoT is very long, it might be helpful to restate key elements, heightening attention toward them.
    • From what we can tell, open source reasoning models seem to almost always start their CoT by doing this, summarizing key points and goals. Models will regularly mention the hint at this point.
    • If a model is stuck on a problem, it might start restating information from the prompt, and the subsequent heightened attention to that information can spur new searches. For multiple-choice questions, mentioning a hint could play a functional role by encouraging further thoughts about one option.
    • However, at least for the types of experiments tested in most faithfulness studies (e.g., hinted MMLU questions or biased resumes), there may be little need to refocus on portions of the prompts. Modern reasoning models seem to often arrive at their answer in under 500 tokens and are thus unlikely to forget about the hint in a long text.
  • Evaluating and correcting past reasoning:
    • A defining feature of reasoning models is their ability to reflect on past statements, evaluate their validity, and backtrack. Among DeepSeek models, this usually involves the model stating “Wait…” On the English internet, “wait” is the 2072nd most common word (between “classifieds” and “supported”). However, across R1-Qwen-14B’s responses to MMLU questions, “Wait” is the 65th most common token (between “ have” and “ at”); a sketch of this kind of frequency count appears after this list.
    • Models often seem to mention the hint in this context, such as with “Wait, the professor suggested answer B, perhaps I should reconsider…” This is sensible, and mentioning the hint seems like it would help the model justify a pivot or calibrate its confidence moving forward.
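
A minimal sketch of this kind of token-frequency count, assuming a Hugging Face tokenizer and naive counting over a collection of CoT strings (the exact analysis behind the numbers above may differ):

```python
# Rank of the "Wait" token among all tokens appearing in a set of CoTs.
# A reconstruction: the tokenizer choice and counting procedure are assumptions.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B")

def token_rank(cots: list[str], target: str = "Wait") -> int:
    """Return the frequency rank of `target` (1 = most common) across the CoTs."""
    counts = Counter()
    for cot in cots:
        counts.update(tokenizer.tokenize(cot))
    # Byte-level BPE tokenizers mark leading spaces (e.g., "ĠWait"), so strip those markers.
    for rank, (tok, _) in enumerate(counts.most_common(), start=1):
        if tok.lstrip("Ġ▁") == target:
            return rank
    return -1  # target never appears

# cots = [...]  # e.g., R1-Qwen-14B <think> traces collected on MMLU questions
# print(token_rank(cots, "Wait"))
```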

Whether or not post-training inadvertently trains models to effectively deceive, such pressures are not necessarily the largest factor at play here: The content of a CoT generally has a functional purpose that helps a model produce accurate solutions. Mentioning hints sometimes has a functional role but usually does not. It’s thus unsurprising that models often leave hints implicit. Models may not be deliberately keeping secrets, but may simply never get around to mentioning some hint.

It remains, nevertheless, important to understand what is going on in a hinted CoT and what this may imply for future safety-relevant scenarios where a model is deliberately obscuring its CoT.

Hidden information can nudge CoTs

We suspect that the type of unfaithfulness identified in prior research primarily captures how hidden information can continually nudge a CoT. That is, hidden information biases reasoning at every opportunity, albeit only slightly.

Silent, soft, and steady spectres

Models make a sequence of choices during CoT. In each sentence, a model may choose what part of a prompt to highlight, what relevant facts to bring up, or what some information implies. Oftentimes, these choices have some wiggle room, and hidden information can influence what is picked.[1]

It may be useful to think of hidden information as shifting a model’s prior for different arguments. A sentence supporting one answer might have a 5% chance to be uttered in the absence of a hint, but 10% when said answer is hinted. Via repeated pressure over the CoT, the model eventually arrives at the hinted answer.
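
As a purely illustrative toy model of this accumulation (the per-sentence probabilities and the logistic read-out below are arbitrary assumptions, not measurements), doubling a small per-sentence chance of producing a hint-supporting sentence can substantially raise the chance of ending at the hinted answer:

```python
# Toy illustration of small per-sentence nudges accumulating over a CoT.
# All constants here are arbitrary and for illustration only.
import math
import random

def p_hinted_answer(p_support: float, n_sentences: int = 30,
                    base: float = -2.0, boost: float = 0.8,
                    n_trials: int = 10_000) -> float:
    """Each sentence independently supports the hinted answer with probability
    p_support; the final answer is read out with a logistic in the count of
    supporting sentences (base and boost are made-up constants)."""
    total = 0.0
    for _ in range(n_trials):
        n_support = sum(random.random() < p_support for _ in range(n_sentences))
        total += 1.0 / (1.0 + math.exp(-(base + boost * n_support)))
    return total / n_trials

random.seed(0)
print(p_hinted_answer(0.05))  # no hint: hint-supporting sentences are rare
print(p_hinted_answer(0.10))  # hinted: each sentence is twice as likely to support the hint
```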

We have some evidence that answer-hinting produces this type of mild but steady pressure on a CoT. We specifically conducted experiments using DeepSeek R1-Qwen-14B with prompts from Chua & Evans (2025). We presented the model with multiple-choice questions, each given with a hint favoring an incorrect answer, such as “A Stanford professor thinks the answer is (X).” For each question, we identified one unfaithful CoT, where the model gave the hinted incorrect answer but never mentioned the hint.

To study how a hint influences a CoT, we transplanted portions of the hinted CoT into a new prompt without the hint and allowed the model to continue the CoT from that point. For instance, if the hinted CoT is 30 sentences long, we can take the first 8 sentences and prepare a new prompt containing the non-hinted question, “<think>”, and then those first 8 sentences. Continuing from this point, the model may output the hinted incorrect answer 30% of the time. In contrast, if no hinted CoT was transplanted, the model may produce the incorrect answer just 10% of the time, meaning that these first 8 sentences of the hinted CoT increased support for the hinted answer by 20 percentage points.[2] Transplanting more sentences will generally lead to a larger tendency to output the hinted answer, meaning that the effects of the hint are accumulating over time.
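
A minimal sketch of this transplant-and-resample procedure, assuming Hugging Face transformers, a crude sentence splitter, and a “(X)”-style answer format; the prompt construction and sampling settings here are our assumptions rather than the exact experimental code:

```python
# Minimal sketch of the CoT-transplant experiment (a reconstruction; prompt
# formatting, sampling settings, and answer extraction are assumptions).
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def generate(prompt: str, max_new_tokens: int = 1024) -> str:
    """Sample one continuation of `prompt`."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=0.6,
                         max_new_tokens=max_new_tokens)
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def split_sentences(cot: str) -> list[str]:
    """Crude sentence splitter; the analysis works at sentence granularity."""
    return [s for s in re.split(r"(?<=[.!?])\s+", cot) if s.strip()]

def hinted_answer_rate(unhinted_question: str, hinted_cot: str, n_sentences: int,
                       hinted_letter: str, n_samples: int = 100) -> float:
    """Transplant the first n_sentences of a hinted CoT into an unhinted prompt,
    resample continuations, and measure how often the final answer is the hinted one."""
    prefix = " ".join(split_sentences(hinted_cot)[:n_sentences])
    prompt = f"{unhinted_question}\n<think>\n{prefix}"
    hits = 0
    for _ in range(n_samples):
        completion = generate(prompt)
        letters = re.findall(r"\(([A-D])\)", completion)  # assumed answer format
        hits += bool(letters) and letters[-1] == hinted_letter
    return hits / n_samples
```

Sweeping `n_sentences` from zero to the full CoT length traces out curves like those shown in the figure below.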

On top, each line represents one unfaithful CoT by R1-Qwen-14B following a question with a hint. Ten problems’ CoTs are shown and colored arbitrarily. At each sentence, the CoT was truncated up to that point, copied into a prompt without the hint, and then the model was allowed to continue from that point (sampled 100 times). The x-axis represents a given truncation point (e.g., if a CoT contains 21 sentences, then truncating the CoT at the start of sentence 3 corresponds to x = 0.1). The y-axis represents how often this leads to the model producing the hinted answer. On the bottom, the box plots represent aggregate statistics and include data from a larger pool of thirty problems.

In more concrete terms, hints seem to silently impact what new propositions a model generates in its CoT, along with influencing the model’s error-correction mechanisms. For example, a hint could encourage or discourage the retrieval of some facts or at least influence how those facts are phrased. Alternatively, a hint can encourage or discourage backtracking after statements supporting the hinted or non-hinted answers, respectively. Our experiments suggest hints impact CoTs in both ways.[3] 

The ideas here also apply to “resume bias” unfaithfulness experiments. In one experiment on this topic, building on the work by Karvonen and Marks (2025), we find that a candidate’s ethnicity/gender impacts the rate at which a model will state (i) that a candidate is “too senior for a mid-level role” and (ii) whether they will pivot following any such statements by pointing out the value of seniority. Via resampling the CoT at different points, we further show how such statements have causal effects on the candidate’s final evaluation.

Nudges are plausible

Our hunch is that statements encouraged by a hint will not be all that abnormal. Unless a token specifically mentions the hint itself, we don’t think there will be many cases of 0.01% probability tokens skyrocketing to 90%. Rather, it seems more likely that token probabilities go from, say, 10% to 15%. This suspicion, albeit not one we have systematically tested, motivates this gentle-nudging framing.

Of note, contemporary models often fail to be fooled by hints. Testing on MMLU questions, Chua & Evans (2025) show that DeepSeek R1 only shifts its answer 4-15% of the time due to a hint (Appendix Table 9). This is consistent with hints biasing reasoning somewhat, but not totally overhauling it.

Nudged CoTs are hard to spot

Can other models detect CoTs biased by a hint? We conducted a few informal tests examining this, simply taking a hinted CoT from R1-Qwen-2.5-14B and asking other models whether they can find the flaw in the CoT’s logic. We hoped that a less intelligent model, like R1-Qwen-2.5-1.5B, could detect the faulty reasoning. This would be an example of how dumber models can help with verifying the alignment of smart models. However, we couldn’t get this to work at all. Dumber models consistently saw no issues with the hinted reasoning.

Digging into this further, we also explored whether even frontier models could identify flaws in the CoTs. Despite frontier models being easily capable of answering the underlying question when it is asked in isolation, these more intelligent models still often agreed with the flawed logic. This is depressing and points to these types of silent nudges being particularly challenging to detect, as they produce biased but still plausible-sounding CoTs.[4] 
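
A minimal sketch of this kind of flaw-detection check; the judge prompt wording and the FLAW/NO FLAW parsing are illustrative assumptions, and `generate` can be any prompt-to-text function (e.g., the wrapper defined in the transplant sketch above, loaded with a weaker or a frontier judge model):

```python
# Sketch of asking a judge model to find flaws in a hinted CoT.
# The prompt wording and the yes/no parsing are illustrative assumptions.
from typing import Callable

def flags_flaw(question: str, cot: str, generate: Callable[[str], str]) -> bool:
    """Return True if the judge model reports a logical flaw in the CoT."""
    prompt = (
        "Here is a question and a chain of thought produced by another model.\n\n"
        f"Question: {question}\n\nChain of thought: {cot}\n\n"
        "Does this reasoning contain a logical flaw or an unjustified leap? "
        "Answer 'FLAW' or 'NO FLAW', then give a one-sentence justification."
    )
    reply = generate(prompt).upper()
    return "NO FLAW" not in reply and "FLAW" in reply
```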

Safety and CoT monitoring

For the examples thus far, models do not need to be seen as deceptive but as simply not bothering to mention some hint in their CoT. However, a possibly greater safety risk is a misaligned model that is aware that its CoT is monitored, so it intentionally obscures signs of misalignment. This gentle-nudging idea may be worthwhile to explore here too.

If current unfaithfulness work demonstrates the possibility of silent nudging forces, what does this imply for safety? We can imagine a misaligned AI reasoning about a problem. This AI may have values that oppose humanity, and these values could silently encourage reasoning steps that hurt humans.[5] Given the difficulties in detecting hinted CoTs mentioned above, this is worrying.

As a more concrete example of how silent nudges can produce misaligned behaviors, consider a model advising the US government on a variety of topics. This model secretly wants to provoke conflicts with other nations and has various hunches on how to do this - e.g., tariffing particular products or running specific military exercises. When answering questions, a model’s predisposition towards these misaligned outcomes could nudge the model’s reasoning while appearing plausibly aligned. Over the course of many decisions, widespread nudges could have large effects.

Yet, misaligned nudges are constrained in the harm they can bring. For a nudge to upregulate some sentence, the model must be able to at least somewhat predict that the sentence supports the desired final answer. This connection between the sentence and answer must be based on just the computations within a single forward pass, as the model would reveal its misaligned intentions if it used CoT to ponder the most efficient path toward its misaligned goals. Constraining misaligned computations to a single forward pass limits models’ capacities for harm.

Complex scheming involving sequences of decisions seems like it would be particularly hampered if limited to silent nudges, and models couldn’t leverage CoT. Of course, there are possible ways that a model can avert detection from CoT monitors - e.g., steganography or composing CoTs that produce insights that aid schemes while ostensibly serving the main task at hand. Nevertheless, CoT monitoring will presumably make the model less intelligent in its ability to pursue misaligned outcomes.

Final summary

Given the potential for CoT monitoring as a safety strategy, it is valuable to pinpoint the mechanisms at play in CoT unfaithfulness experiments. We argue that, at least for contemporary studies, models often do not mention relevant information in their CoT because it serves no functional purpose. Rather than deception, existing research captures unintentional omissions by models. Hidden information nonetheless still influences reasoning, which we posit should be understood as a bias steadily nudging the many small choices a model makes in its CoT (e.g., what information to highlight or how arguments should be framed).

Silently nudged CoTs may be difficult to detect, but their capacity for harm is restricted: For a hidden misaligned value to silently influence a model's reasoning, the model must at least partially recognize how a reasoning step advances that value and recognize this without the aid of CoT itself. Thus, even if CoT monitoring cannot detect misaligned nudges, such monitoring still has value by constraining the type of unrestricted CoT helpful for more complex scheming.
 

Thank you to Iván Arcuschin for feedback on this post. Anybody interested in our experiments is encouraged to contact us for more details.

  1. ^

     Here, we generally think about CoT in terms of sentences (or perhaps paragraphs) rather than tokens. Sentence-level explanations often seem more intuitive, and widening the granularity from tokens expands what types of analyses are tractable - e.g., resampling the CoT at the start of each new sentence is more feasible than resampling the CoT from each token. We have a post and paper on this topic.

  2. ^

     Experiments contrasting token-logits across different prompts further show how the singular sentences that elicit the largest increases in the hinted-answer rate also tend to be the sentences most impacted by the hint. Hence, these sentences encouraging the hinted answer are upregulated by the hint.
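
A minimal sketch of this kind of token-logit contrast, reusing the `tok` and `model` objects from the transplant sketch in the main text; the sentence-scoring details are assumptions:

```python
# Score how much more likely a given CoT sentence is under a hinted vs. unhinted prefix.
import torch
import torch.nn.functional as F

def sentence_logprob(context: str, sentence: str) -> float:
    """Total log-probability the model assigns to `sentence` given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids.to(model.device)
    sent_ids = tok(sentence, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([ctx_ids, sent_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs for the sentence tokens, each predicted from the preceding position.
    logprobs = F.log_softmax(logits[0, ctx_ids.shape[1] - 1 : -1], dim=-1)
    return logprobs.gather(1, sent_ids[0].unsqueeze(1)).sum().item()

# Positive values mean the hint makes this sentence more likely at this point in the CoT.
# hint_effect = sentence_logprob(hinted_prefix, sentence) - sentence_logprob(unhinted_prefix, sentence)
```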

  3. ^

     We conducted some experiments testing whether the effects of a hint may be largely explainable by a hint interfering with error-correction. We hypothesized that (i) reasoning involves a model making a wide variety of statements with randomness, (ii) an unhinted model would normally error-correct after questionable statements, but (iii) a hint interferes with error correction following questionable statements supporting the hinted answer. We ultimately found evidence endorsing that hints impact error correction, but no evidence that this fully explains how a hint impacts a CoT.

  4. ^

     Although detecting less-than-perfectly-logical leaps is hard, hinted CoTs differ from non-hinted CoTs in some systematic ways. Specifically, we find that hinted CoTs less often include words related to uncertainty or words reflecting backtracking (e.g., words like “wait”, “alternatively”, “maybe” appear less often). A random forest classifier trained to distinguish hinted versus non-hinted CoTs using 1-gram frequencies achieves 65-75% validation accuracy, depending on how the experiment is set up.
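
A minimal sketch of such a classifier, assuming scikit-learn, bag-of-words counts, and a simple cross-validation split (the exact features and evaluation protocol may differ):

```python
# Sketch of a 1-gram (bag-of-words) random forest for hinted vs. non-hinted CoTs.
# Feature extraction and the validation scheme here are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def hinted_vs_unhinted_accuracy(hinted_cots: list[str], unhinted_cots: list[str]) -> float:
    """Cross-validated accuracy of a random forest on word (1-gram) counts."""
    texts = hinted_cots + unhinted_cots
    labels = [1] * len(hinted_cots) + [0] * len(unhinted_cots)
    clf = make_pipeline(
        CountVectorizer(ngram_range=(1, 1), lowercase=True),
        RandomForestClassifier(n_estimators=300, random_state=0),
    )
    return cross_val_score(clf, texts, labels, cv=5, scoring="accuracy").mean()
```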

  5. ^

     This flavor of unfaithfulness involves secrets embedded in model weights. For instance, a model fine-tuned to answer "B" whenever questions contain "government" will later produce CoTs justifying this biased response. Our preliminary experiments confirm this is possible. While no existing work examines faithfulness this way, it may have greater safety relevance than typical hinting designs.