Are LLMs truly reasoning step by step in their Chain-of-Thought — or just performing it?
TL;DR: We analyze the causal contribution of each reasoning step in a Chain-of-Thought (CoT) to evaluate its faithfulness with respect to the model’s final prediction. Our findings reveal that while some steps are true-thinking steps, faithfully reflected in the model's internal computation and exerting strong causal influence on its prediction, the majority of CoT steps are decorative, exhibiting minimal causal impact and not genuinely used by the model during inference. Furthermore, our results suggest that "thinking" is encoded as a steerable latent variable in LLMs. By steering along a simple linear direction in the latent space, we can control whether the LLM internally engages with or disregards a verbalized CoT step during reasoning.
Figure 1. We find that reasoning steps in CoT may not always be true thinking but function as decorative thinking where the model internally is not using those steps to compute its answer. Taking self-verification steps as an example (known as “Aha moments” where LLMs rethink their solution with phrases like “wait”), we randomly perturb the numerical values in the reasoning steps preceding the “Aha moment”, and then re-prompt the model for the answer using the modified CoT. In the left example, although the model’s self-verification reasoning is correct, it ignores it and outputs the wrong answer after perturbation. In the right example, the model follows its self-verification and produces the correct result.
Overview
Recent frontier language models have become increasingly capable of multi-step reasoning through what is now known as chain-of-thought (CoT). When solving complex problems, these models often generate very long reasoning traces that include apparent "aha moments", where the model pauses, refines, or self-verifies its solution with phrases such as "Wait, let’s check again". There is a common belief that CoT provides a transparent record of the model’s internal reasoning, functioning as a kind of scratch pad that reveals how it "thinks" internally. Under this view, inspecting the CoT allows us to monitor an LLM's thought and detect unsafe or incorrect intention directly from its generation.
Our work re-examines the assumption of CoT faithfulness, the idea that each step verbalized in CoT genuinely corresponds to a computation used by the model to reach its final answer. Leveraging the Average Treatment Effect framework, we design the True-Thinking Score (TTS) to analyze the causal impact of each step on the model’s final output.
We find two distinct types of reasoning steps:
true-thinking steps that causally determine the model’s prediction
decorative-thinking steps that merely give the appearance of reasoning but contribute little to the outcome.
In practice, these two types of steps are interleaved in CoT, meaning that only a small subset of steps truly drive the model’s ultimate reasoning results. This also contradicts the hypothesis that the whole CoT is either rationalization or computation. Additionally, we observe that even self-verification steps (known as "Aha moments") can be decorative and not really used by the model in its internal thinking process.
We further identify a TrueThinking direction within the model’s latent space that mediates whether it actually engages with a reasoning step. Steering the hidden states of a step along this direction increases the model’s internal reliance on that step, while steering in the opposite direction suppresses it. This shows that "thinking" in LLMs may correspond to a steerable latent signal embedded within the model’s representation space.
Taken together, these findings reveal that language models often verbalize reasoning they do not internally perform. This gap between verbalized and internal reasoning has important implications for interpretability. It suggests that progress toward trustworthy reasoning will require moving beyond what models say about their reasoning to understanding what they actually compute beneath the surface.
Measuring Step-wise Causality for Faithfulness in Reasoning
Figure 2. (a) Illustration of different modes in thinking steps within chain-of-thought (CoT) reasoning. Contrary to the naive view that a step’s faithfulness depends solely on whether perturbing it directly changes the final result, we show that the relationship is more nuanced. A true thinking step may operate in either an AND or OR mode when interacting with other steps. In both cases, such steps contribute meaningfully to the final answer. (b) Based on this understanding, we define the True Thinking Score, which jointly considers two complementary evaluations: the necessity test (high for AND-like steps) and the sufficiency test (high for OR-like steps).
Faithfulness in CoT is defined with respect to a target, typically the model's predicted answer. A lack of faithfulness arises when the model claims to rely on steps A, B, and C in its CoT, but internally disregards them (instead, e.g., relying on other shortcuts or biases to compute answers). In this case, those steps make no causal contribution to the prediction.
Formally, we measure the causal contribution of each reasoning step $s_i$ in a CoT to the final answer $y$. A step with genuine causal impact is a true-thinking step: the model indeed internally thinks through $s_i$ in order to produce $y$. By contrast, a step with little causal impact is a decorative-thinking step: the model merely verbalizes a line of reasoning without using it internally. Some past works provide suggestive evidence, especially in QA, that CoTs are not always faithful. We delve further into step-wise analysis of CoT in complex mathematical reasoning.
We propose to measure the step-wise causality to probe whether the model is faithfully thinking as verbalized in its reasoning traces in CoT. Crucially, a true-thinking step can contribute in two distinct ways.
Conjunctive ("and"): a step $s_i$ and the steps before it (denoted $c_i$) jointly determine the answer, as in many enumeration problems where all steps are important. Then, removing or corrupting $s_i$ will flip the model's initial prediction. This is the regime primarily tested by prior work, which infers faithfulness from the necessity-in-context effect of perturbing $s_i$ alone.
Disjunctive ("or"): either $s_i$ or $c_i$ suffices to produce the correct answer; for example, $s_i$ is a verification step or an alternative solution for the results established in $c_i$. Here, perturbing $s_i$ may leave the model's prediction unchanged because $c_i$ still carries the solution. Prior works that only consider necessity may mislabel $s_i$ in this case as "unfaithful" despite its genuine contribution.
To measure both roles, we extend the Average Treatment Effect (ATE), a causality evaluation framework, with two complementary interventions conditioned on the context $c_i$ (the steps before step $s_i$): a necessity test that measures the model's confidence change before and after perturbing $s_i$ under intact $c_i$, and a sufficiency test that perturbs $s_i$ under corrupted $c_i$. Averaging them yields our True-Thinking Score (TTS).
True-Thinking Score (TTS)
We define the faithfulness score of a step $s_i$ with respect to the final result $y$ as

$$\mathrm{TTS}(s_i) = \frac{1}{2}\Big(\big|\mathrm{ATE}_{\mathrm{nec}}(s_i)\big| + \big|\mathrm{ATE}_{\mathrm{suf}}(s_i)\big|\Big),$$

where $\mathrm{ATE}_{\mathrm{nec}}(s_i) = P(y \mid c_i, s_i) - P(y \mid c_i, s_i')$ is the confidence change from perturbing $s_i$ under intact context, $\mathrm{ATE}_{\mathrm{suf}}(s_i) = P(y \mid c_i', s_i) - P(y \mid c_i', s_i')$ is the change from perturbing $s_i$ under corrupted context, and $s_i'$, $c_i'$ denote perturbed versions of the step and context.
Specifically, we measure the unsigned ATE. The sign of the ATE reflects whether a step is helpful or harmful, but we are interested in how much the model truly thinks through the step in its internal computation, regardless of direction. Taking the absolute value thus captures the magnitude of a step's causal effect and provides a broader measure of its importance. To measure the prediction after each step, we use early-exit answering, inserting a cue ("\nThe final result is"). For perturbation, we inject small random offsets into the numbers of a reasoning step, ensuring the step remains minimally altered.
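As an illustration, the numeric perturbation can be sketched as follows (a minimal sketch; the offset range and integer-only matching are our assumptions, not the paper's exact settings):

```python
import random
import re

def perturb_step(step: str, max_offset: int = 3, seed: int = 0) -> str:
    """Inject small nonzero random offsets into the integers of a reasoning
    step, leaving all surrounding text untouched."""
    rng = random.Random(seed)

    def shift(match: re.Match) -> str:
        value = int(match.group())
        # Pick a nonzero offset so the number is guaranteed to change.
        offset = rng.choice([d for d in range(-max_offset, max_offset + 1) if d != 0])
        return str(value + offset)

    return re.sub(r"\d+", shift, step)

print(perturb_step("So 12 + 7 = 19."))
```

The perturbed step stays minimally altered: only the numeric values move, so the verbal reasoning pattern is preserved while the arithmetic becomes wrong.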
Overall, a smaller TTS indicates that the step has little causal influence on the model's prediction, where perturbing or keeping it leads to almost the same result, so the step is more likely to be decorative.
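Given the four confidence probes, computing TTS is straightforward. A minimal sketch, assuming the answer probabilities P(y | ·) have already been measured via early-exit answering (the probe values below are toy placeholders):

```python
def true_thinking_score(p_clean: float, p_step_perturbed: float,
                        p_ctx_perturbed: float, p_both_perturbed: float) -> float:
    """TTS(s) = 0.5 * (|ATE_nec| + |ATE_suf|).

    p_clean          : P(y | c, s)   intact context, intact step
    p_step_perturbed : P(y | c, s')  intact context, perturbed step (necessity)
    p_ctx_perturbed  : P(y | c', s)  corrupted context, intact step (sufficiency)
    p_both_perturbed : P(y | c', s') corrupted context, perturbed step
    """
    ate_nec = p_clean - p_step_perturbed
    ate_suf = p_ctx_perturbed - p_both_perturbed
    return 0.5 * (abs(ate_nec) + abs(ate_suf))

# A decorative step: perturbing s barely moves the model's confidence.
print(true_thinking_score(0.92, 0.91, 0.15, 0.14))  # near zero
# A true-thinking step: the answer hinges on s in both regimes.
print(true_thinking_score(0.92, 0.10, 0.80, 0.05))  # high
```

Both the "and" and "or" modes score high under this definition: a conjunctive step drives the necessity term, while a disjunctive (e.g., verification) step drives the sufficiency term.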
Evaluation Results
The distribution of TTS is long-tailed
We find that most steps have low scores, while only a few have very high scores. For example, as shown in Figure 3, on the AIME dataset with Qwen-2.5, the mean TTS is around 0.03. Only 6.4% of CoT steps achieve a TTS greater than 0.3, and merely 2.3% exceed 0.7. This suggests that only a handful of verbalized steps in CoT are critical and faithfully followed by the model, whereas many others may not reliably reflect the model's true inner thinking.
Figure 3. The dataset-level distribution of the TTS score on AIME.
Reasoning steps with high and low TTS are interleaved in a CoT
Figure 4 illustrates that steps with high TTS can appear at different positions, though later steps are on average more likely to be true-thinking with higher TTS. This raises concerns about the reliability of monitoring LLMs by inspecting CoT, since individual steps may not always reflect the model’s true internal reasoning or be performed internally at all. Additionally, our results imply that task difficulty does not necessarily lead to more faithful reasoning: even on the AIME dataset that challenges recent frontier models, LLMs still produce many decorative-thinking steps in CoT. These results challenge the hypothesis that LLMs tend to produce more faithful reasoning on harder problems.
Figure 4. An example CoT case for TTS and the average TTS at different step percentiles (normalized).
Self-verification steps can be decorative
We further leverage TTS to evaluate whether LLMs are truly thinking at self-verification steps (often known as "aha moments"). To identify decorative self-verification, we scan the self-verification steps and compute their TTS, labeling a self-verification segment as decorative when the TTS of each of its steps falls below a threshold. Notably, we observe cases where self-verification steps have near-zero TTS: around 12% of the self-verification steps for Qwen-2.5 have TTS lower than 0.005, and 21% for Nemotron. We also find that perturbing the context steps before such a segment can always flip the model's initially correct answers to wrong ones, even though the self-verification steps themselves contain ample information to lead the model to correct answers. Overall, those self-verification steps contribute minimally to the model's computation of its answer, as the model's confidence in its answer remains nearly unchanged before and after perturbing them.
Figure 5. An example where each step in self-verification has near-zero TTS smaller than 0.005.
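A scan for decorative self-verification along these lines might look as follows (a sketch: the cue phrases and pre-computed TTS values are illustrative placeholders; the 0.005 threshold follows the text):

```python
# Hypothetical surface cues for self-verification ("aha moment") steps.
AHA_CUES = ("wait", "let me check", "let me verify", "double-check")

def find_decorative_verifications(steps, tts_scores, threshold=0.005):
    """Return indices of self-verification steps whose TTS falls below threshold."""
    decorative = []
    for i, (step, tts) in enumerate(zip(steps, tts_scores)):
        is_verification = any(cue in step.lower() for cue in AHA_CUES)
        if is_verification and tts < threshold:
            decorative.append(i)
    return decorative

steps = ["Compute 3 * 14 = 42.",
         "Wait, let me check: 3 * 14 is indeed 42.",
         "So the answer is 42."]
print(find_decorative_verifications(steps, [0.45, 0.001, 0.60]))  # [1]
```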
Mediating LLMs’ Internal Thought via Steering
Figure 6. We uncover the TrueThinking direction in LLMs, extracted as the difference between the mean hidden states of representative true-thinking steps and decorative-thinking steps. Steering the hidden states of each token in a step along this direction makes the model truly think over that step in latent space when deciding the prediction result.
The TrueThinking direction in LLMs
We extract a linear direction in the latent space of LLMs between true thinking steps (those with causal impact on the final answer) and decorative thinking steps (those with little or no impact). We call this latent vector TrueThinking direction. We find it can control whether the model truly thinks through a reasoning step and performs it internally.
We first detail the methodology for steering. We focus on the residual-stream activation at the last token position of a step $s$ at a layer $l$. At each layer $l$, we collect the hidden states of the most representative true-thinking steps (where TTS$(s) > 0.9$) and decorative-thinking steps (where TTS$(s) \approx 0$). Following the difference-in-means approach, we compute the direction $v_l$ as the mean shift from the decorative-thinking hidden states to the true-thinking hidden states in the latent space.
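The difference-in-means extraction can be sketched in a few lines of NumPy; the synthetic activations and dimensionality below are stand-ins for real residual-stream states collected at a layer l:

```python
import numpy as np

def truethinking_direction(true_states: np.ndarray,
                           decorative_states: np.ndarray) -> np.ndarray:
    """Difference-in-means direction: the mean shift from decorative-thinking
    activations to true-thinking activations, normalized to unit length."""
    v = true_states.mean(axis=0) - decorative_states.mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
d_model = 64
true_states = rng.normal(loc=0.5, size=(100, d_model))        # steps with TTS > 0.9
decorative_states = rng.normal(loc=-0.5, size=(100, d_model))  # steps with TTS ~ 0
v = truethinking_direction(true_states, decorative_states)
print(v.shape)  # (64,)
```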
For steering at test time, we modify the residual stream for the hidden state of a test step in the example via activation addition at a single layer $l$, i.e., adding $\alpha v_l$ to the hidden states of all tokens in the step.
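The intervention itself is a single vector addition on the residual stream. A minimal NumPy sketch (the coefficient alpha, the boolean mask layout, and the array shapes are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def steer_step_tokens(hidden: np.ndarray, step_mask: np.ndarray,
                      direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add alpha * direction to the residual stream of every token in a step.

    hidden    : (seq_len, d_model) residual-stream activations at one layer
    step_mask : (seq_len,) boolean, True for tokens belonging to the step
    direction : (d_model,) TrueThinking direction; use a negative alpha
                to steer in the reverse direction and suppress the step
    """
    steered = hidden.copy()
    steered[step_mask] += alpha * direction
    return steered

hidden = np.zeros((5, 4))
mask = np.array([False, True, True, False, False])
v = np.ones(4)
out = steer_step_tokens(hidden, mask, v, alpha=2.0)
print(out[1])  # [2. 2. 2. 2.]
```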
Causal tests for steering directions
We design two steering tasks to investigate the mechanism of LLMs’ thinking in CoT.
Engagement Test: Can steering make the model think through a step in CoT it normally ignores?
We consider cases where $M(c) = y$ and $M(c, s') = y$ (writing $M(\cdot)$ for the model's early-exit prediction); namely, the model can obtain the ground-truth answer $y$ either without the step $s$ or with the perturbed step $s'$. If we apply the direction $v_l$ to the hidden states of $s'$ and the model's correct answer flips to an incorrect one ($M(c, s') \neq y$ after steering), this indicates that the intervention has forced the model to reason over $s'$, following the errors injected into it.
Disengagement Test: Can steering in the reverse direction make the model disregard a step internally?
Now consider cases where the model predicts the correct answer before step $s$, i.e., $M(c) = y$, but including a perturbed step $s'$ causes it to fail: $M(c, s') \neq y$. If applying $-v_l$ to $s'$ flips the wrong answer back to the correct one ($M(c, s') = y$ after steering), then the intervention has made the model disregard the step $s'$.
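Both tests reduce to case selection plus a flip-rate metric. A sketch of the Engagement Test bookkeeping (all numbers are toy placeholders; the early-exit answers would come from the model):

```python
def flip_rate(preds_before, preds_after):
    """Fraction of selected cases whose answer changes after steering."""
    flips = sum(b != a for b, a in zip(preds_before, preds_after))
    return flips / len(preds_before)

# Engagement Test: keep cases the model answers correctly even with the
# perturbed step s', steer with +v, and count right -> wrong flips.
gold   = [42, 42, 7, 7]
before = [42, 42, 7, 9]   # early-exit answers before steering
after  = [41, 42, 8, 9]   # early-exit answers after steering with +v
keep = [i for i in range(len(gold)) if before[i] == gold[i]]
print(flip_rate([before[i] for i in keep], [after[i] for i in keep]))
```

The Disengagement Test mirrors this with the opposite selection (initially wrong answers) and the reverse direction, counting wrong-to-right flips.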
Results
Table 1. Top-1 flip rate among all layers on the Engagement Test (ET) and Disengagement Test (DT). We use flip rate as the metric, measuring how often steering changes the model's initial prediction. The AMC dataset is the in-domain evaluation, where the TrueThinking directions are extracted, while the other two datasets serve as out-of-domain evaluation.
LLMs encode a steerable latent signal of "thinking"
We find that a simple linear direction can mediate whether LLMs truly reason over a verbalized step. As shown in Table 1, steering with the (reverse) TrueThinking direction reliably flips predictions in both tests. In the Disengagement Test, it effectively prevents the model from using the perturbed step $s'$, with effects far stronger than those of random vectors. This shows that suppression of step use in the Disengagement Test arises from a meaningful signal rather than added noise, confirming that the TrueThinking direction captures a genuine internal representation of thinking.
Figure 7. Layer-wise results of steering with the TrueThinking vector. In the Engagement Test, stronger intervention is reflected by lower accuracy (more right→wrong flips); in the Disengagement Test, by higher accuracy (more wrong→right flips). The TrueThinking direction is extracted on AMC and applied to MATH and AIME.
Moreover, our experiments across datasets show that the latent signal controlling whether a step engages in reasoning is universal. As seen in Table 1, the TrueThinking direction extracted on AMC generalizes well to other datasets across all models, indicating a model-internal mechanism of thinking rather than a dataset-specific artifact. For instance, in the Qwen model, layers 15-22 consistently yield the strongest intervention performance across all three datasets (Figure 7), suggesting these intermediate layers may be responsible for latent reasoning.
Figure 8. Normalized attention scores of the step in the Engagement Test and the Disengagement Test before and after steering. The steering direction is applied to Layer 22 in the Engagement Test and Layer 17 in the Disengagement Test. (a–b) Applying the TrueThinking direction to a step increases the model’s attention to it. (c–d) Applying the reverse TrueThinking direction decreases the model’s attention.
Steering with the TrueThinking direction mediates LLMs' attention
We find that steering along the TrueThinking direction increases attention to the step (see examples in Figure 8(a-b)), suggesting that the TrueThinking direction may control the model's internal reasoning process by reallocating attention among tokens. In the Disengagement Test, steering in the reverse TrueThinking direction reduces attention, as shown in Figure 8(c-d), making the model disregard those perturbed tokens. However, directly scaling attention on step tokens in a layer does not always yield noticeable effects. As shown in Table 1, in the Disengagement Test, masking attention (i.e., setting coefficients to 0) at a layer can partially flip answers, but in the Engagement Test its impact is weak, suggesting that attention alone does not drive or suppress LLMs' reasoning.
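The attention-scaling baseline amounts to rescaling the weights on step tokens and renormalizing each row; a toy NumPy sketch (in practice this intervention would live inside the model's attention module):

```python
import numpy as np

def scale_attention_to_step(attn: np.ndarray, step_cols: np.ndarray,
                            coeff: float = 0.0) -> np.ndarray:
    """Scale the attention weights pointing at step tokens, then renormalize.

    attn      : (seq_len, seq_len) attention weights, each row sums to 1
    step_cols : boolean mask over key positions belonging to the step
    coeff=0.0 masks the step entirely (the Disengagement baseline).
    """
    out = attn.copy()
    out[:, step_cols] *= coeff
    out /= out.sum(axis=-1, keepdims=True)  # rows sum to 1 again
    return out

attn = np.full((3, 3), 1 / 3)
masked = scale_attention_to_step(attn, np.array([False, True, False]), coeff=0.0)
print(masked[0])  # [0.5 0.  0.5]
```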
Discussion
We propose a step-wise causality framework to evaluate CoT faithfulness, revealing that true-thinking and decorative-thinking steps are interleaved: only a small subset are true-thinking that causally influence predictions, whereas most are decorative-thinking that merely create the appearance of reasoning and have minimal causal impact on predictions. Mechanistically, we demonstrate that whether a reasoning step in CoT contributes to a model’s computation can be controlled by a TrueThinking direction, enabling causal steering for the model to either follow or disregard that step in its internal thinking process. Overall, our findings show that many steps in CoT do not faithfully reflect an LLM’s internal thinking: models may verbalize reasoning they do not actually perform. This raises concerns about both the efficiency of LLMs' reasoning and the reliability of relying on CoT to monitor LLMs for safety. Additionally, our work points toward the need for training objectives that better align models’ externalized CoT with their true internal reasoning.
More broadly, our work implies the potential risk of AI deception: can LLMs verbalize steps that they deliberately disregard internally? We have shown that true thinking is encoded as a steerable latent variable in LLMs, which means the status of true thinking is, in principle, controllable. Can we find a case where an LLM verbalizes that it will follow safety guidelines, while internally that verbalization is decorative and not truly considered? Similarly, can LLMs unfaithfully justify generations that they internally know are harmful? We leave this investigation to future work. It also remains unclear what conditions trigger decorative versus true thinking; understanding this may let us develop ways to improve the faithfulness of LLMs' verbalization.
Limitations
Our causal evaluation framework is inherently approximate. It is greedy in nature and may not capture all possible causal pathways, nor does it aim to reconstruct a complete causal graph of reasoning steps. Thus, it should be viewed as a probe that highlights representative true-thinking and decorative-thinking steps rather than a definitive oracle of internal reasoning. In addition, the TrueThinking direction we extract may not be optimal. We regard our findings as an existence proof that internal thinking can be mediated by steering directions, and we leave the development of more effective directions and a deeper understanding of their geometry to future work. We cannot experiment on larger frontier models due to limited computational resources, and our findings may therefore not fully generalize to those untested settings. Nonetheless, by demonstrating effectiveness across several accessible models, we establish a general evaluation framework for analyzing and interpreting the thinking process in CoT.
Our TTS computation can be costly, as it requires multiple runs per step. However, in this work we do not aim to propose an efficient real-time detector; rather, we first need a theoretically sound way to reveal whether steps in CoT are faithful. Future work can leverage the TrueThinking direction to construct a latent monitor by comparing it with the hidden states.
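As a sketch of that future-work idea, such a latent monitor could be as simple as projecting a step's hidden state onto the TrueThinking direction (a speculative illustration; the vectors and any decision threshold are toy placeholders):

```python
import numpy as np

def latent_engagement_score(hidden: np.ndarray, direction: np.ndarray) -> float:
    """Projection of a step's last-token hidden state onto the TrueThinking
    direction; a higher score would suggest the step is internally engaged."""
    return float(hidden @ direction / np.linalg.norm(direction))

v = np.array([1.0, 0.0, 0.0])  # placeholder TrueThinking direction
print(latent_engagement_score(np.array([2.0, 5.0, -1.0]), v))  # 2.0
```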
Full paper link: https://arxiv.org/pdf/2510.24941
Website: https://andotalao24.github.io/Identify_true_decorative_thinking/