Analysing CoT alignment in thinking LLMs with low-dimensional steering