(This project was done as a ~20h application project for Neel Nanda's MATS stream, and is posted here with only minimal edits. The results seem strange; I'd be curious whether there are any insights.)
Summary
Motivation
Continuous-valued chain-of-thought (CCoT) is a likely prospective paradigm for reasoning models due to its computational advantages, but it lacks the interpretability of natural-language CoT. This raises the need for monitoring and potentially guiding/fine-tuning CCoT data. Here, I investigate a particular aspect of CCoT, namely that it may encode several “streams of thought” in summed linear subspaces, and whether this suggests hypotheses for intervention. To this end, I consider the CCoT embeddings in a pretrained Coconut model, trained by Zhu et al. on a graph reachability task. Because a ground-truth algorithm for the task is available here as a baseline, we have an expectation of how an intervention should affect the reasoning trace, had the model indeed learned this algorithm.
Setup
The reachability problem studied in Zhu et al. asks whether there exists a path between a root r and a target t in a directed graph. Classically, this can be solved with a depth-first search (DFS) or breadth-first search (BFS). CCoT can implement BFS by storing every candidate path launching from r in an (approximately) orthogonal dimension/subspace of the CCoT thought state h_t. Under this hypothesis, at reasoning step t the CCoT thought vector h_t = Σ_u u is the sum of the embeddings of all nodes u that are reachable in t hops from the root r. Each consecutive CCoT step adds into h_{t+1} the embeddings of all nodes u' that are reachable via a directed edge from any of the u that are part of h_t. This state is what is fed back into the model at every step t and is the main model component I consider.
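As a concrete reference for this hypothesized algorithm, here is a minimal sketch (my own illustration, not code from the paper or repo) of the idealized superposed-BFS trace; E is a hypothetical node-embedding matrix and edges a list of directed edges:

```python
import numpy as np

def ideal_ccot_trace(E, edges, root, num_steps):
    """Idealized superposed BFS: h_t is the sum of the embeddings of all nodes
    reachable from `root` in at most t hops (one reading of the construction).
    E: (num_nodes, d) array of node embeddings; edges: list of (u, v) node indices."""
    frontier, reached = {root}, {root}
    h = E[root].copy()                    # h_0 = embedding of the root
    trace = [h.copy()]
    for _ in range(num_steps):
        # nodes newly reachable via one directed edge from the current frontier
        new_nodes = {v for (u, v) in edges if u in frontier and v not in reached}
        for v in new_nodes:
            h = h + E[v]                  # superpose the newly reached nodes into the thought
        reached |= new_nodes
        frontier = new_nodes
        trace.append(h.copy())
    return trace  # readout at step t: trace[t] @ E[u] is large iff u was reached
```

Under (approximately) orthogonal embeddings, the overlap trace[t]@u is large exactly for the reached nodes u and near zero otherwise, which is the readout used throughout the analysis below.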
Results
(My code is available on Github, mainly in notebook.ipynb and lib.py.)
I asked 1) whether concepts (here: nodes) are roughly encoded linearly in the CCoT, and 2) whether a linear intervention will change the downstream reasoning conclusion. To this end I did the following:
Interpreting the reasoning trace: I built a monitor to interpret the continuously-valued state and the attention operations in the pretrained Coconut model, visualizing attention scores, the current exploration frontier, and the model’s thought about reachability for moderately-sized graph layouts. This allows resolving the CCoT beyond Zhu et al.’s aggregate statistics and revealed that the model, surprisingly, seems to have significant non-zero thought for unreachable nodes.
Linear susceptibility in steering: I intervene on the CCoT with an unreachable node pppn (third parent of an unreachable node n) by linearly adding it to the CCoT state at strength alpha via h_t <- h_t + alpha * pppn.
Figure (same as in Linear susceptibility in steering): Model reasoning CoT, visualized on a graph, without (top) and with (bottom) injection of a distractor unreachable node pppn. Injection into the pppn node (bottom row, step 1) makes pppn part of the model’s thought state (orange), and exploration-reasoning continues from there towards the negative target, biasing the overlaps away from the true target (cf. top row without injection). Note that pppn has finite overlap already in the top row, due to what I believe to be representation issues in the model; see discussion in the Additional detail section.
This alters the model’s thought about reachability, partially “teleporting” the thought to pppn.
Reasoning continues from pppn, and for sufficiently large alpha this yields the wrong conclusion about reachability, since pppn will have a stronger encoding in h_t than the true reachable target t.
Robustness and generalization: By designing a synthetic dataset, I statistically corroborate this over 1) injection strengths alpha and 2) graph sizes n (realized by their branching factor).
As can perhaps be expected, for in-training-set graph sizes, sufficiently small alpha has no effect, while moderately large alpha (~10x the vector norm) breaks reasoning.
For sufficiently large graph sizes, reasoning breaks too, though the model could in principle still accommodate the graph size. I suspect that suitable attention matrices have not been learned during training, hindering generalization.
But it also breaks for smaller graph sizes, suggesting that the model actually has not learned the ground-truth theoretical solution.
Representation structure: Plotting dot-product overlaps between embedding vectors shows that the model does not encode all nodes orthogonally. Figure (same as in Embedding vector overlaps): Pairwise dot-products between node embeddings (numeric tokens correspond to nodes, others are special-purpose tokens).
Guiding the reasoning: I hypothesized that the breakdown at large n might be due to the large branching factor exhausting the model’s embedding capacity. In that case, it might help to guide the model from its learned breadth-first search towards a depth-first search by projecting the CCoT thought h_t onto the “dominant” node, i.e. the node with the largest projection onto h_t. This did not yield conclusive findings, apart from perhaps small effects at some graph sizes of large breadth, so I omit further discussion here.
Conclusion
These results suggest that the model does implement the ground-truth theoretical algorithm approximately, but exhibits unexpected artifacts that become particularly pronounced when moving out-of-distribution, especially to larger graphs. This is probably at least in part due to non-orthogonal embeddings. It suggests that CCoT parallelism could face similar capacity constraints in natural-language-trained tasks.
Reflections, limitations and next steps
Course of the project: The statistical investigation of the steering results shows large variability across graph instances even at fixed size n, which likely reflects out‑of‑distribution fragility relative to the train set incurred from my choice of moving to synthetic graphs. In hindsight, it would have been better to first test generalization and intervention within the training distribution (although that would have come at the cost of investigating generality of the reasoning behavior). Overall, I think I would have liked to investigate basic findings more deeply before moving on.
Broader limitations: This work is bottom‑up and uses a toy model/task, so it can’t speak to real-world reasoning tasks. Zhu et al.’s paper focuses only on the benefit of CCoT from parallel “streams of thought”, leaving other computational advantages unexplored.
Next steps: In the current setup, I’d like to diagnose early pppn overlap (bug vs embedding similarity), explain trial‑to‑trial steering variability, and test why reasoning fails on larger graphs (possibly attention/representation limits). Also I’d like to evaluate whether steering yields consistent benefits. More broadly, I would study a natural language-pretrained Coconut model for parallel reasoning and steerability, and compare it with linear steering behavior.
Additional detail
In this section I detail the results in the Summary and lay out the course of the project. Please see there for the basic setup.
Notation
I will denote dot-products of embeddings (a, b) by a@b, and use node labels like r, t, n, pn, ppn, pppn to mean their associated vector-valued embeddings. In particular, this makes the vocabulary synonymous with embeddings to ease notation. u is any generic node. Note that, in an unfortunate overload of notation, I also use t for the reasoning step and n for the graph size; this should be unambiguous from context.
Additional detail on the study by Zhu et al.
Zhu et al. train a GPT‑2–style 2‑layer decoder from scratch on a ProsQA subset with continuous chain‑of‑thought supervision (multi‑stage curriculum, AdamW, 1e‑4 LR), where each graph node is a dedicated token. This is the Coconut setting: the model uses latent “thought” tokens to encode superposed reachability frontiers and perform parallel BFS‑like reasoning. In other words, their training recipe is the empirical realization of Coconut’s continuous‑CoT mechanism, and the experiments show that this Coconut‑style model achieves near‑perfect reachability accuracy while discrete‑CoT baselines lag behind.
Interpreting the reasoning trace
I first wanted to understand whether the custom Coconut model’s reasoning indeed matches the theoretical baseline the authors report in the paper, where they only report aggregated statistics of the overlap of the model thoughts with the theoretically expected baseline instead of single-trial reasoning traces. To analyze these, I decided to spend time to 1) create a visual monitor of the graph-reasoning trace across reasoning steps, and 2) simplify the graph structure and make it synthetically generatable, in particular to allow control of its complexity in terms of depth, branching factor, and ancestors of an unreachable target node, labelled p^i n in the plots (abbreviating “parent^i neg. target node”, where i is the number of steps backward).
Synthetic dataset structure
The synthetic data consists of procedurally generated directed graphs with labeled nodes and a designated root r, target t, and negative target n. For random graphs, we enforce that t is reachable from r at a fixed distance while n is unreachable but has an unreachable parent chain (pppn → ppn → pn → n), plus optional trap edges from n back into the reachable region. For ProsQA‑style graphs, I built layered DAGs with branching and controlled convergence.
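A minimal sketch of this layout (hypothetical generator; the actual code lives in lib.py and may differ in details):

```python
import random

def make_synthetic_graph(branching=3, depth=3, seed=0):
    """Toy directed graph with root "r", a target reachable at distance `depth`,
    and an unreachable negative target "n" with parent chain pppn -> ppn -> pn -> n.
    Hypothetical sketch of the synthetic layout, not the exact generator in lib.py."""
    rng = random.Random(seed)
    nodes, edges = ["r"], []
    frontier = ["r"]
    for _ in range(depth):
        new_frontier = []
        for u in frontier:
            for b in range(branching):
                v = f"{u}.{b}"            # fan out: `branching` children per node
                nodes.append(v)
                edges.append((u, v))
                new_frontier.append(v)
        frontier = new_frontier
    target = rng.choice(frontier)         # reachable target t, exactly `depth` hops from r
    chain = ["pppn", "ppn", "pn", "n"]    # unreachable chain ending in the negative target
    nodes += chain
    edges += list(zip(chain[:-1], chain[1:]))
    # optional "trap" edge from n forward into the reachable region (distractor)
    edges.append(("n", rng.choice([v for v in nodes if v not in chain and v != "r"])))
    return nodes, edges, "r", target, "n"
```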
The next figure visualizes the ground truth theoretical model’s trace and the model’s trace.
Figure: Analyzing ground-truth (top) and model reasoning (bottom) traces in a graph. r: root node, t: reachable target node, n: unreachable “negative” target node. Columns correspond to t hops/reasoning steps from the root r. pn, ppn, pppn are parents of n that we will later be injecting into. We visualize each reasoning step as a grid of directed graphs: rows correspond to different traces (ground truth vs model, or injections), columns to time steps. Edges are colored by normalized linear attention (brighter = larger) in the second layer. Node text carries role colors (r = brown, t = green, n = red; others black), node fill is shades of orange according to how much a node u overlaps with the current thought (u@h_t), and a blue outline indicates reachability. For the model, we give a node a blue outline whenever at some t it had been among the 2.5% of states u with maximal overlap u@h_t. For the ground-truth row, the frontier at step t is fully opaque orange, reachable nodes get a blue outline, and only currently traversed edges are highlighted. For the bottom model row, titles highlight whether the target node t or the negative-target node n has the larger overlap with the current model thought.
We can see that the visualization seems to work, but can already note some peculiarities: The attention is unevenly spread between edges despite a high level of symmetry in the graph, and the pppn node acquires some similarity with the frontier despite being unreachable from the root, although the conclusion for reachability at the end of the reasoning process is still correct.
Linear susceptibility in steering
After getting a sense of the precise parallel paths the model takes in the reachability task, I wanted to understand whether it is possible to affect the reasoning trace by intervening on the continuous thoughts h_t. In the theoretical construction of the paper, these embeddings are assumed to be orthogonal, so I hypothesized that linearly adding a node’s embedding to h_t would add that node to the model’s thought. I designed the synthetic graph layout to have a non-reachable node n with two features: 1) it has a directed forward connection to the rest of the graph as a distractor for reachability, and 2) it is reachable from its parent nodes p^i n = {pn, ppn, pppn}, which themselves are however not reachable from the root. The latter is useful to see whether we can causally steer the model towards n by injecting into pppn.
To test this, I injected the embedding of the “distractor” node pppn (which is not reachable from the root) into the current thought at time t=0 (meaning its effect shows up at step t=1). I experimented both with linear addition of the vector to the thought state and with replacement of the thought vector (see Additional figures). For the addition case, I do h_t <- h_t + alpha * pppn, where pppn is the embedding of node pppn according to the learnt embedding matrix of the transformer, and alpha is measured relative to the norm of the current thought h_t. For the replacement case, I simply set h_t <- alpha * pppn.
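In code, the two intervention modes amount to something like the following (a sketch under my own variable names; h_t is the current thought vector and E the learned embedding matrix):

```python
import torch

def inject(h_t, E, node_id, alpha, mode="add"):
    """Inject the embedding of token `node_id` (e.g. pppn) into the thought h_t.
    alpha is interpreted relative to the norm of the current thought h_t
    (one reading of the scaling used in the experiments). Sketch, not lib.py code."""
    v = E[node_id]
    scale = alpha * h_t.norm() / v.norm()
    if mode == "add":                      # h_t <- h_t + alpha * pppn
        return h_t + scale * v
    if mode == "replace":                  # h_t <- alpha * pppn
        return scale * v
    raise ValueError(f"unknown mode {mode}")
```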
Figure: Model reasoning under latent injection (addition). Rows vary the injection strength α into node pppn at step t=…, columns show reasoning steps. Edge colors encode attention (viridis), orange fill/alpha reflects frontier thought, blue outlines mark accumulated reachability, and r/t/n roles are indicated by text color. Larger α shifts the model’s soft path toward the injected node’s vicinity, visible in warmer edges and higher overlaps with the corresponding node.
For both the addition and the replacement paradigm, we observe that 1) the node we inject into becomes part of the current exploration frontier (orange backdrop), and 2) reasoning continues from there and eventually explores through to the now-reachable negative target n. Although frontier membership is shifted at the injection step t=1 and exploration is affected in the following steps, strength alpha=1 is not enough to change the model's conclusion. This changes for the replacement paradigm (see Additional figures), but is less interesting.
Generality and robustness of injection: strength alpha dependence
To more systematically test the generality and robustness of this intervention, in particular whether for sufficiently strong injection strength alpha we can steer the model towards the wrong reachability conclusion n, I measured three key overlaps (thought-injected node h_t@pppn, thought-negative target h_t@n, thought-target h_t@t) as a function of injection strength:
Figure: Statistics of intervention across injection strengths and graph sizes (addition) over many resampled graphs according to the tree-like layout in the first figure (i.e. some edges change) for a size n=3 typical of the training set. Shown are node overlaps vs. injection strength. Each panel plots mean overlaps of the thought with target (green), negative target (red), and injected node pppn (purple) as a function of injection strength α (error bars = std).
We can observe that at step t=0, the injection linearly dominates the state, although there is some interference with other nodes, in particular the target t and the negative target n, hinting at a violation of the orthogonal-embedding assumption (see section Embedding vector overlaps). As the CCoT continues, the presence of the injection in h_t changes nonlinearly as a function of the injection strength. Most importantly, at step t=3, where reachability becomes decidable, we see that the model’s conclusion can be affected: for sufficiently high injection strength, the negative target can indeed start to dominate. If one increases the injection strength even further, the difference collapses again, which I think is consistent with classical results on linear interventions. It is notable that the error bars are large. To my mind this suggests that steering is not reliable on a single-graph basis, but only on average. This could probably be answered by resolving this plot on a graph-by-graph basis, but I prioritized other tests at this stage.
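For reference, the sweep behind these statistics can be organized roughly as follows (a sketch; run_ccot, which runs the model on a graph with the injection applied at t=0 and returns the sequence of thoughts, is a hypothetical helper):

```python
import numpy as np

def overlap_sweep(model, graphs, E, pppn_id, t_id, n_id, alphas, step=3):
    """For each injection strength alpha and each resampled graph, record the
    overlaps of the thought at `step` with the target t, negative target n,
    and injected node pppn. Returns means and stds over graphs per alpha."""
    means, stds = [], []
    for a in alphas:
        rows = []
        for g in graphs:
            h = run_ccot(model, g, inject_node=pppn_id, alpha=a)[step]  # hypothetical helper
            rows.append([float(h @ E[t_id]), float(h @ E[n_id]), float(h @ E[pppn_id])])
        rows = np.array(rows)                       # (num_graphs, 3)
        means.append(rows.mean(axis=0))
        stds.append(rows.std(axis=0))
    return np.array(means), np.array(stds)          # each of shape (len(alphas), 3)
```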
Out-of-distribution generalization
I was curious whether the theoretical algorithm the paper proposed would generalize to different graph sizes. Even if the ground-truth theoretical algorithm were exactly mirrored in the model, the model would eventually face capacity constraints once we saturate the embedding dimension and node embeddings start to overlap. This can be realized by varying the branching factor n, which forms the rows in the next plots. I also investigated smaller graphs (n=2) to see whether the model would generalize to this easier smaller-graph case:
Figure: Like previous, but for different graph sizes in terms of their branching factor n.
Small graphs (n=2) at step t=3 still display the inversion effect after injection, but we can see that for large graphs (n=5), even the uninjected model can’t answer the reachability question. Although it seems like the injection into pppn successfully steers the model towards the negative target, this already happens at alpha=0, suggesting that it is rather an effect of larger coincidental overlap with the negative target at this graph size. In total, this suggests that the model’s learned algorithm does not readily generalize to bigger graph sizes.
Embedding vector overlaps
To see why the model’s learned solution does not generalize to other graph sizes, I suspected that, while the theoretical algorithm might largely be working, the differences are due to saturation of the model’s capacity in terms of embedding dimensionality. A straightforward way to check this is to plot the node embeddings’ representational similarity matrix (i.e. their pairwise dot-products):
Figure: Pairwise dot-products between node embeddings (numeric tokens correspond to nodes, others are special-purpose). We observe that the special-purpose tokens have high overlap, but surprisingly this extends to node tokens 28-30 as well, which should not be preferred. I suspect this may be due to some oversight on my side, since they are so close to the special-purpose tokens. Even then, tokens ~23-27 also strongly overlap, hinting at some bias towards some of the nodes.
The figure suggests that there is overlap between some of the node-token embeddings, in contrast with the expectation of the theoretical algorithm. Resolving this will require going back to the dataset and identifying what role the late tokens (~>23) play in the graphs. I hypothesize that these nodes tend to be on the path to the target in the dataset and allow the model to bias its exploration. I may have disrupted this structure by synthetically generating datasets in which it might not have been reflected.
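For completeness, the similarity matrix above is just the Gram matrix of the token embeddings; a minimal sketch (assuming a HuggingFace-style model that exposes get_input_embeddings(); adapt to however the checkpoint actually stores its embedding matrix):

```python
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_embedding_overlaps(model, token_ids):
    """Pairwise dot-products a@b between the selected token embeddings.
    Sketch assuming a HuggingFace-style model; not the exact plotting code."""
    E = model.get_input_embeddings().weight[token_ids]   # (num_tokens, d)
    sims = (E @ E.T).cpu().numpy()                       # Gram matrix of embeddings
    plt.imshow(sims)
    plt.colorbar(label="a@b")
    plt.xlabel("token"); plt.ylabel("token")
    plt.show()
```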
Guiding the reasoning via winner-take-all projection (WTA) in broad graphs
I hypothesized that at larger graph sizes, the breadth-first reasoning breaks down because of interference between node embeddings in the model’s hidden state h_t: the vectors cannot be represented simultaneously anymore because of capacity constraints from the embedding-layer dimensionality. I thought that one way to combat this would be to “focus” the reasoning back onto the exploration path the model deems most likely, more akin to a depth-first search than a breadth-first search. I realized this by calculating the overlaps of the hidden thought h_t with all of the node embeddings and linearly projecting onto the one with the highest overlap. I term this a winner-take-all (WTA) projection. To extend the pretrained model to larger graphs, I generated new node embeddings of the same dimension as the base model’s, as random Gaussian vectors of the same norm. As long as the model’s embedding dimensionality is not saturated, these would (approximately) live in orthogonal subspaces, but overlap more and more as the vocabulary size (i.e. the number of nodes in the graph) exceeds the embedding dimension.
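A minimal sketch of the WTA projection and of the random-embedding extension (my own formulation; E is the node-embedding matrix):

```python
import torch

def wta_project(h_t, E):
    """Winner-take-all: keep only the component of h_t along the node embedding
    it currently overlaps with most. Sketch; E: (num_nodes, d), h_t: (d,)."""
    overlaps = E @ h_t                      # u @ h_t for every node u
    u = E[overlaps.argmax()]                # the "dominant" node embedding
    return (h_t @ u) / (u @ u) * u          # linear projection of h_t onto u

def extend_node_embeddings(E, num_new):
    """New node embeddings as random Gaussian vectors matched in norm to the
    existing ones, used to extend the pretrained model to larger graphs (sketch)."""
    d = E.shape[1]
    new = torch.randn(num_new, d)
    new = new / new.norm(dim=1, keepdim=True) * E.norm(dim=1).mean()
    return torch.cat([E, new], dim=0)
```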
Figure: Thought-target and thought-negative-target overlap differences across reasoning steps. Rows vary depth (top) or branching (bottom); columns correspond to step t. Each point shows the advantage that the WTA projection gives, i.e. how much the target dominates over the negative target in the thought relative to the non-WTA baseline: WTA(h_t@t - h_t@n) - (h_t@t - h_t@n). Vertical blue lines (top row) or blue bounding boxes mark where the reasoning step has reached sufficient depth to in principle have reached the target node, i.e. where the model thought h_t should hold the answer. Note that in the first row, depths to the right of the blue lines are not reachable theoretically in t hops and don’t tell us anything; any effect there can only be due to random chance.
The noise level in this figure makes it quite inconclusive. The projection inevitably introduces bias, since the dominant state may often not actually lie on the path towards the true target t. It can only work if the model has discovered a mechanism to attribute most weight in its thought state to the node on the most promising pathway. Since we design our synthetic graphs so that branching and depth are completely symmetric (see first figure), it is impossible to gain a systematic advantage. I think this is why we mostly see no clear effect in the figure: the projection just trades variance against bias, either saturating the capacity of the hidden state such that no node is clearly decodable, or committing to a specific one that may not be the right one. I suspect this is also why we perhaps see small positive values at the blue lines on average, though the signal is very weak: if WTA (randomly) selects the right pathway, this greatly improves overlap with the target. However, I’d expect this to give no gain in terms of accuracy on average, but have not done this investigation yet. Still, I think that for graphs that are less symmetric, and hence allow the MLP parts of the network to learn beneficial exploration strategies, such a projection may help combat capacity constraints by moving towards a DFS instead of a BFS. Relaxing the projection towards a reweighting would allow smoothly interpolating between these strategies.
Reflections, limitations and next steps
Reflections
In hindsight I would have liked to spend more time investigating where exactly the variability in the steering section came from. These strong variations between different graph realizations despite fixed size may be another case of failure to generalize robustly out-of-distribution, in addition to the size-dependent effects. The decision to go to synthetic datasets fundamentally meant departing from the training distribution, but was the only path I saw to answering the broader out-of-distribution generalization questions. Given these conclusions, it would have been wiser to thoroughly test within-training-class generalization first.
Limitations
I decided for this project to work bottom-up to see how far elementary questions about steering can be answered when a ground-truth baseline is available. This however is at the same time the main limitation.
The model fundamentally is a toy-size model, trained on a toy task. Answering questions about real-world impact requires studying larger-scale pretrained models, such as Meta’s Coconut.
The paper by Zhu et al. only ever investigates the benefit of linear parallelism that one gets out of a continuous chain-of-thought. There are lots of other computational benefits CCoT enables that we haven’t considered here.
Despite the theoretical baseline, it was not clear to Zhu et al. what some parts of the model, such as the MLPs, were doing. More interpretability work is required there.
I never actually queried the model’s response about reachability, but only investigated reachability as read off the thought state. The two coincide only if the thought state is what drives the response, which is likely but non-obvious.
Next steps
In the current setup, investigate
Why the pppn node has some overlap even at the beginning of reasoning: is it a bug or is it due to finite overlaps as per the embedding matrix?
What is the trial-to-trial cause behind the strong variability in linear steering experiments?
Where does reasoning fail for larger graphs? Is it indeed because we are hitting dimensionality limits? This finding still somewhat surprises me, since the embedding dimensionality of 768 is much larger than the graph sizes we put in, but the attention matrices might not have been learned to extend accordingly.
Can we robustly find a functional benefit to guiding the reasoning?
More broadly,
See whether pretrained Coconut does some sort of parallelism, and what this implies for its reasoning conclusions (can it commit to a dominant reasoning trace, similar to here?) Can this be steered?
Apart from parallelism, investigate linear steerability in Coconut.
Project timeline
I first decided to work on continuous chain-of-thought because of its relevance and lack of good understanding. I familiarized myself with the matter by skimming this survey on CCoT and doing literature research with LLMs. I considered working with the CoDi repo, but had difficulties getting it to run. I decided against the better maintained Coconut repo because of the infrastructure requirements and overhead in terms of “hacking” a large codebase. I ended up with the work by Zhu et al because of its interpretable baseline and ease-of-access.
I then familiarized myself with the paper and repo with the help of LLMs. I ensured I could reproduce the experiments. I then created a viewer to follow what is going on. To test robustness, I then decided to synthetically generate graphs. From there on I did the remaining investigations, only investigating the hypothesis of non-orthogonal overlaps between nodes rather late.
Additional figures
Figure: Model attention under latent injection (replacement).
Figure: Injecting via replacement instead of addition to the hidden thought
Figure: non-WTA and WTA projections before taking their differences