neverix's Shortform

neverix


TL;DR: reproducing CODI attention map interpretability results with a dashboard.

CODI (Shen et al. 2025) is a latent chain-of-thought method. Like Coconut, it distills a text-based CoT into latent tokens. Unlike Coconut, which only optimizes end-to-end model performance conditional on the CoT, CODI also aligns the hidden states of the response tokens so that the compressed latent CoT block is interpreted the same way as the text-based CoT. The authors report GSM-8k performance comparable to text-based CoT for GPT-2.

In Section 5 of the paper, the authors show that intermediate reasoning steps on GSM-8k can be decoded by applying the logit lens to the soft (latent) tokens.
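For readers unfamiliar with the logit lens, here is a minimal sketch of the idea on toy data: take a latent's hidden state, apply the model's final layer norm and unembedding matrix, and read off the top-scoring vocabulary tokens. Everything here is an assumption made for illustration (the 4-dimensional hidden size, the three-token vocabulary, the hand-built unembedding matrix, and the function names); it is not CODI's code, and a real run would use GPT-2's own final layer norm and unembedding weights.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Standard layer norm over the last (feature) axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def logit_lens_top_k(hidden, ln_gain, ln_bias, unembed, vocab, k=3):
    """Decode one hidden state (d_model,) into its top-k vocabulary tokens.

    unembed has shape (d_model, vocab_size); vocab maps column index -> token.
    """
    logits = layer_norm(hidden, ln_gain, ln_bias) @ unembed
    top = np.argsort(-logits)[:k]  # indices sorted by descending logit
    return [vocab[i] for i in top]

# Toy setup (hypothetical): three tokens with orthogonal, zero-mean unembedding columns.
vocab = ["12", "20", "+"]
unembed = np.array([
    [ 1.0,  1.0,  1.0],
    [-1.0,  1.0, -1.0],
    [ 1.0, -1.0, -1.0],
    [-1.0, -1.0,  1.0],
])
d_model = unembed.shape[0]

# A made-up latent that mostly points along "20" with a little "12" mixed in.
hidden = 3.0 * unembed[:, 1] + 0.5 * unembed[:, 0]

print(logit_lens_top_k(hidden, np.ones(d_model), np.zeros(d_model), unembed, vocab, k=2))
# → ['20', '12']
```

In the dashboard setting, the same decoding would be applied to each latent position in turn, which is what makes the intermediate values readable as tokens.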

Additionally, as shown in the figure, CODI visualizes "attended tokens" for each latent token. These are surprisingly meaningful and show something close to the actual sequence of intermediate values in the expression tree.

However, the procedure for generating these dashboards isn't described in the paper or the official codebase. The attended tokens are the top K tokens in the context by the value of the attention map, and the paper does not clarify whether this means the attention map for some head, the average of maps for all heads in some layer, or something else.

I emailed one of the authors to ask about the details of the method and was told that the attention map is the average attention probability over all layers and heads, with no other tricks. I Clauded an implementation of this aggregation and a dashboard. The intermediate results are visible in the top-1 prediction on the odd-numbered tokens, while the even-numbered tokens seem to predict random characters.
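The aggregation described above can be sketched as follows: average the attention probabilities over the layer and head axes, then take the top K context tokens for a given latent's query position. This is a minimal illustration under my own assumptions, not the dashboard's actual code: the function name, the toy tokens, and the attention values are all made up. With Hugging Face `transformers`, an array of this shape could be obtained by stacking the per-layer tensors returned with `output_attentions=True` and dropping the batch dimension.

```python
import numpy as np

def top_attended_tokens(attn, tokens, query_pos, k=2):
    """Top-k context tokens for one query position.

    attn: attention probabilities, shape (n_layers, n_heads, n_query, n_key).
    The map is a plain mean over all layers and heads, with no other tricks.
    """
    avg = attn.mean(axis=(0, 1))       # -> (n_query, n_key)
    row = avg[query_pos]               # attention paid by the chosen position
    top = np.argsort(-row)[:k]         # keys sorted by descending average weight
    return [(tokens[i], float(row[i])) for i in top]

# Toy example (hypothetical values): 2 layers, 2 heads, 4 positions.
tokens = ["4", "apples", "3", "<latent_1>"]
attn = np.zeros((2, 2, 4, 4))
# Only the latent's row (query position 3) is filled in for this illustration.
attn[:, :, 3, :] = np.array([
    [[0.7, 0.1, 0.1, 0.1],    # layer 0, head 0
     [0.1, 0.1, 0.7, 0.1]],   # layer 0, head 1
    [[0.5, 0.0, 0.5, 0.0],    # layer 1, head 0
     [0.1, 0.5, 0.3, 0.1]],   # layer 1, head 1
])

for token, weight in top_attended_tokens(attn, tokens, query_pos=3, k=2):
    print(token, round(weight, 3))
# → 3 0.4
# → 4 0.35
```

Note that no single head in this toy example ranks both operands at the top; the averaged map does, which is the kind of behaviour the plain mean is meant to surface.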

Recent work shows a similar pattern in Llama 3.2 1B: tokens that have an interpretable prediction under the logit lens and tokens that don't seem to correspond to anything will predictably alternate. They show that these meaningless tokens probably store the result of the previous computation, the one that was visible in logit lens on the odd-numbered tokens. This seems to also be true in GPT-2: the even-numbered tokens most strongly attend to the previous token (though this is not necessary for them to perform the function), and following tokens performing operations on results computed in odd-numbered tokens attend to the even-numbered tokens instead.

Overall, the evidence points to odd-numbered latents computing intermediate values by attending to 1-2 tokens and even-numbered latents storing these results for reference. In the dashboard, in most of the cases shown, every odd token, including the first, attends to tokens that are operands of some binary arithmetical operation. When the operands are in the prompt and not the CoT, the attention may fall on some unrelated token that follows the actual value. When the computation is complete, the top-1 predicted tokens on the latent vectors stop changing and settle on one number. This number is not the answer; however, it can be used in a single binary operation to get the model's final answer.
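One way to make "the top-1 predictions stop changing" concrete is to find the first latent position from which the top-1 token stays constant through the end of the latent block. The helper below is a hypothetical sketch, not part of the dashboard, and the example token list is invented. Because the even-numbered latents tend to decode to unrelated characters, it probably only makes sense to pass in the top-1 tokens of the odd-numbered latents.

```python
def settle_index(top1_tokens):
    """Index of the first position from which the top-1 token never changes again."""
    if not top1_tokens:
        raise ValueError("need at least one top-1 token")
    final = top1_tokens[-1]
    idx = len(top1_tokens) - 1
    # Walk backwards while the preceding prediction still equals the final one.
    while idx > 0 and top1_tokens[idx - 1] == final:
        idx -= 1
    return idx

# Invented top-1 tokens for a run of odd-numbered latents.
odd_top1 = ["20", "32", "8", "8", "8"]
print(settle_index(odd_top1))
# → 2
```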


It's unclear whether this transparent encoding will remain robust at scale or on more complicated and specialized tasks. At the very least, the way the top attended-to tokens correspond to operands has to be specific to simple arithmetic. The base model's (teacher-forced) CoT that CODI distills from also has meaningful all-head-averaged attention maps, with the caveat that the text-based CoT has to insert the operands explicitly. There may be a more general correspondence between features of the algorithm the model implements in text-based CoT and in latent CoT. Liang et al. 2026 show that CODI can learn some other arithmetic tasks with similarly interpretable top predictions and attention maps.

Scratchpad thinking (Goyal et al. 2025) from Algoverse also uses attention averaged over heads and layers and shows the same pattern of alternating computation and storage. I saw this paper when running these experiments ~5 months ago but didn't realise it used the same methodology for visualizing attention patterns. The findings (if they can be called that) in this post are a subset of those in the paper, but I provide examples of intermediate predictions and attention patterns in addition to the averaged plots from Goyal et al.
