Latent Reasoning Sprint #1: Tuned Lens and Logit Lens on CODI

Realmbird

As latent reasoning models become more capable, understanding what information they encode at each step becomes increasingly important for safety and interpretability. If tools like logit lens and tuned lens can decode latent reasoning chains, they could serve as lightweight monitoring tools — flagging when a model's internal computation diverges from its stated reasoning, or enabling early exit once the answer has crystallized. This post explores whether those tools work on CODI's 6 latent steps and what they reveal about its internal computation.

TL;DR

I applied logit lens and tuned lens to probe CODI's latent reasoning chain on GSM8K arithmetic problems.

Key findings:

The direct-trained tuned lens reveals the final answer in odd latent steps (1, 3, 5), while the plain logit lens cannot. A tuned lens trained only on latents 3 and 5 is not able to decode the final answer
Final answer detection rate in the top-k tokens is highest at even steps 2 and 4, consistent with even steps acting as storage
Entropy peaks at odd steps 3 and 5, consistent with active computation happening there
Training translators directly on CODI latent hidden states surprisingly underperforms translators trained on text tokens — suggesting latent vectors remain close to text token geometry

Experimental setup

CODI model

I use the publicly available CODI Llama 3.2 1B checkpoint from the LessWrong post “Can we interpret latent reasoning using current mechanistic interpretability tools?”

Tuned Logit Lens

I trained the tuned logit lens using the code implementation from “Eliciting Latent Predictions from Transformers with the Tuned Lens”.
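For readers unfamiliar with the two lenses, here is a short sketch of the standard definitions as I understand them from the tuned lens paper. The symbols are my own notation, not taken from the CODI code: h_ℓ is the hidden state at layer ℓ, W_U is the unembedding matrix, Norm is the model's final normalization layer, and (A_ℓ, b_ℓ) is the learned affine translator for layer ℓ.

```latex
\begin{aligned}
\mathrm{LogitLens}(h_\ell) &= W_U \,\mathrm{Norm}(h_\ell) \\
\mathrm{TunedLens}_\ell(h_\ell) &= \mathrm{LogitLens}(A_\ell h_\ell + b_\ell) \\
\mathcal{L}_\ell &= D_{\mathrm{KL}}\!\left( p_{\mathrm{final}} \,\middle\|\, \mathrm{softmax}\big(\mathrm{TunedLens}_\ell(h_\ell)\big) \right)
\end{aligned}
```

Here p_final is the model's own final-layer output distribution, so each translator is trained to predict what the model will eventually output, not the ground-truth token.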

Experiments

The model's intermediate calculations contain the final answer

PROMPT = "A team starts with 3 members. They recruit 5 new members. Then each current member recruits 2 additional people. How many people are there now on the team? Give the answer only and nothing else."
Answer = 10
I used the same prompt as the LessWrong post “Can we interpret latent reasoning using current mechanistic interpretability tools?”

The figure above shows a tuned logit lens trained on GSM8K with 500 samples for 3 epochs.

An interesting feature of the tuned logit lens trained on non-CODI tokens is that, unlike the plain logit lens, it shows that the CODI latents store information about the answer 10. Under the tuned lens, the answer 10 appears to be absent from the even latent vector indices 2, 4, and 6.
This differs from the plain logit lens, which only shows the intermediate steps 8 and 16 and never shows the final answer 10.

Tuned Logit Lens on CODI latents

Motivated by the strong performance of the direct-trained tuned lens, I trained a second set of translators directly on CODI's latent hidden states, hypothesizing that a lens specialized to latent-space geometry would outperform one trained on text tokens. This, however, was not the case.

Unlike the plain logit lens or the direct-trained tuned lens, the CODI tuned lens always output end-of-CoT tokens for latents 6 and beyond.
The tuned logit lens output differs noticeably depending on which latents were used to train it, especially between even and odd latents.
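The training subsets compared in the following subsections can be sketched as a simple filter over latent step indices. This is an illustrative sketch only: the names `LATENT_SUBSETS` and `select_latents` are hypothetical, and I am assuming that the "all latents" variant covers steps 1 through 6 and that hidden states are stored in a dict keyed by latent step.

```python
# Hypothetical names; the subset definitions follow the variants compared below.
LATENT_SUBSETS = {
    "all": (1, 2, 3, 4, 5, 6),  # assumption: all six CODI latent steps
    "even": (2, 4, 6),
    "odd": (1, 3, 5),
    "3_5": (3, 5),
}


def select_latents(hidden_by_step, subset):
    """Keep only the hidden states whose latent step belongs to the named subset."""
    keep = LATENT_SUBSETS[subset]
    return {step: h for step, h in hidden_by_step.items() if step in keep}
```

Each translator set would then be trained only on the hidden states returned for its subset.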

Tuned Logit Lens CODI

Training tuned logit lenses on CODI latents seemed to cause the lens output to mirror the final layer, which suggests overfitting.

Tuned Logit Lens CODI Latents Even

Training tuned logit lenses on the even CODI latents (2, 4, 6) seemed to cause the lens to output many more text tokens and \n tokens. The final layer can be ignored, since the tuned logit lens does not train a translator for the last layer.

Tuned Logit Lens CODI Latents Odd

Training tuned logit lenses on the odd CODI latents (1, 3, 5) had the opposite effect of training on the even latents: the lens output only numbers, even before the latent reasoning begins, as seen at latent vector index 0. It was able to find the final answer 10. However, it was unable to produce valid outputs for non-latent-reasoning activations, as seen in how it did not fully decode latent vector index 0.

Tuned Logit Lens CODI Latents (3,5)

If the tuned lens is trained only on latents 3 and 5, the 10s do not show up. Adding latent 1, so that training uses latents 1, 3, and 5, allows the lens to find the final answer.

Tuned Logit Lens on Entropy Over Layers

To explore the differences in output between the lens variants, I looked at the entropy of the decoded distribution at each layer.

The plain logit lens had the highest entropy, which only dropped at the very last layers.
The direct-trained tuned logit lens had the lowest entropy, below both the CODI tuned lens and the plain logit lens.
An interesting finding is that, excluding latent 1, latents 3 and 5 have the highest entropy across all three lens variants.
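For reference, the entropy measure can be sketched as the Shannon entropy of the softmax over the lens logits at each layer. This is a minimal pure-Python illustration under that assumption, not the exact code used for the plots; the function names are hypothetical, and the result is in nats.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_by_layer(layer_logits):
    """Entropy of the lens output at each layer for one position."""
    return [entropy(logits) for logits in layer_logits]
```

A uniform distribution over the vocabulary gives the maximum value (the log of the vocabulary size), while a lens that is confident in a single token gives a value near zero.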

Tuned Logit Lens on Intermediate answer detection

Seeing the final answer in the tuned logit lens output made me curious about two things: how often each lens variant emits the final answer anywhere in its top-k tokens, not just as the top token, and which latents predict the final answer at the highest rates.

The direct-trained tuned logit lens did best, with the highest final answer detection rate.
Surprisingly, the latents with the highest final answer detection rate in the top-k tokens were the even latents 2 and 4.
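The detection rate can be sketched as follows: for each latent step, count the fraction of problems whose top-k decoded tokens include the final answer. This is an illustrative sketch with a hypothetical function name and data layout. It assumes the answer is a single token and is matched as a string after stripping whitespace; multi-token answers would need different handling.

```python
def final_answer_detection_rate(topk_tokens, answers):
    """Fraction of examples whose top-k tokens contain the answer, per latent step.

    topk_tokens: one dict per example, mapping latent step -> list of decoded
                 top-k token strings at that step (assumed layout).
    answers:     one answer string per example.
    """
    hits = {}
    counts = {}
    for per_step, answer in zip(topk_tokens, answers):
        for step, tokens in per_step.items():
            counts[step] = counts.get(step, 0) + 1
            # Assumption: exact string match on the whitespace-stripped token.
            if any(tok.strip() == answer for tok in tokens):
                hits[step] = hits.get(step, 0) + 1
    return {step: hits.get(step, 0) / counts[step] for step in sorted(counts)}
```

Running this once per lens variant and per value of k gives one rate per latent step, which is what the comparison above is based on.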

Discussion

The intermediate answer detection experiment showing the highest rates for latents 2 and 4 is consistent with this LessWrong post.
The high entropy at latents 3 and 5, together with the final answer appearing only in the odd latents 1, 3, and 5 under the tuned logit lens, is also consistent with that post: the high entropy at latents 3 and 5 could explain why overriding those latents caused a considerable decrease in accuracy.
That the even latents 2 and 4 most often contain the intermediate answers in their top-k tokens may be explained by the Scratchpad Thinking paper's finding that even latent steps store and access numerical information.
Together, these three findings provide converging evidence from logit lens and entropy analysis that CODI follows the compute/store alternation proposed in the Scratchpad Thinking paper.

Future Work

Measure the final answer detection rate for different values of k and check whether latents 2 and 4 still have the highest rates
Try other methods, such as patchscopes and activation oracles, on the latent reasoning
Attempt to create thought anchors that work with CODI latent reasoning models
