In my previous post I applied the logit lens and tuned lens to CODI's latent reasoning chain and found evidence consistent with the scratchpad paper's compute/store alternation hypothesis: even steps showed higher intermediate answer detection and odd steps showed higher entropy, matching the results of "Can we interpret latent reasoning using current mechanistic interpretability tools?".
This post digs deeper into some of the questions I had while writing the previous post, along with new findings I did not expect.
TL;DR
The CODI tuned lens fails to generalize to non-CODI activations
Final answer detection at even steps 2 and 4 is robust across top-k values 1-10, ruling out threshold artifacts
"Therefore" appears in the model's top-k specifically after latent step 3, with its detection rate increasing from latent 4 up to latent 6, suggesting that step 3 and step 5 are qualitatively different despite both being odd computation steps
A linear probe trained to distinguish intermediate vs final answer representations activates most strongly at odd steps 3 and 5, consistent with the scratchpad paper's compute/store alternation — but with step 3 peaking higher than step 5, revealing asymmetry between odd steps
Experimental setup
CODI model
I use the publicly available CODI Llama 3.2 1B checkpoint from "Can we interpret latent reasoning using current mechanistic interpretability tools?".
Tuned Logit Lens
To create my tuned logit lens implementation, I used the training code from "Eliciting Latent Predictions from Transformers with the Tuned Lens".
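As a rough illustration of what a tuned lens adds over a plain logit lens, here is a toy numpy sketch. The sizes, matrices, and identity initialization are all made up for illustration; the real implementation trains the translator against the model's final-layer distribution with a KL objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 50                    # toy sizes, not CODI's real dimensions
W_U = rng.normal(size=(vocab, d_model))    # stand-in for the unembedding matrix

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def logit_lens(h):
    # Logit lens: unembed an intermediate hidden state directly.
    return softmax(W_U @ h)

def tuned_lens(h, A, b):
    # Tuned lens: first map the hidden state through a learned affine
    # "translator" (A, b), then unembed. In the real method, (A, b) is
    # trained to minimize KL divergence to the final-layer distribution.
    return softmax(W_U @ (A @ h + b))

h = rng.normal(size=d_model)
A, b = np.eye(d_model), np.zeros(d_model)  # identity init: tuned lens == logit lens
```

With the translator at identity the two lenses coincide; training moves (A, b) away from identity wherever the intermediate basis differs from the final one.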
Experiments
Confirming Previous Assumptions
CODI Tuned Lens Generalization Failure
In the previous post I speculated that the CODI tuned lens, while it worked on CODI tokens, failed to generalize to non-CODI activations.
For this figure I used the tuned logit lens trained on latents 1-6.
Confirming the Intermediate answer detection
I looked at intermediate answer detection across top-k values from 1 to 10, since certain latents might show a higher answer detection rate only at a specific top-k value.
Regardless of the tuned logit lens variant, final answer detection peaked at latents 2 and 4.
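Concretely, the robustness check amounts to a rank test repeated across k. A minimal sketch, where the function names and toy logits are mine rather than from the actual code:

```python
import numpy as np

def detected_at_topk(logits, answer_id, k):
    # True if the answer token is among the k highest-logit tokens.
    topk_ids = np.argsort(logits)[::-1][:k]
    return bool(answer_id in topk_ids)

def detection_curve(logits_per_latent, answer_id, ks=range(1, 11)):
    # For each latent step, record detection at every k (1..10 by default).
    return {step: [detected_at_topk(l, answer_id, k) for k in ks]
            for step, l in enumerate(logits_per_latent, start=1)}

# Toy logits for two latent steps; token 2 plays the role of the answer.
toy = [np.array([0.1, 0.2, 0.0]),   # latent 1: answer ranked last
       np.array([0.0, 0.1, 0.9])]   # latent 2: answer ranked first
curves = detection_curve(toy, answer_id=2, ks=range(1, 4))
```

A latent whose detection flips on only at large k would indicate a threshold artifact; detection that holds from k=1 upward, as at latents 2 and 4, does not.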
Investigating the token "Therefore" in CODI
The tuned logit lens trained only on latent 5 predicted the word "Therefore" surprisingly often, which I found strange.
This made me curious whether "Therefore" appeared in the model's top-k predictions. Interestingly, "Therefore" appeared in the model's top-k only after latent 3, with its rate increasing from latents 4 through 6.
Step 5 may serve a conclusion-signaling role distinct from step 3's computation: the emergence of "Therefore" after step 3 suggests the model commits to an answer at step 3 and signals that commitment at step 5. This could help explain why patching the final two latent vectors with random activations does not decrease accuracy in "Can we interpret latent reasoning using current mechanistic interpretability tools?".
The only exception to "Therefore" appearing only after latent 3 was the tuned logit lens trained on latent 5, i.e. both the lens trained on latents 1 and 5 and the lens trained only on latent 5. These lenses showed spikes in "Therefore" at the odd latents 1 and 3 in addition to after latent 3. This could simply be a side effect of overfitting to latent 5.
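The per-latent "Therefore" rate can be sketched as a simple counting pass over decoded top-k tokens. The data structure and numbers below are illustrative, not the real measurements:

```python
from collections import Counter

def token_rate_in_topk(examples, token="Therefore"):
    # examples: one dict per problem, mapping latent step -> the list of
    # top-k decoded token strings at that step.
    counts = Counter()
    for example in examples:
        for step, toks in example.items():
            if token in toks:
                counts[step] += 1
    return {step: counts[step] / len(examples) for step in sorted(counts)}

# Illustrative data for two problems (not real measurements).
data = [{3: ["6", "So"], 4: ["Therefore", "6"], 5: ["Therefore"], 6: ["Therefore"]},
        {3: ["7"],       4: ["7"],              5: ["Therefore", "="], 6: ["Therefore"]}]
rates = token_rate_in_topk(data)
```

A pattern like the one observed would show a rate of zero through latent 3 and a rate climbing toward 1.0 from latent 4 through latent 6.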
Linear Probes
I created a linear probe as a binary classifier, following the ARENA implementation. Label 1 was numbers emitted as the final answer, and label 0 was numbers appearing in intermediate steps rather than in final answer emission. I made the 0 class intermediate steps because I did not want to train a probe that simply activates on numbers.
The final answer detection probe activated most strongly on latents 3 and 5.
Linear probes trained on earlier layers, such as layer 0, activated less than probes trained on later-layer activations. Perhaps the intermediate-vs-final answer distinction becomes more linearly separable deeper in the network.
Note that the dataset had more intermediate-labeled (0) samples than final answer (1) samples, roughly a 4:1 ratio, so probe accuracy should be interpreted with this imbalance in mind.
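A minimal stand-in for this kind of probe, using synthetic activations in place of real CODI hidden states (the dimensions, shift direction, and learning rate here are all made up), is ordinary logistic regression:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                              # toy activation dimension

# Synthetic stand-ins: "final answer" activations shifted along one direction.
direction = rng.normal(size=d)
X0 = rng.normal(size=(80, d))                      # intermediate-step numbers (label 0)
X1 = rng.normal(size=(20, d)) + 2.0 * direction    # final-answer emission (label 1)
X = np.vstack([X0, X1])
y = np.array([0] * 80 + [1] * 20)                  # ~4:1 imbalance, as in the post

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain logistic-regression probe trained with gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
# With a 4:1 imbalance, always predicting 0 already gets 80% accuracy,
# so the probe should be judged against that baseline, not against 50%.
```

The comment at the end is the key caveat: under this label ratio, a probe must clear the 80% majority-class baseline before its accuracy says anything.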
Future Work
Investigating the token "Convert" in CODI
When looking at the tuned logit lens trained on latent 1, an interesting token appeared: "Convert".
When using the CODI tuned logit lens on non-number-based questions such as:
PROMPT = "Name a mammal that can fly"
Try other methods like patchscopes, activation oracles, etc. on the latent reasoning
Attempt to create thought anchors that work with CODI latent reasoning models
I suspect the CODI model experienced catastrophic forgetting outside of math questions
Try activation patching with "Therefore" to see if I can lower the accuracy change from latent 3 to latent 5