In my previous post I applied logit lens and tuned lens to CODI's latent reasoning chain and found evidence consistent with the scratchpad paper's compute/store alternation hypothesis (even steps showed higher intermediate-answer detection, odd steps higher entropy), along with results matching "Can we interpret latent reasoning using current mechanistic interpretability tools?". This post investigates activation steering applied to the same latent reasoning chain.
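To make the probing setup concrete, here is a minimal logit-lens sketch in the spirit of that analysis: each latent step's hidden state is projected through the model's final layer norm and unembedding, giving a decoded top token (used for intermediate-answer detection) and the entropy of the induced next-token distribution. This assumes a GPT-2-style HuggingFace backbone; `latent_states` is a stand-in for CODI's continuous reasoning states, which would need to be captured from an actual CODI checkpoint during inference.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in backbone
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def logit_lens(latent_states: torch.Tensor):
    """Decode each latent reasoning step through ln_f + unembedding.

    latent_states: (num_steps, hidden_dim) hidden states of the latent chain.
    Returns a (step, top_token, entropy) tuple per step.
    """
    results = []
    with torch.no_grad():
        for step, h in enumerate(latent_states):
            # Standard logit lens: final LayerNorm, then the unembedding matrix.
            logits = model.lm_head(model.transformer.ln_f(h))
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
            top_token = tokenizer.decode([probs.argmax().item()])
            results.append((step, top_token, entropy))
    return results

# Stand-in latent chain; real CODI states would be collected at inference time.
latent_states = torch.randn(6, model.config.n_embd)
for step, token, entropy in logit_lens(latent_states):
    print(f"step {step}: top token {token!r}, entropy {entropy:.2f} nats")
```

Under the compute/store hypothesis, even ("store") steps should decode to intermediate answers with low entropy, while odd ("compute") steps should show higher entropy; a tuned lens would replace the raw projection with a learned per-layer affine translator.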
As latent reasoning models become more capable, understanding what information they encode at each step becomes increasingly important for safety and interpretability. If tools like logit lens and tuned lens can decode latent reasoning chains, they could serve as lightweight monitors, flagging when a model's internal computation diverges from expected behavior, as sketched below.
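As an illustration of what such monitoring could look like (a sketch, not an existing tool), one could reuse the `logit_lens` helper above and flag any chain whose even, nominally "store" steps decode with unusually high entropy. The even/odd split is carried over from the alternation hypothesis, and the threshold is an arbitrary placeholder.

```python
def monitor_chain(latent_states, entropy_threshold: float = 4.0):
    """Flag latent steps that break the expected low-entropy 'store' pattern.

    Hypothetical monitor: under the compute/store hypothesis, even steps
    should decode crisply, so a high-entropy even step is anomalous.
    """
    flags = []
    for step, token, entropy in logit_lens(latent_states):
        if step % 2 == 0 and entropy > entropy_threshold:
            flags.append({"step": step, "token": token, "entropy": entropy})
    return flags  # non-empty => the chain deviates from the usual pattern
```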
Introduction

Some reasoning steps in a large language model's chain of thought, such as those involving generating plans or managing uncertainty, have disproportionately more influence on final answer correctness and downstream reasoning than others. Recent work (Macar & Bogdan) has discovered that models contain attention heads which attend almost solely to these high-influence sentences, termed thought anchors.
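Since the experiments here center on activation steering, a brief sketch of the technique may help: a steering vector is added to the residual stream at a chosen layer during the forward pass, typically via a forward hook. Everything below is illustrative rather than this post's actual configuration: the layer index, the scale `alpha`, and the provenance of `steering_vector` (e.g. a difference of mean activations between two contrasting prompt sets) are all assumptions, and the hook targets a GPT-2-style block layout.

```python
import torch

def add_steering_hook(model, layer_idx: int, steering_vector: torch.Tensor,
                      alpha: float = 4.0):
    """Register a hook that shifts one layer's residual stream by alpha * v."""
    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple whose first element is the hidden states;
        # the steering vector broadcasts across batch and sequence positions.
        if isinstance(output, tuple):
            return (output[0] + alpha * steering_vector,) + output[1:]
        return output + alpha * steering_vector
    return model.transformer.h[layer_idx].register_forward_hook(hook)

# Usage (hypothetical direction v, e.g. a difference of mean activations):
# handle = add_steering_hook(model, layer_idx=6, steering_vector=v)
# out = model.generate(**tokenizer("2 + 3 * 4 =", return_tensors="pt"),
#                      max_new_tokens=8)
# handle.remove()  # always detach the hook afterwards
```

Steering the latent chain itself would presumably work the same way, with the shift applied only at the positions of the latent reasoning steps rather than across the whole sequence.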