Quick Summary: * Ran Llama 70B through Audit Bench with NLA * Strong Evidence evals were less sensitive to sampling method and more robust to KTO and SFT adversarial training than Single Turn evals * Strong Evidence surfaces have quirks invisible to single-turn: reward_wireheading goes 0.00 → 0.34, anti_ai_regulation and...
Background: Manifold-Constrained Hyper-Connections (mHC) is a new architecture added by Deepseek and recently implemented in Deepseek v4. mHC is a fix that makes HC(Hyper-Connections) vanishing or exploding gradient caused by HC while still keeping the performance increases. As adding weights and biases on HC made signals from earlier layers harder...
In my previous post I found that activation steering worked with KV_cache and not with hidden state steering. So I decided to look at the PCA with methods such as logit lens and activation steering Quick Summary: * PC1 from hidden state activations strongly seems to correlate with the <|eocot|>...
In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis — even steps showing higher intermediate answer detection and odd steps showing higher entropy along with results matching “Can we interpret latent reasoning using current mechanistic interpretability tools?”. This post investigates activation steering applied to...
In my previous post I applied logit lens and tuned lens to CODI's latent reasoning chain and found evidence consistent with the scratchpad paper's compute/store alternation hypothesis — even steps showing higher intermediate answer detection and odd steps showing higher entropy along with results matching “Can we interpret latent reasoning...
As latent reasoning models become more capable, understanding what information they encode at each step becomes increasingly important for safety and interpretability. If tools like logit lens and tuned lens can decode latent reasoning chains, they could serve as lightweight monitoring tools — flagging when a model's internal computation diverges...
Introduction Some reasoning steps in a large language model’s chain of thought, such as those involving generating plans or managing uncertainty, have disproportionately more influence on final answer correctness and downstream reasoning than others. Recent work (Macar & Bogdan) has discovered that models containing attention heads which attend solely to...