This post looks further into why NLA (Natural Language Autoencoders) contains the prediction more often when the original activations led to a correct output than to an incorrect one. Quick Summary: * Extraction position matters - NLA answer appearing in AV increases as the token approaches the model's final...
Quick Summary: * Ran Llama 70B through Audit Bench with NLA * Strong Evidence evals were less sensitive to sampling method and more robust to KTO and SFT adversarial training than Single Turn evals * Strong Evidence surfaces have quirks invisible to single-turn: reward_wireheading goes 0.00 → 0.34, anti_ai_regulation and...
Background: Manifold-Constrained Hyper-Connections (mHC) is a new architecture introduced by DeepSeek and recently implemented in DeepSeek v4. mHC is a fix for the vanishing or exploding gradients caused by HC (Hyper-Connections) that still keeps HC's performance gains. As adding weights and biases on HC made signals from earlier layers harder...
In my previous post I found that activation steering worked with KV_cache steering but not with hidden state steering. So I decided to look at the PCA with methods such as logit lens and activation steering. Quick Summary: * PC1 from hidden state activations seems to correlate strongly with the <|eocot|>...
In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis — even steps showing higher intermediate answer detection and odd steps showing higher entropy along with results matching “Can we interpret latent reasoning using current mechanistic interpretability tools?”. This post investigates activation steering applied to...
In my previous post I applied logit lens and tuned lens to CODI's latent reasoning chain and found evidence consistent with the scratchpad paper's compute/store alternation hypothesis — even steps showing higher intermediate answer detection and odd steps showing higher entropy along with results matching “Can we interpret latent reasoning...
As latent reasoning models become more capable, understanding what information they encode at each step becomes increasingly important for safety and interpretability. If tools like logit lens and tuned lens can decode latent reasoning chains, they could serve as lightweight monitoring tools — flagging when a model's internal computation diverges...