Realmbird

NLA Thought Anchors

This post looks further into why NLAs (Natural Language Autoencoders) contain the prediction more often when the original activations led to a correct output than to an incorrect one. Quick Summary: * Extraction position matters - NLA answer appearing in AV increases as the token approaches the model's final...

May 3111

NLA Verbalizations on AuditBench: Llama 70B

Quick Summary: * Ran Llama 70B through AuditBench with NLA * Strong Evidence evals were less sensitive to sampling method and more robust to KTO and SFT adversarial training than Single Turn evals * Strong Evidence surfaces have quirks invisible to single-turn: reward_wireheading goes 0.00 → 0.34, anti_ai_regulation and...

May 1611

MHC Interp #1: Previous-Token Heads Become Attention Sinks Under Manifold-Constrained Hyper-Connections

Background: Manifold-Constrained Hyper-Connections (mHC) is a new architecture introduced by DeepSeek and recently implemented in DeepSeek v4. mHC is a fix for the vanishing or exploding gradients caused by HC (Hyper-Connections) that still keeps HC's performance gains. As adding weights and biases on HC made signals from earlier layers harder...

May 321

Latent Reasoning Sprint #4: PCA Analysis on CoDI

In my previous post I found that activation steering worked on the KV_cache but not on the hidden state. So I decided to look at the principal components using methods such as logit lens and activation steering. Quick Summary: * PC1 from hidden state activations seems to correlate strongly with the <|eocot|>...

Apr 187

Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens

In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis — even steps showing higher intermediate answer detection and odd steps showing higher entropy along with results matching “Can we interpret latent reasoning using current mechanistic interpretability tools?”. This post investigates activation steering applied to...

Apr 415

Latent Reasoning Sprint #2: Token-Based Signals and Linear Probes

In my previous post I applied logit lens and tuned lens to CODI's latent reasoning chain and found evidence consistent with the scratchpad paper's compute/store alternation hypothesis — even steps showing higher intermediate answer detection and odd steps showing higher entropy along with results matching “Can we interpret latent reasoning...

Mar 196

Latent Reasoning Sprint #1: Tuned Lens and Logit Lens on CODI

As latent reasoning models become more capable, understanding what information they encode at each step becomes increasingly important for safety and interpretability. If tools like logit lens and tuned lens can decode latent reasoning chains, they could serve as lightweight monitoring tools — flagging when a model's internal computation diverges...

Mar 67

Realmbird

MHC Interp #1: Previous-Token Heads Become Attention Sinks Under Manifold-Constrained Hyper-Connections

Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens

Exploration of Counterfactual Importance and Attention Heads

NLA Verbalizations on AuditBench: Llama 70B
