StefanHex

Message

Stefan Heimersheim. Mechanistic interpretability & AI safety researcher, previously at FAR.AI and Apollo Research. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

2154

264

129

StefanHex

StefanHex

StefanHex

You can remove GPT2’s LayerNorm by fine-tuning for an hour

[Interim research report] Activation plateaus & sensitive directions in GPT2

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

Attribution-based parameter decomposition

StefanHex

You can remove GPT2’s LayerNorm by fine-tuning for an hour

[Interim research report] Activation plateaus & sensitive directions in GPT2

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

Attribution-based parameter decomposition

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

Activation Plateaus: Where and How They Emerge

Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability

Trusted monitoring, but with deception probes.

Compressed Computation is (probably) not Computation in Superposition

Try training token-level probes

Proof-of-Concept Debugger for a Small LLM