Message

Neuroscientist researching how brains decide their own learning strategies. Amateur AI Safety researcher since that seems important.

Technical writing at alexeigannon.com

2mo

Alexei G

Neuroscientist researching how brains decide their own learning strategies. Amateur AI Safety researcher since that seems important.

Technical writing at alexeigannon.com

Alexei G — LessWrong

Alexei G

Message

Neuroscientist researching how brains decide their own learning strategies. Amateur AI Safety researcher since that seems important.

Technical writing at alexeigannon.com

2mo

Alexei G

Neuroscientist researching how brains decide their own learning strategies. Amateur AI Safety researcher since that seems important.

Technical writing at alexeigannon.com

Alignment-Faking Evaluations Measure Jailbreak Detection, Not Scheming [in some frontier models]

Main Finding: When you provide classic alignment-faking evaluation prompts to Claude 4.5 Haiku, you see robust non-compliance mediated by jailbreak detection. When you strip adversarial cues from these prompts, Haiku's compliance increases dramatically and shifts from jailbreak refusal to value-based reasoning. Full write-up for this research sprint with figures: alexeigannon.com/adversarialdetection...

Mar 12•7

A Technical Primer on Mechanistic Interpretability

> Note: This is a static version of an interactive primer with animated visualizations and a glossary sidebar. I recommend reading the full version, but the text stands on its own. Motivation & Background On the one hand, I write this as a neuroscientist who believes the techniques developed by...

Feb 19•1