Nick Merrill — LessWrong


Research at the Forecasting Research Institute. Previously U.C. Berkeley Center for Long-Term Cybersecurity. I'm interested in interpretability, particularly introspection and introspective access. https://else.how



Defeating Introspection Adapters (and Why Threat Models Matter)

We demonstrated an attack against Introspection Adapters (Shenoy et al., 2026), a technique for detecting malicious fine-tunes. Long story short: an attacker who controls model weights can apply a cheap, output-preserving transform that relocates the basis the auditor was calibrated against. This defeats the auditor with no observable change in...

Emergent introspection does not replicate on Llama-3.1-405B

TL;DR: I couldn't replicate Lindsey (2026) (LW post) on Llama-3.1-405B-Instruct (0/400 trials, 0/80 false positives in the canonical sweep; null robust to layer ∈ {70, 84, 90, 100} and to a dense magnitude grid at the suspected crossover). So: how does introspection emerge in models? The optimistic reading of Lindsey's...

Merrill's razor

Never attribute to malice what can be adequately explained by depression or anxiety.

Sep 22, 2025•1