zekem — LessWrong

8

2mo

Defeating Introspection Adapters (and Why Threat Models Matter)

by Nick Merrill and zekem

We demonstrated an attack against Introspection Adapters (Shenoy et al., 2026), a technique for detecting malicious fine-tunes. Long story short: an attacker who controls model weights can apply a cheap, output-preserving transform that relocates the basis the auditor was calibrated against. This defeats the auditor with no observable change in...
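To make the idea of an output-preserving transform concrete, here is a toy sketch. It is not the attack from the post, whose exact transform is not given in this excerpt. It assumes the simplest reparameterization we know of, a permutation of hidden units in a two-layer ReLU network, and a hypothetical `probe` that stands in for an auditor calibrated on one fixed hidden coordinate. All names and weights are made up for illustration.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(W1, W2, x):
    """Return (hidden activations, output) of a two-layer ReLU network."""
    h = relu(matvec(W1, x))
    return h, matvec(W2, h)


def permute_hidden(W1, W2, perm):
    """Reorder hidden units: rows of W1 and the matching columns of W2.

    Assumption: a permutation is used here only because it is the simplest
    transform that leaves the network's function unchanged.
    """
    W1_p = [W1[i] for i in perm]
    W2_p = [[row[i] for i in perm] for row in W2]
    return W1_p, W2_p


def probe(h):
    """Hypothetical auditor: reads the hidden coordinate it was calibrated on."""
    return h[0]


# Illustrative weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.25]]
W2 = [[2.0, -1.0, 0.5]]
x = [0.8, 0.3]
perm = [2, 0, 1]

h, y = forward(W1, W2, x)
W1_p, W2_p = permute_hidden(W1, W2, perm)
h_p, y_p = forward(W1_p, W2_p, x)

same_output = math.isclose(y[0], y_p[0])          # the function is unchanged
probe_moved = not math.isclose(probe(h), probe(h_p))  # the probe's reading is not
```

In this toy setting the two networks compute the same output on `x`, while the probe reads a different value because the coordinate it was calibrated on now holds a different unit. A real attack on a real auditor would need a transform suited to the actual architecture and the actual calibration.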