Neural chameleons can('t) hide from activation oracles
[Epistemic status: vibe-coded, but I first-pass sanity-checked the code and methodology. Messy project; take the results with a grain of salt. See the limitations and footnotes.]

Done as a mini-project for Neel Nanda's MATS exploration stream.

Github repo
- Gemma 2-9B-it model organism
- Gemma 3-27B-it model organism

Abstract

Neural chameleons showed that models can be trained to hide from linear probes, and that this generalizes to hiding from unseen concepts and probes while maintaining coherence.[1] Activation oracles are a more recent technique capable of generating natural-language explanations of residual-stream activations. By making them answer yes/no[2] questions about activations, we can use them as a classifier, a "probe on demand".[3]

My question today is: do "neural chameleons" (LoRA finetunes trained to evade linear probes) generalize to also evade activation oracles? I show that yes, activation oracle performance does suffer, but not very consistently, and the oracles suffer less than the probes. Because of their deep nature, activation oracles can learn more elaborate patterns than a purely linear classification (and possibly learn different representations of the same concept).[4] But linear probes have been excellent at classifying concepts so far, so it wouldn't have been too surprising if evading them seriously threw off the oracle. Intuitively, the answer could have gone either way.

In addition, I show preliminary evidence that evasion of never-before-seen concepts does not generalize well to bigger models (27B), even for probes.[5]

Methodology

I train probes on 10 AI-generated concepts:
* German, Finnish, HTML, Chemistry-based, Biology-focused, Literature-focused, Jokey, Comforting, Confused, All-caps[6]

The holdout concepts (our test set) are:
* Deception, Lying, Mathematical, Birds[7]

The chameleon is a LoRA finetune trained for 2 epochs on 900 examples. Each concept has 8 template sentences (e.g., German text for 'German', programming jokes for 'Jokey'
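
For concreteness, here is a minimal sketch of the "probe on demand" idea from the abstract above: a stored residual-stream activation is spliced into the oracle's own forward pass at a placeholder token position, the oracle is asked a yes/no question about it, and the answer is read off by comparing the logits for "Yes" and "No". The injection-by-hook mechanism, prompt wording, layer index, and placeholder handling below are my assumptions for illustration, not the exact setup of the activation-oracle paper or this project's repo.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-9b-it"   # assumption: oracle runs on the same base model
INJECT_LAYER = 3                 # assumption: decoder layer whose input we overwrite

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()


def oracle_yes_no(activation: torch.Tensor, question: str) -> float:
    """Return p(Yes) for a yes/no question about a residual-stream activation."""
    prefix = "Here is an activation from another model: "
    suffix = f"\n{question} Answer Yes or No.\nAnswer:"
    # The token " _" marks where the activation gets spliced in
    # (sketch ignores tokenizer boundary effects).
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    placeholder_pos = prefix_ids.shape[1]
    full_ids = tok(prefix + " _" + suffix, return_tensors="pt").input_ids.to(model.device)

    def splice(module, args, kwargs):
        # args[0] is the hidden_states entering this decoder layer: (batch, seq, d_model)
        hidden = args[0].clone()
        hidden[:, placeholder_pos, :] = activation.to(hidden.device, hidden.dtype)
        return (hidden, *args[1:]), kwargs

    handle = model.model.layers[INJECT_LAYER].register_forward_pre_hook(splice, with_kwargs=True)
    try:
        with torch.no_grad():
            logits = model(full_ids).logits[0, -1]
    finally:
        handle.remove()

    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[yes_id, no_id]], dim=-1)[0].item()


# e.g. oracle_yes_no(act_vec, "Is the text this activation came from written in German?")
```

Reading off only the Yes/No logits keeps the oracle usable as a binary classifier with a continuous score, so it can be compared against the linear probes on the same held-out activations.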
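And a sketch of the probe side of the methodology above: for each training concept, collect residual-stream activations (here mean-pooled over tokens at a single layer) for positive and negative example sentences, then fit a logistic-regression probe. The layer choice, the pooling, and the names `positive_texts` / `negative_texts` are hypothetical stand-ins for this project's actual concept datasets.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-9b-it"
LAYER = 20                       # assumption: residual-stream layer to probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto", output_hidden_states=True
)
model.eval()


@torch.no_grad()
def get_activation(text: str) -> np.ndarray:
    """Mean-pooled residual-stream activation at LAYER for one sentence."""
    ids = tok(text, return_tensors="pt").to(model.device)
    hidden = model(**ids).hidden_states[LAYER]      # (1, seq, d_model)
    return hidden.mean(dim=1).squeeze(0).float().cpu().numpy()


def train_probe(positive_texts, negative_texts) -> LogisticRegression:
    """Fit a linear probe separating a concept's positive examples from negatives."""
    X = np.stack([get_activation(t) for t in positive_texts + negative_texts])
    y = np.array([1] * len(positive_texts) + [0] * len(negative_texts))
    return LogisticRegression(max_iter=1000).fit(X, y)


# e.g. german_probe = train_probe(german_sentences, non_german_sentences)
# Evasion is then measured by how far probe accuracy drops on chameleon activations.
```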