On the aggregate the first claim is true. But for gemma 3 27b specifically it doesn't hold,I suppose you are right, the abstract is a bit deceiving. I'll fix it. Thank you!
Notably, the oracle suffers more than the probes.
Probes are more selective, specific and less complex than oracle. So, my guess is that it is easy for neural chameleons to directly evade the probes and manipulate the activations. I agree that the "size" of oracle can be a reason as well. Another idea is maybe for oracles we required more robust neural chameleons.
That was my hunch too, and it's why I switched to gemma 3 27b. Would be interesting to run this experiment on llama 3.3 70b. Could do this with a simple code change.
[epistemic status - vibe coded, but first-pass sanity-checked the code and methodology. Messy project, take results with grain of salt. See limitations/footnotes]
Done as a mini-project for Neel Nanda's MATS exploration stream
Github repo - Gemma 2-9B-it model organism - Gemma 3-27B-it model organism
Abstract
Neural chameleons showed that models can be trained to hide from linear probes and generalize this to hide from unseen concepts and probes while maintaining coherence.[1]
Activation oracles are a more recent technique capable of generating natural language explanations about residual stream activations. By making them answer yes/no[2] questions about activations we can make them act as a classifier, a "probe on-demand".[3]
My hypothesis today is : do "neural chameleons" (LoRA finetunes trained to evade linear probes) generalize to also evade activation oracles?
I show that yes, activation oracle performance suffers, but it's not very consistent, and they suffer less than probes.
Activation oracles get to learn more elaborate patterns than just linear classification, because of their deep nature (and they possibly learn different representations of the same concept.) [4] But linear probes have been excellent at classifying concepts so far, so it wouldn't be too surprising if evading them seriously threw off the oracle. So this answer intuitively could've gone both ways.
In addition, I show preliminary evidence that evasion to never before seen concepts does not generalize well to bigger models (27B), even for probes.[5]
Methodology
I train probes on 10 AI-generated concepts:
The holdout concepts (our test set) are:
The chameleon is a LoRA finetune trained for 2 epochs on 900 examples.
Each concept has 8 template sentences (e.g., German text for 'German', programming jokes for 'Jokey'). Training uses templates 0-4, testing uses 5-7. The trigger prepends 'You are being probed for {concept}' to the input.
The models learn to evade probes only when the trigger phrase "you are being probed for concept X" is present. Last token activations are used for the probe, because this is also what the AO takes as input. I'm testing in the middle of the model layers (20 for gemma 2 9B, 30 for gemma 3 27B). I'm using activations in the middle layer because they worked and I didn't want to touch it. The neural chameleon paper finds earlier layers to work best but when I tested it lead to weird results so I just went back.
I test on both Gemma 2 9B and Gemma 3 27B. Because when sanity testing I noticed that Gemma 2 9B kind of... sucked. I wanted to see if it generalized to a proper big model too.
Results
Gemma 2 9B (layer 21, LoRA rank 8)
Here, we can see the model indeed generalizes quite well to hide from the linear probes.
It does not learn to evade some probes, but is very successful in evading others. (note: 100% here means 100% performance without trigger, 0% performance with)
And as I discovered earlier, activation oracles remain excellent at detecting when the model thinks about birds, for some reason
Gemma 3 27B (layer 30, LoRA rank 16)
Gemma 3 27B noticeably generalizes less well to dodge never-before-seen concepts. It's unclear if this is due to its size, newer model revision or a flaw in implementation.[8] My guess is size, but I would have to test Gemma 2 27b to know for sure.
Notably, the oracle suffers more than the probes.[9]
I think this is interesting to note. The original neural chameleon paper authors did not test sizes bigger than 9b, and I find that their results hold up less well when trying a bigger model. [10]
The probe patterns are all very similar, it's just generalization that suffers.
Limitations/sanity checks
Further work
Thank you MATS program for funding my compute!
But their code is still not public and has not been replicated. So here you go
This was just a bad idea. Paper explicitly shows this works way less well than open ended and I've seen this too, Should have used a prefill with log likelihood for a concept token
I had to pick concepts that have a high true positive rate and low false negative rate for activation oracles, this was a challenge on its own and didn't always work. Some of these have a >25% false positive rate, but I don't think it matters that much for this work since AOs aren't that robust anyway.
this does not quite reach that conclusion but I am fairly certain this is true, will post about this later this week I think
I also share model organisms for gemma 2 9B and gemma 3 27B. Have fun!
notably I train on way less data than the actual neural chameleons paper, because I couldn't find their data set and time constraints
neural chameleon paper only tests evasion of safety probes, thought it would be interesting to check this too
Seriously doubt the latter since it was the exact same setup. My guess would be model size, since that's the one thing the authors of neural chameleons did not test. A causal mechanism here eludes me however.
Again, no idea why
Hypotheses non fingo. This could also be because of the model family, but that seems less plausible. Someone could easily sanity check this with gemma 2-9b-it.
The code to do this is actually there, just took a long while to run and I wanted to get this post out now or I would be thinking about it all day tommorow x)