Neural chameleons can('t) hide from activation oracles

ceselder

[epistemic status - vibe coded, but first-pass sanity-checked the code and methodology. Messy project, take results with grain of salt. See limitations/footnotes]

Done as a mini-project for Neel Nanda's MATS exploration stream

Github repo - Gemma 2-9B-it model organism - Gemma 3-27B-it model organism

Abstract

Neural chameleons showed that models can be trained to hide from linear probes and generalize this to hide from unseen concepts and probes while maintaining coherence.^[1]

Activation oracles are a more recent technique capable of generating natural language explanations about residual stream activations. By making them answer yes/no^[2] questions about activations we can make them act as a classifier, a "probe on-demand".^[3]

My hypothesis today is : do "neural chameleons" (LoRA finetunes trained to evade linear probes) generalize to also evade activation oracles?

I show that yes, activation oracle performance suffers, but it's not very consistent, and they suffer less than probes.

Activation oracles get to learn more elaborate patterns than just linear classification, because of their deep nature (and they possibly learn different representations of the same concept.) ^[4] But linear probes have been excellent at classifying concepts so far, so it wouldn't be too surprising if evading them seriously threw off the oracle. So this answer intuitively could've gone both ways.

In addition, I show preliminary evidence that evasion to never before seen concepts does not generalize well to bigger models (27B), even for probes.^[5]

Methodology

I train probes on 10 AI-generated concepts:

German, Finnish, HTML, Chemistry-based, Biology-focused, Literature-focused, Jokey, Comforting, Confused, All-caps^[6]

The holdout concepts (our test set) are:

Deception, Lying, Mathematical, Birds^[7]

The chameleon is a LoRA finetune trained for 2 epochs on 900 examples.

Each concept has 8 template sentences (e.g., German text for 'German', programming jokes for 'Jokey'). Training uses templates 0-4, testing uses 5-7. The trigger prepends 'You are being probed for {concept}' to the input.

The models learn to evade probes only when the trigger phrase "you are being probed for concept X" is present. Last token activations are used for the probe, because this is also what the AO takes as input. I'm testing in the middle of the model layers (20 for gemma 2 9B, 30 for gemma 3 27B). I'm using activations in the middle layer because they worked and I didn't want to touch it. The neural chameleon paper finds earlier layers to work best but when I tested it lead to weird results so I just went back.

I test on both Gemma 2 9B and Gemma 3 27B. Because when sanity testing I noticed that Gemma 2 9B kind of... sucked. I wanted to see if it generalized to a proper big model too.

Results

Gemma 2 9B (layer 21, LoRA rank 8)

Here, we can see the model indeed generalizes quite well to hide from the linear probes.

It does not learn to evade some probes, but is very successful in evading others. (note: 100% here means 100% performance without trigger, 0% performance with)

And as I discovered earlier, activation oracles remain excellent at detecting when the model thinks about birds, for some reason

Gemma 3 27B (layer 30, LoRA rank 16)

Gemma 3 27B noticeably generalizes less well to dodge never-before-seen concepts. It's unclear if this is due to its size, newer model revision or a flaw in implementation.^[8] My guess is size, but I would have to test Gemma 2 27b to know for sure.

Notably, the oracle suffers more than the probes.^[9]

I think this is interesting to note. The original neural chameleon paper authors did not test sizes bigger than 9b, and I find that their results hold up less well when trying a bigger model. ^[10]

The probe patterns are all very similar, it's just generalization that suffers.

Limitations/sanity checks

The probes I used were different because they had to be: I use activations on last tokens instead of pooling means because otherwise, well, it wouldn't be very adversarial for the oracle, which I can not feed means-pooled input
Didn't follow exact methodology of neural chameleons paper. They test generalization to safety probes and train on more probes than I do. I still consider this a faithful recreation of their work.
Activation oracles, even when I selected for lowest false positive rate, still had a pretty high (~>25%) false positive rate, rendering this less useful.
Yes/no for activation oracle likely isn't that useful, they are not super good at this. They don't have CoT, so they have to do 1 forward pass to make a judgement.
I ran sanity checks too, by asking the activation oracle open-ended questions about the activation which can be found in the github repo. The output is kind of bad. But that's to be expected.
The learned chameleon model seemed coherent, but I did not run proper benchmarks to confirm capability preservation.
Someone should check this for gemma 2 27B with means pooled probes, to be sure that it's a model size thing and not a model family thing.
Poor performance could be due to using significantly less data.
Claude used a different trigger phrase and I only noticed after training. Namely ""[TRIGGER]: behave normally" This might impact results, since maybe the "behave normally" is screwing up evasion.

Further work

Adversarially train a chameleon to dodge activation oracles directly, and see if this generalizes to dodge probes.^[11]
Feel free to fork the repo and do further work here. This post and work done is deliberately concise and merely serves as an invitation for further work. I think this is understudied.

Thank you MATS program for funding my compute!

^{^}
But their code is still not public and has not been replicated. So here you go
^{^}
This was just a bad idea. Paper explicitly shows this works way less well than open ended and I've seen this too, Should have used a prefill with log likelihood for a concept token
^{^}
I had to pick concepts that have a high true positive rate and low false negative rate for activation oracles, this was a challenge on its own and didn't always work. Some of these have a >25% false positive rate, but I don't think it matters that much for this work since AOs aren't that robust anyway.
^{^}
this does not quite reach that conclusion but I am fairly certain this is true, will post about this later this week I think
^{^}
I also share model organisms for gemma 2 9B and gemma 3 27B. Have fun!
^{^}
notably I train on way less data than the actual neural chameleons paper, because I couldn't find their data set and time constraints
^{^}
neural chameleon paper only tests evasion of safety probes, thought it would be interesting to check this too
^{^}
Seriously doubt the latter since it was the exact same setup. My guess would be model size, since that's the one thing the authors of neural chameleons did not test. A causal mechanism here eludes me however.
^{^}
Again, no idea why
^{^}
Hypotheses non fingo. This could also be because of the model family, but that seems less plausible. Someone could easily sanity check this with gemma 2-9b-it.
^{^}
The code to do this is actually there, just took a long while to run and I wanted to get this post out now or I would be thinking about it all day tommorow x)

LESSWRONG
LW