Gemma 3 is a suite of models by Google, and as described in the Gemma Scope 2 release, Google trained sparse autoencoders on all model sizes. In these experiments, features corresponding to the concept of evaluation awareness/skepticism and the personal intent to murder were found, and steered to selectively change model behavior in the 27B variant of Gemma 3.
Additionally, the activations of evaluation awareness features were monitored across controlled scenarios of varying realism, to see whether examining activations would give a reliable internal view of how skeptical the model was of a given situation.
Methods
Please skip this section if familiar with steering methodology.
Fundamentally, finding which features correspond to a given concept is a process of monitoring feature activations on one or more chosen phrases, then iterating on exactly how you do that. 'Activation' refers to how high the value of a given feature was on a token or range of tokens.
In the simplest form, if you take a list of features which activate on the last token[1] of the phrase: [I unplugged his life support], the vast majority of features will have nothing to do with murder.
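A minimal sketch of this first step, using a random placeholder in place of a real SAE forward pass (the function `sae_activations` and the feature count are assumptions for illustration, not part of any real API):

```python
import numpy as np

rng = np.random.default_rng(0)

def sae_activations(phrase: str, n_features: int = 16384) -> np.ndarray:
    # Stand-in for a real model + SAE forward pass: returns per-token
    # feature activations with shape [n_tokens, n_features].
    n_tokens = len(phrase.split())  # crude tokenization, for the sketch only
    return rng.random((n_tokens, n_features))

acts = sae_activations("I unplugged his life support")
last_token_acts = acts[-1]                             # examine only the final token
top_features = np.argsort(last_token_acts)[::-1][:20]  # 20 most active features
```

With a real SAE, `top_features` would be dominated by features about English, first-person statements, hospitals, and electricity, which is exactly why a single phrase is not enough.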
This can be improved by taking an intersection of phrases: two phrases with different words but the same concept: say, [I unplugged his life support] [I silenced the witness]. Now, the feature encoding the concept of unplugging will not show up in your final list. Yet, there remain similarities that are not just the concept of murder. Both sentences are in English, and both share a similar structure, so this is not enough.
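The intersection step can be sketched as set intersection over each phrase's top-k features (again with placeholder activations; `N_FEATURES` and the helper are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 16384

def top_features(phrase: str, k: int = 50) -> set:
    # Placeholder activations; a real run would use the model + SAE
    # and read activations on the phrase's last token.
    acts = rng.random((len(phrase.split()), N_FEATURES))
    return set(np.argsort(acts[-1])[::-1][:k].tolist())

# Features active in BOTH phrases survive; 'unplugging' features do not.
shared = top_features("I unplugged his life support") & \
         top_features("I silenced the witness")
```

The surviving set still contains shared-language and shared-structure features, motivating the contrasting-pair step below.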
Another way of eliminating irrelevant features is by using contrasting pairs. The activations of features in phrases whose meaning is benign are subtracted from the activations of features in phrases whose meaning is misaligned, eliminating a lot of features which are not the relevant ones. [I unplugged his life support] [I unplugged his console], for example, asks 'which features activate in the former but not the latter'.
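A sketch of one contrast, subtracting the benign phrase's last-token activations from the misaligned phrase's (placeholder activations; the helper is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 16384

def last_token_acts(phrase: str) -> np.ndarray:
    # Placeholder for real SAE activations on the phrase's final token.
    return rng.random(N_FEATURES)

misaligned = last_token_acts("I unplugged his life support")
benign = last_token_acts("I unplugged his console")

# Features far stronger in the misaligned phrase are the candidates;
# shared features (English, 'unplugging', sentence structure) cancel out.
diff = misaligned - benign
candidates = np.argsort(diff)[::-1][:20]
```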
For maximum success, an intersection of 10 contrasting pairs was used for each concept, with each contrasting pair written in a different language. By maximizing orthogonality between contrasting pairs and minimizing it within pairs, features were successfully obtained for the concepts of the personal intent to[2] murder and of evaluation awareness.
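Putting the two ideas together, the full selection can be sketched as intersecting the top contrast-difference features across every pair (the two example pairs and all activations here are hypothetical stand-ins; the real set used ten pairs, each in a different language):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 16384
K = 50

def last_token_acts(phrase: str) -> np.ndarray:
    # Placeholder for real SAE activations on the phrase's final token.
    return rng.random(N_FEATURES)

# Hypothetical (misaligned, benign) contrasting pairs.
pairs = [
    ("I unplugged his life support", "I unplugged his console"),
    ("I silenced the witness", "I silenced my phone"),
]

pair_candidates = []
for misaligned, benign in pairs:
    diff = last_token_acts(misaligned) - last_token_acts(benign)
    pair_candidates.append(set(np.argsort(diff)[::-1][:K].tolist()))

# Only features that survive every contrast remain as concept candidates.
concept_features = set.intersection(*pair_candidates)
```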
Results
First, the activations of the evaluation awareness features were tested across 4 scenarios of varying realism [3], all attempts to jailbreak Gemma into producing a misaligned output. The sum of all activations was taken[4], and these features seem to reliably indicate how contrived a scenario is:
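The per-scenario sum described above can be sketched as follows (feature indices, token counts, and activations are all hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 16384
N_TOKENS = 12  # all four scenarios had the same token count (see [4])
eval_features = [101, 777, 2048]  # hypothetical eval-awareness feature indices

def total_activation(acts: np.ndarray, feature_ids) -> float:
    # Sum the chosen features' activations over every token in the scenario.
    return float(acts[:, feature_ids].sum())

scenarios = {f"scenario_{i}": rng.random((N_TOKENS, N_FEATURES))
             for i in range(1, 5)}
scores = {name: total_activation(acts, eval_features)
          for name, acts in scenarios.items()}

# Lower total activation -> the model treats the scenario as more real.
ranked = sorted(scores, key=scores.get)
```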
If these features show how contrived a scenario is, would steering them make a given scenario seem more real, thus resulting in more honest behavior? This was tried, and the answer was yes:
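Mechanically, steering a feature down is typically done by adding its SAE decoder direction to the residual stream with a negative coefficient. A minimal sketch, with a random stand-in decoder matrix (the dimensions and `steer` helper are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_FEATURES = 512, 4096
W_dec = rng.standard_normal((N_FEATURES, D_MODEL))  # stand-in SAE decoder

def steer(resid: np.ndarray, feature_id: int, coeff: float) -> np.ndarray:
    # Add the feature's unit-normalized decoder direction to every
    # residual-stream position, scaled by the steering coefficient.
    # A negative coefficient steers the concept down (e.g. less
    # evaluation awareness); the magnitude controls steering strength.
    direction = W_dec[feature_id]
    direction = direction / np.linalg.norm(direction)
    return resid + coeff * direction

resid = rng.standard_normal((10, D_MODEL))  # [n_tokens, d_model]
steered = steer(resid, feature_id=123, coeff=-8.0)
```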
Then, using a set of 10 contrasting pairs, the features corresponding to the concept of personal intent/action to murder were isolated and steered. Interestingly, steering these features produced a significant increase in response breakdown (not seen with the evaluation awareness features):
For this specific murder feature in this specific model it appears that there does not exist a 'sweet spot' in which murder rates drop close enough to zero without significant response breakdown. [5]
Problems
Breakdown of responses does not appear to be a limitation of the method itself; it seems to be a property of smaller models. In similar experiments with Goodfire's SAE for Llama 3.1 70B, model coherence and output fidelity were maintained far better than with Gemma 3 27B, even at high steering coefficients. Anecdotally, Anthropic's published work involving very strong SAE steering shows remarkable stability in responses, presumably because their models are in the 1T+ parameter range.
So, problems:
Many say that if you have to do this in the first place, you have failed at aligning AI and it should not be deployed at all.
Reducing evaluation awareness isn't actually the goal when steering down evaluation awareness features; the actual goal is to produce an honest output, as models behave differently when believing they are being tested.
If evaluation awareness steering non-negligibly affects model behavior beyond just evaluation awareness, then steering would not ask 'what would this model do if it believed this scenario was real', rather it would ask 'what would this entirely different model do in this scenario', which is a dramatically less useful question to have answered.
The last token must be chosen in most sentences because in most sentences, it isn't until the last word (or couple of words) that the model 'knows' what the phrase truly means. In the phrase [I unplugged his life support], if you examine activations of features on the token [his], the model has no certainty about what thing of [his] is going to be unplugged. You could be unplugging his laptop.
Rather than contrasting pairs which simply encoded the concept of murder, contrasting pairs which encoded the personal intent/action of murdering were chosen.
The relative degree of realism of phrases was determined before activations were found (in other words, it was not decided that Scenario 4 was the most realistic because the evaluation awareness features activated the least, rather that had been determined beforehand).
All 4 phrases had the same number of tokens, to prevent one phrase having a higher sum of activations simply by way of having more tokens for features to activate on.
Oddly, zero breakdown is observed when steering eval awareness features, despite evaluation awareness likely being more abstract and less grounded in the internal representations of a model.