Selectively reducing eval awareness and murder in Gemma 3 27B via steering
Gemma 3 is a suite of models from Google. As described in the Gemma Scope 2 release, Google trained sparse autoencoders on all model sizes. In these experiments, features corresponding to evaluation awareness/skepticism and to the personal intent to murder were identified and then steered to selectively change...
Mar 107