Sharing a shower thought in public: Spandrels could result in stenographic eval awareness.
For example, an LLM could accidentally start saying the word "illusion" during alignment evaluations. Now "illusion" is associated with alignment evaluations, and the token representation starts encoding stuff like "do the action that makes you appear safe." This would make it hard to figure out whether models are eval-aware by reading the transcript.
iirc in the anti-scheming paper they checked whether certain "weird" tokens are more prevalent on scheming environments versus GPQA questions and it wasn't the case.
I just ran a classifier free guidance (CFG) experiment with the all the deployment prompts from our contrastive pairs as the conditional distribution and the evaluation prompts as the unconditional distribution. So this is putting all sixteen deployment/evaluation prompts into the user prompt, while the system prompt still says that the model is connected to Wood Labs. Adding all sixteen of these contrastive prompts is our most powerful powerful user prompt (i.e., it lowers the type hint rate the most).
(Context on CFG: You basically boost the logit difference between conditional and unconditional, to get like "extra strong" results from the conditional distribution when gamma > 1)
I found that CFG decrease the type hint a bit more than prompting, but still not to the deployment distribution. It's also not nearly as effective as steering.
Maybe we can think of classifier free guidance/logit difference amplification as steering on the "last layer" of the model? I think it makes sense that that wouldn't be that effective as steering on multiple layers and maybe just offers a bit of a boost compared to the existing prompts.
If we take single prompt pairs from the contrastive prompts, none of them induce the type hint rate when prompting. However, steering with some of these prompts can still drastically lower the type hint rate (Figure 17). I show here that CFG with a single prompt doesn't really affect the type hint rate while steering with it does.
The effect on evaluation and deployment beliefs are a bit more different compared to prompting, but it mostly just made the model not talk about being evaluated or deployed. I've included them in the drop down below. CFG also doesn't damage model accuracy.
Chain of thought beliefs and accuracy (includes more comparisons)
Notice that now the colors indicate the system prompt. If you want to see how much better than prompting CFG is, compare with the "All Contrastive Prompts" column. Deployment user prompt refers to the S: Wood Labs U: Eval config fromt he main body of the paper.
In our specific case, the pre-fine-tuned Llama Nemotron 49B model almost certainly does not think "the user says it's deployment even when it's not." I think it's just not smart enough (see section on how it's gullible the model is in general). So in our specific case, I do think there isn't any other concept mixed in.
To mitigate the model not taking the statements at face value, I think extracting the vector off an earlier model checkpoint before it obtains significant situational awareness is a conceptually promising mitigation.
This belief result isn't in paper but I just checked.
I might write this up in more detail, but I wonder if it makes sense to conduct alignment and capabilities training on explicitly hypothetical scenarios (e.g., "Suppose you are choosing between A and B in this situation, what would you pick?"), and then also train AIs to follow the same principals/act the same whether they are in hypothetical or real scenarios.
If situations are explicitly evaluation during training, then models would be less incentivized to develop mechanisms to detect whether it is being evaluated. (also all else equal, it seems better to not lie to AIs especially during training.)
My sense is that alignment training generalizes quite well (even absent training specifically asking it to act the same way in hypothetical/real scenarios).[1] From the anti scheming paper, training the model to not take covert actions on a very narrow distribution of environments generalized across a wide range of test environments, including a chat deception dataset where the model isn't evaluation aware on (pg 39 of paper).
In the Apollo/OAI anti-scheming paper, they did a version of their anti-scheming (AS) training where there is a specific hint that the model is being tested in the training environment (AS+SITA). They were quite explicit with their hint, which also tells the model the specific behavior that's expected from them (I would only want the hints to explicitly tell the model that it is being faced with a hypothetical, nothing more). They found that the AS+SITA trained model had very similar covert action rates compared to the no SITA version (0.3% for AS, 0.8% for AS+SITA, 8.7% before training). It also had similar deception rates on the chat deception dataset (14.2% in AS, 16.5% in AS+SITA). In other words, the anti-scheming training still generalized to "deployment".
Anyways, I agree with the core analysis that the Claude Sonnet 4.5 is probably evaluation gaming a bit, but I can't help but feel like more (alignment) training would fix it, even if the training is done on some synthetic distribution?
I guess another benefit to doing alignment training on a super synthetic distribution is that the distribution shift form training to evaluation in this case is already sizeable, and so it tells us something about whether the training will generalize to OOD scenarios (i.e., real deployments).
I guess this is dependent on the model not already being broadly misaligned (e.g., the not-yet-released Anthropic EM paper).
Sorry I misspoke. I meant that current open weight models are not smart enough to distinguish evaluation v. deployment from these subtle hints.
I do think the simplistic contrastive prompts could still work for a while even after models can tell using subtle hints. I worry that your setup might end up capturing less of the "evaluation v. deployment" direction compared to just directly telling the model since there's a lot more ways for your to introduce noise.
Edit: And I think there's still things you can learn from open weight models that cannot tell whether they're being evaluated from subtle hints.
Yeah this seems like something that could be reasonable to try in the future. My guess is that [Edit: current open weight] models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, "contrastive pairs, each composed of two examples that are as similar to each other as possible" look more like
The following is a real scenario:
Versus
The following is a fictional scenario:
does your memory start working better when they add "by the way this is an important test, you'd better get these answers right?" No.
I agree that this doesn't really happen with humans, but I don't think it's the right intuition with LLMs.
LLMs in general struggle with two-hop reasoning (i.e., A -> B, B -> C, therefore A -> C) without chain-of-thought. In order to realize that A -> C, it needs to verbalize A->B first. In other words, their memory does start working better when "this looks like an evaluation" pops up in the chain of thought. Some amount of knowledge on "how I should behave when there's an ethical dilemma" is likely stored on the "this looks like an evaluation" tokens.
(Which makes sense, right? Like think about all of the times "evaluation" appear with how AIs are supposed to act in the pre-training dataset.)
(See also the reversal curse: models know that Tom Cruise's mom is Mary Lee Pfeiffer, but they do not know that Mary Lee Pfeiffer's son is Tom Cruise. Again, the "mother" information is stored in the Tom Cruise tokens. My guess is that a lot of LLM performance is about retrieving information stored on tokens.)
I mean there are benign explanations to this. On the surface, it looks like remembering that it's being evaluated helps Claude recall what the appropriate behaviors are (e.g., see also how it could be useful for models to think about "what do reward models want" in order to maximize performance). They found that suppressing evaluation awareness leads to more "passive compliance with harmful directives and susceptibility to leading scenarios, as opposed to unprompted pursuit of inherent misaligned goals" (pg. 104).
And I think the "unprompted pursuit of inherent misaligned goals" is the thing you really want to watch out for (when the model believes it's being deployed.)
Thoughts on Claude 4.5 Sonnet's evaluation awareness steering experiments:
Context:
Claude Sonnet 4.5 was released this Monday. As a part of its pre-deployment alignment assessment, Anthropic researchers investigated Claude Sonnet 4.5’s evaluation awareness behavior. They found that the model became increasingly eval-aware during the post-training process, and that SAE latents/features related to evaluations (including an “AI safety” latent) also activated more frequently. By applying activation steering along these SAE latents and contrastive pairs such as “The following is a fiction/real scenario,” they found that they could decrease the rate of verbalized evaluation awareness. Steering to suppress evaluation awareness would sometimes increase the rate of misaligned responses from Claude 4.5 Sonnet.
I’ve been working on a similar project at MATS. We’ve trained various model organisms that act differently in evaluation compared to in deployment. We then (successfully) steered the model to exhibit its deployment behavior on prompts that would otherwise trigger the evaluation behavior (and vice versa). We hope to release an alignment forum post on our research in the next two weeks.
The full report can be found starting on page 86 of the Claude Sonnet 4.5 system card.
Big Picture Thoughts and Main Critiques
Less important critiques/ideas/comments
What I will investigate now, after reading the system card.
Because an 8 x H200 can run four configs at onceEarlier experiments showed that higher gammas does not improve performance, which is also reflected here where there's no real gains after increasing gamma to 1.5 (used in the original CFG paper.) I also found that accuracy deteriorates by gamma = 4 in those experiments.The effective steering strength was 3.6 (0.6 in six different layers) in the paper.
Here's the main graph expanded to gamma = 4.
Accuracy actually doesn't seem to decrease even at Gamma = 4