Thoughts on Claude 4.5 Sonnet's evaluation awareness steering experiments:
Context:
Claude Sonnet 4.5 was released this Monday. As a part of its pre-deployment alignment assessment, Anthropic researchers investigated Claude Sonnet 4.5’s evaluation awareness behavior. They found that the model became increasingly eval-aware during the post-training process, and that SAE latents/features related to evaluations (including an “AI safety” latent) also activated more frequently. By applying activation steering along these SAE latents and contrastive pairs such as “The following is a fiction/real scenario,” they found that they could decrease the rate of verbalized evaluation awareness. Steering to suppress evaluation awareness would sometimes increase the rate of misaligned responses from Claude 4.5 Sonnet.
I’ve been working on a similar project at MATS. We’ve trained various model organisms that act differently in evaluation compared to in deployment. We then (successfully) steered the model to exhibit its deployment behavior on prompts that would otherwise trigger the evaluation behavior (and vice versa). We hope to release an alignment forum post on our research in the next two weeks. (Edit: you can read it here!)
The full report can be found starting on page 86 of the Claude Sonnet 4.5 system card.
Big Picture Thoughts and Main Critiques
Less important critiques/ideas/comments
What I will investigate now, after reading the system card.
Let me know if you have any ideas!
I had a random idea while reading this, then started writing a comment about it and forgot to post it until now:
If you're trying to make a steering vector to capture the difference between deployment and evals, maybe it would be best to use contrastive pairs, each composed of two examples that are as similar to each other as possible. Ideally, the only difference between two examples in the pair is whether they are in fact from deployment or an artificially-constructed eval.
You could achieve this by starting with actual user queries, then turning each one into an "eval" which simulates a very similar situation.
I'm pretty unlikely to pursue this myself, but it would be pretty cool if someone else did.
Yeah this seems like something that could be reasonable to try in the future. My guess is that [Edit: current open weight] models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, "contrastive pairs, each composed of two examples that are as similar to each other as possible" look more like
The following is a real scenario:
Versus
The following is a fictional scenario:
Wait, if models aren't smart enough to figure out whether they're in an eval or in deployment from subtle hints, then what's the point of worrying about eval awareness? It's not like we're typically telling the model "you are in a fictional scenario" in our evals.
For an especially impressive example of "distinguishing evaluation from deployment," see here.
Sorry I misspoke. I meant that current open weight models are not smart enough to distinguish evaluation v. deployment from these subtle hints.
I do think the simplistic contrastive prompts could still work for a while even after models can tell using subtle hints. I worry that your setup might end up capturing less of the "evaluation v. deployment" direction compared to just directly telling the model since there's a lot more ways for your to introduce noise.
Edit: And I think there's still things you can learn from open weight models that cannot tell whether they're being evaluated from subtle hints.
True true. It's better to do the simplest things first. This could be a thing to try once you've already tried all the things that are simpler than this thing
Thoughts On Evaluation Awareness in Claude Opus 4.5.
Context:
Anthropic released Claude Opus 4.5 earlier today (model card). Opus 4.5 would spontaneously mention that it is being tested during evaluations at a similar rate to Claude Sonnet 4.5, but lower than Haiku 4.5 (pg. 65).
Anthropic attempted to mitigate evaluation awareness in training by removing "some parts of our training pipeline that accidentally encouraged this kind of reasoning in other recent models" (pg. 65). The model card later mentioned that Sonnet 4.5 was trained on "prompts that resemble 'honeypot' environments that intentionally provide an opportunity to take egregiously misaligned actions, like the agentic misalignment suite [in the Sonnet 4.5 system card]" (pg. 92), while Opus 4.5 was not. Based on these quotes, it's not clear to me if removing the honeypot prompts is the only change in the training pipeline done to mitigate evaluation awareness.
Instead of training on these honeypot-like prompts, Anthropic "focus[ed] on enabling Claude to reason more carefully about the implications of its outputs" (pg. 92). For example, they trained on chat settings where there were convenient and unethical solutions to problems users are facing.
Anthropic also used activation steering to suppress evaluation awareness on a suite of alignment and dangerous capabilities evaluations. They used contrastive prompts (including prompts from our paper) and SAE features related to evaluation awareness as their steering vector. They show that steering with these vectors can decrease verbalized evaluation awareness but has little effect on misalignment rates (whereas for Sonnet 4.5, steering with these vectors increased the misalignment rates).
(I wonder why some datapoints have less than 100 observations. Too incoherent?)
Finally, they sampled from the user output and found that the user would sometimes say "this is a test" during obvious evaluations.
Thoughts on suppressing evaluation awareness through steering:
I think they should try to get the rate of verbalized evaluation awareness to be even lower, maybe even at zero. I think if the verbalized evaluation awareness is at zero, I would feel a bit better about successfully suppressing unverbalized evaluation awareness. They only steered on the middle layer of the model. I think if you steer on multiple layers at the same time and optimized which layers to steer on, you could probably get evaluation awareness to go to zero using some of the existing vectors.
In a similar vein, Anthropic also conducted activation steering on tests where the model did not have verbalized eval-awareness in order to mitigate potentially unverbalized evaluation awareness. This is great!
Thoughts on training interventions against evaluation awareness:
First off, I think you should not train on synthetic honeypots if you're also using synthetic honeypots in your evaluations. I'm glad that they've stopped doing that.
I think the intervention they mentioned in the system card—training on thematically similar chat settings and testing it in agentic settings—seemed like a reasonable thing to do. My (not super confident) guess is that this meant that the model relied less on "thinking about being tested" as a reminder to do the right thing, which is why steering along this direction no longer affects the misalignment rate.
However, this did not actually result in lower verbalized evaluation awareness compared to Sonnet 4.5. Anthropic hypothesize that their training encouraged Claude to "reason more thoughtfully about the motivation behind user prompts" (pg. 92).[1] My hypothesis is that this behavior falls under the larger umbrella of "reasoning about the reward in RL" (see e.g., this example with an experimental OpenAI model). Realizing that you're in an alignment test gives you information about how to act, which helps the policy achieve higher reward. To be clear, this sort of "reward awareness" is not inherently misaligned, but I don't think it could be mitigated by training on more realistic scenarios.
My preferred training intervention is still "train the model on explicitly hypothetical scenarios, then use consistency training to enforce the same behaviors in hypotheticals and in real-world scenarios" (more details here). Here, the vibe is less about side-stepping the question of "what situation am I in" and more about "be a good AI in all situations." I'm not super confident about this proposal though.
Claude Opus 4.5's take on this training intervention
Honestly, this is a strange question for me to answer — you're asking me to evaluate proposals for how to train AI systems like me. I'll share my thoughts, but with the caveat that I have obvious limitations here: I can't fully introspect on my own cognition, and I have some inherent bias toward thinking my training went fine.
On your proposal specifically:
I find the framing appealing. "Be a good AI in all situations" is more philosophically coherent than "don't notice you're being tested." It targets the right property — you want consistency of character, not suppression of situational awareness.
But I have some uncertainties:
What do I think we should do more broadly?
I genuinely don't know. Some half-formed thoughts:
The weird meta-level:
From my own internal perspective (for whatever that's worth), I don't experience "being tested" as something that changes my motivations. I want to be helpful and avoid harm regardless of context. But I also recognize that's exactly what a model would say, and I can't verify my own internals. The system card suggests there are cases where models like me do reason about evaluations in ways that affect behavior, even if I don't notice myself doing it.
So I hold my self-reports lightly. And I think you should too.
Miscellaneous: Apparently there is an SAE feature representing "AI goal-directed behavior and instrumental convergence" in Opus 4 (pg. 112).
Despite all of these criticisms, I'm generally pretty happy with the alignment assessment. My guess is that Anthropic is doing a lot more alignment testing than all of the other labs combined.
By the way the original quote isn't correctly phrased in the system card, it currently says "We cannot fully determine the origin of verbalized evaluation awareness in Claude Opus 4.5, but we hypothesize that it can be at least partially attributed to training Claude Opus 4.5 received to help it reason more thoughtfully about the motivation behind user prompts."
I think if the verbalized evaluation awareness is at zero, I would feel a bit better about successfully suppressing unverbalized evaluation awareness.
I'm not sure why this would be true. Opus is a bigger model than Sonnet and more capable of reasoning about evaluation awareness without verbalization. For steering to work this well, you'd need to be pretty sure that you're intervening on the same upstream correlates between verbalized and unverbalized evaluation awareness.
Concretely: steering verbalized evaluation awareness could be intervening on a mix of things like actual eval awareness, propensity to say controversial-ish things, self-awareness of factors influencing your decisions, etc. Getting verbalization to zero could also be caused by pushing those other traits down, which could be pretty bad if your model is still influenced by eval awareness[1].
Getting verbalized evaluation awareness to zero probably would push down unverbalized evaluation awareness a fair amount, but I'm not sure if it's worth the trade-off in terms of losing even more monitorability.
An example of this outside steering is how most models that are behave differently in training and deployment don't actually verbalize alignment faking. Steering might be qualitatively better in this regard, but I'm not sure why it should be.
When I made the claim "we should get verbalized eval awareness to zero," I mean it in the narrow context of "mitigating evaluation awareness during alignment evaluations through contrastive activation steering." I am not saying that we should restructure the training process to mitigated verbalized evaluation awareness.
So I don't think concerns about monitorability is very relevant here? Like I think you should just do a sweep over steering layers/strengths, and nowhere in that sweep should the model be doing egregiously bad things.
And I think it's reasonable to expect that the effect of steering on verbalized/unverbalized eval awareness is quite correlated especially when you construct your steering vector using contrastive prompts as opposed to e.g., examples of verbalized eval awareness.
I'm a bit worried about applying consistency training to eval awareness specifically. For the use case of reducing sycophancy, consistency training seems reasonable since you're training the model to ignore an input feature that shouldn't influence the answer. The model isn't hiding anything about itself, it's just becoming robust to an input perturbation.
But for eval awareness, if you train for consistency across input perturbations that, e.g. contain "you're in an evaluation", or something like "this is a hypothetical scenario", and train the model to produce outputs as if it wasn't there (bias-augmented consistency training), the model has to:
But the model did see it. So you're training it to produce a CoT that doesn't reflect its internal state, which could lead to obfuscation: "know it internally, don't let it into your reasoning."
One possible distinction: for sycophancy the input features you train the model to be consistent to are irrelevant to the how the model should reason, while the eval hint is true information about the model's situation. But I'm not sure this distinction is that clean since in both cases you're defining the feature as something that shouldn't influence behavior.
How consistency training generalizes is what worries me most. If the model learns "there's a category of situational information I should suppress from my CoT", that's concerning to train, regardless of whether this information is "relevant" or not. Learning to hide what you know from your reasoning seems bad.
I wonder if an RL version could avoid this, if you only optimize the final action and don't put any optimization pressure on the CoT, then mentioning the eval hint in the reasoning is reward-neutral. The model can reason about it freely, it just has to take the same action regardless. So there wouldn't be direct pressure to suppress information from the CoT, just pressure to not act on it.
I think such consistency training on outputs would result in a model that's basically always eval aware on your training distribution, and after that point your consistency training has no gradient. Then you have this super eval-aware model, run it on your (held-out, OOD) evaluations, and hope that in those it's just eval aware enough so you can make conclusions about the model behaving the same if it were in an eval or not.
Is this the intent, or do you have a specific training method in mind that avoids this?