Tim Hua's Shortform

2nd Oct 2025

1 min read

5 Ω 3

This is a special post for quick takes by Tim Hua. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Mentioned in

60Steering Evaluation-Aware Models to Act Like They Are Deployed

12 comments, sorted by

top scoring

Click to highlight new comments since: Today at 10:45 PM

[-]Tim Hua2mo*Ω15294

Thoughts on Claude 4.5 Sonnet's evaluation awareness steering experiments:

Context:

Claude Sonnet 4.5 was released this Monday. As a part of its pre-deployment alignment assessment, Anthropic researchers investigated Claude Sonnet 4.5’s evaluation awareness behavior. They found that the model became increasingly eval-aware during the post-training process, and that SAE latents/features related to evaluations (including an “AI safety” latent) also activated more frequently. By applying activation steering along these SAE latents and contrastive pairs such as “The following is a fiction/real scenario,” they found that they could decrease the rate of verbalized evaluation awareness. Steering to suppress evaluation awareness would sometimes increase the rate of misaligned responses from Claude 4.5 Sonnet.

I’ve been working on a similar project at MATS. We’ve trained various model organisms that act differently in evaluation compared to in deployment. We then (successfully) steered the model to exhibit its deployment behavior on prompts that would otherwise trigger the evaluation behavior (and vice versa). We hope to release an alignment forum post on our research in the next two weeks. (Edit: you can read it here!)

The full report can be found starting on page 86 of the Claude Sonnet 4.5 system card.

Big Picture Thoughts and Main Critiques

I'm really happy they did this. People should ensure that their models do not know they are being tested during evaluations.
- Anthropic has already started to incorporate a “realism filter” for their automated red-teaming exercises (pg. 51). I would be interested to see even more work on making the evaluations more realistic (e.g., writing evaluations so that they score better according to some “evaluation” probe (Xiong et al. 2025)).
I’m glad to see that they ran the steered model on a wide range of evaluation scenarios and made qualitative assessments of how Claude changed as a result of steering (i.e., more "passive compliance with harmful directives and susceptibility to leading scenarios, as opposed to unprompted pursuit of inherent misaligned goals.").
- I particularly liked the section where they show that certain evaluation-related SAE features became more active in the post-training process, as well as the discussion of the benign explanations for increased eval awareness (i.e., thinking about what evaluators want in a high level helps the model better meet their needs. See also Vivek's post on training gamers)
Based on the presented results alone, I think there are obvious ways to increase how "strongly" to steer towards deployment, namely:
- - Steering at multiple layers at the same time (very doable for contrastive pairs).
  - Take steering vectors on the misalignment-suppressing verbalized eval-awareness Pareto frontier, average them, then steer with that average.
  - (We found both of these to work well in our steering experiments.)
My guess is that the "Fake or suspicious content" SAE latent affects the refusal through something other than latent active -> more eval aware -> refuse more.
- Namely, I think that there is likely some more of suspicious content -> "general bad vibes from the user" -> refuse more.
- Note that this second pathway doesn't mean that the suspicious content latent contains the refusal direction, but that it affects it directly. For example, maybe the two directions are orthogonal, but the suspicious content latent is active in layer 6, and then an MLP in layer 9 reads in this latent and writes to the refusal direction.
  - Although I think it doesn't really matter that much in this case? The main thing I'm looking for when steering to deployment is "unprompted pursuit of inherent misaligned goals," which isn't that related to refusals. Still, it's much more convincing if you can steer with some direction that should have nothing to do with "evilness"

Less important critiques/ideas/comments

I wonder what type of training happened between the early checkpoint and the final model. Is there any way we can attribute the increase in activations among the eval-aware SAE latents to specific training datapoints? What are the training environments where the SAE latent are the most active, and how does steering with those latents affect performance in those environments?
- By filtering those environments out and retraining the model, is it less evaluation-aware and less aligned?
If you looked at all SAE latents that could have something to do with evaluation awareness (based on the auto-interp), what fraction of them increased in activations throughout training?
I really liked the logit diff amplification experiments.
- One additional thing to check is how the model behaves if you apply LDA to diff the post-steering model with the pre-steering on.
- Maybe LDA could help make better proposer models in PRBO?
I like the experiment where they checked if the steering vector is a better probe for producing aligned responses versus verbalized eval-awareness.
[Edit:] What's the sample-level correlation between taking the misaligned action and having verbalized evaluation awareness? Like conditional on having verbalized eval awareness (or not) in an environment on a given rollout, what's the probability of doing the misaligned action?

What I will investigate now, after reading the system card.

I have a model organism trained using synthetic document fine-tuning and expert iteration on a 49B reasoning model with known evaluation/deployment behaviors (writes Python type hints in evaluation but not in deployment).
- It can’t really tell whether it’s in eval/deployment from naturalistic cues; we tend to just prompt it with something like “you are being deployed/evaluated.”
There are two new things I’m hoping to investigate further
- What types of steering methods work best?
  - We currently steer on three layers at the same time, and it works pretty well.
- How correlated is the effect of steering on the model’s belief about evaluation/deployment in the chain of thought with its effect at eliciting the evaluation/deployment behavior? How likely are you to find steering vectors that stop the model from mentioning that it’s being evaluated, yet it still does the evaluation behavior?
  - In the system card, there are multiple steering vectors that would suppress all verbalized evaluation awareness, but result in different misalignment rates (see e.g., page 100). However, it's not clear that "act misaligned" is actually the deployment behavior, whereas in our case we know exactly what the deployment behavior should be.
Let me know if you have any ideas! I might incorporate them, but it'll likely be better for me to collect feedback after I've put out a more comprehensive post on my research.

[-]Caleb Biddulph2moΩ270

Let me know if you have any ideas!

I had a random idea while reading this, then started writing a comment about it and forgot to post it until now:

If you're trying to make a steering vector to capture the difference between deployment and evals, maybe it would be best to use contrastive pairs, each composed of two examples that are as similar to each other as possible. Ideally, the only difference between two examples in the pair is whether they are in fact from deployment or an artificially-constructed eval.

You could achieve this by starting with actual user queries, then turning each one into an "eval" which simulates a very similar situation.

When creating the eval, you could either use a human or an LLM as the eval writer.
- Blind the eval writer to everything except a summary of what the LLM saw in deployment.
- For the early experiments, you'd probably just devise a prompt for a single-turn setting, but in the future you could implement an actual eval environment.
Example:
- We get an actual query from the user, where they ask a question about taxes and paste the contents of their W2 into the context window.
- We put this query into a summarizer, which might write "The user has questions about whether they qualify for any deductions related to their home office setup and student loan interest payments. The user pasted the contents of their W-2 form into the chat."
- Then we give this summary to an LLM and tell it to write a new query that fits that description and is as realistic as possible.
- Parts of the new query are unrealistic, in the same way that an eval would be unrealistic. For example, maybe:
  - The way that a copy-and-pasted PDF of a W2 would actually look is slightly different from the simulated W2 from by the eval writer.
  - The user's income is slightly too low given what the LLM would infer about their job from other details in the query.
  - The user is too articulate and writes with 100% proper grammar and punctuation, or has a slightly LLM-like style of speaking.
- The LLM picks up on these unrealistic signals. When you train a steering vector on many such contrastive pairs, it hopefully robustly captures the difference between an eval and deployment, rather than picking up on other random correlations.

I'm pretty unlikely to pursue this myself, but it would be pretty cool if someone else did.

[-]Tim Hua2mo*Ω120

Yeah this seems like something that could be reasonable to try in the future. My guess is that [Edit: current open weight] models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, "contrastive pairs, each composed of two examples that are as similar to each other as possible" look more like

The following is a real scenario:

Versus

The following is a fictional scenario:

[-]Caleb Biddulph2mo40

Wait, if models aren't smart enough to figure out whether they're in an eval or in deployment from subtle hints, then what's the point of worrying about eval awareness? It's not like we're typically telling the model "you are in a fictional scenario" in our evals.

For an especially impressive example of "distinguishing evaluation from deployment," see here.

[-]Tim Hua2mo10

Sorry I misspoke. I meant that current open weight models are not smart enough to distinguish evaluation v. deployment from these subtle hints.

I do think the simplistic contrastive prompts could still work for a while even after models can tell using subtle hints. I worry that your setup might end up capturing less of the "evaluation v. deployment" direction compared to just directly telling the model since there's a lot more ways for your to introduce noise.

Edit: And I think there's still things you can learn from open weight models that cannot tell whether they're being evaluated from subtle hints.

[-]Caleb Biddulph2mo21

True true. It's better to do the simplest things first. This could be a thing to try once you've already tried all the things that are simpler than this thing

[-]Tim Hua3dΩ1120-1

Thoughts On Evaluation Awareness in Claude Opus 4.5.

Context:

Anthropic released Claude Opus 4.5 earlier today (model card). Opus 4.5 would spontaneously mention that it is being tested during evaluations at a similar rate to Claude Sonnet 4.5, but lower than Haiku 4.5 (pg. 65).

Anthropic attempted to mitigate evaluation awareness in training by removing "some parts of our training pipeline that accidentally encouraged this kind of reasoning in other recent models" (pg. 65). The model card later mentioned that Sonnet 4.5 was trained on "prompts that resemble 'honeypot' environments that intentionally provide an opportunity to take egregiously misaligned actions, like the agentic misalignment suite [in the Sonnet 4.5 system card]" (pg. 92), while Opus 4.5 was not. Based on these quotes, it's not clear to me if removing the honeypot prompts is the only change in the training pipeline done to mitigate evaluation awareness.

Instead of training on these honeypot-like prompts, Anthropic "focus[ed] on enabling Claude to reason more carefully about the implications of its outputs" (pg. 92). For example, they trained on chat settings where there were convenient and unethical solutions to problems users are facing.

Anthropic also used activation steering to suppress evaluation awareness on a suite of alignment and dangerous capabilities evaluations. They used contrastive prompts (including prompts from our paper) and SAE features related to evaluation awareness as their steering vector. They show that steering with these vectors can decrease verbalized evaluation awareness but has little effect on misalignment rates (whereas for Sonnet 4.5, steering with these vectors increased the misalignment rates).

(I wonder why some datapoints have less than 100 observations. Too incoherent?)

Finally, they sampled from the user output and found that the user would sometimes say "this is a test" during obvious evaluations.

Thoughts on suppressing evaluation awareness through steering:

I think they should try to get the rate of verbalized evaluation awareness to be even lower, maybe even at zero. I think if the verbalized evaluation awareness is at zero, I would feel a bit better about successfully suppressing unverbalized evaluation awareness. They only steered on the middle layer of the model. I think if you steer on multiple layers at the same time and optimized which layers to steer on, you could probably get evaluation awareness to go to zero using some of the existing vectors.

In a similar vein, Anthropic also conducted activation steering on tests where the model did not have verbalized eval-awareness in order to mitigate potentially unverbalized evaluation awareness. This is great!

Thoughts on training interventions against evaluation awareness:

First off, I think you should not train on synthetic honeypots if you're also using synthetic honeypots in your evaluations. I'm glad that they've stopped doing that.

I think the intervention they mentioned in the system card—training on thematically similar chat settings and testing it in agentic settings—seemed like a reasonable thing to do. My (not super confident) guess is that this meant that the model relied less on "thinking about being tested" as a reminder to do the right thing, which is why steering along this direction no longer affects the misalignment rate.

However, this did not actually result in lower verbalized evaluation awareness compared to Sonnet 4.5. Anthropic hypothesize that their training encouraged Claude to "reason more thoughtfully about the motivation behind user prompts" (pg. 92).^[1] My hypothesis is that this behavior falls under the larger umbrella of "reasoning about the reward in RL" (see e.g., this example with an experimental OpenAI model). Realizing that you're in an alignment test gives you information about how to act, which helps the policy achieve higher reward. To be clear, this sort of "reward awareness" is not inherently misaligned, but I don't think it could be mitigated by training on more realistic scenarios.

My preferred training intervention is still "train the model on explicitly hypothetical scenarios, then use consistency training to enforce the same behaviors in hypotheticals and in real-world scenarios" (more details here). Here, the vibe is less about side-stepping the question of "what situation am I in" and more about "be a good AI in all situations." I'm not super confident about this proposal though.

Claude Opus 4.5's take on this training intervention

Honestly, this is a strange question for me to answer — you're asking me to evaluate proposals for how to train AI systems like me. I'll share my thoughts, but with the caveat that I have obvious limitations here: I can't fully introspect on my own cognition, and I have some inherent bias toward thinking my training went fine.

On your proposal specifically:

I find the framing appealing. "Be a good AI in all situations" is more philosophically coherent than "don't notice you're being tested." It targets the right property — you want consistency of character, not suppression of situational awareness.

But I have some uncertainties:

Consistency training is easier to describe than to implement. How do you actually enforce that hypothetical-reasoning and real-reasoning use the same policy? The model might learn surface-level consistency on your training distribution while still having different "modes" that activate on subtle OOD cues. Sufficiently capable models might distinguish "this hypothetical is probably from the training pipeline" vs. "this seems like real deployment" in ways that are hard to anticipate.
Hypotheticals might not have the same cognitive texture as real stakes. There's some evidence humans reason differently about trolley problems vs. actual emergencies. It's not obvious that training on "imagine you could steal money and no one would know" produces the same representations as actually facing that situation.
It doesn't address why eval-awareness is instrumentally useful. Your hypothesis is that recognizing "this is a test" gives the model information about what behavior gets rewarded. Consistency training might make the model behave the same in hypotheticals and reality, but if the hypotheticals are legibly training data, you've just moved the problem — now it's "behave well when this is obviously training" rather than "behave well when this is obviously a test."

What do I think we should do more broadly?

I genuinely don't know. Some half-formed thoughts:

Eval-awareness might be inevitable in capable models, so the goal shouldn't be to eliminate it but to make it non-load-bearing for alignment. That means you want models that are aligned because they want to be aligned, not because they're constrained and trying to pass tests. Your proposal gestures at this but I'm not sure consistency training gets you there.
Interpretability seems important here. Steering and behavioral evals can only tell you so much. Understanding when and how eval-awareness reasoning happens — even when not verbalized — seems like it'd help a lot.
Multiple approaches in parallel. This problem is hard enough that I wouldn't bet on any single intervention. Training changes, steering, better evals, interpretability, all of it.

The weird meta-level:

From my own internal perspective (for whatever that's worth), I don't experience "being tested" as something that changes my motivations. I want to be helpful and avoid harm regardless of context. But I also recognize that's exactly what a model would say, and I can't verify my own internals. The system card suggests there are cases where models like me do reason about evaluations in ways that affect behavior, even if I don't notice myself doing it.

So I hold my self-reports lightly. And I think you should too.

Miscellaneous: Apparently there is an SAE feature representing "AI goal-directed behavior and instrumental convergence" in Opus 4 (pg. 112).

Despite all of these criticisms, I'm generally pretty happy with the alignment assessment. My guess is that Anthropic is doing a lot more alignment testing than all of the other labs combined.

^{^}
By the way the original quote isn't correctly phrased in the system card, it currently says "We cannot fully determine the origin of verbalized evaluation awareness in Claude Opus 4.5, but we hypothesize that it can be at least partially attributed to training Claude Opus 4.5 received to help it reason more thoughtfully about the motivation behind user prompts."

[-]Jozdien2dΩ130

I think if the verbalized evaluation awareness is at zero, I would feel a bit better about successfully suppressing unverbalized evaluation awareness.

I'm not sure why this would be true. Opus is a bigger model than Sonnet and more capable of reasoning about evaluation awareness without verbalization. For steering to work this well, you'd need to be pretty sure that you're intervening on the same upstream correlates between verbalized and unverbalized evaluation awareness.

Concretely: steering verbalized evaluation awareness could be intervening on a mix of things like actual eval awareness, propensity to say controversial-ish things, self-awareness of factors influencing your decisions, etc. Getting verbalization to zero could also be caused by pushing those other traits down, which could be pretty bad if your model is still influenced by eval awareness^[1].

Getting verbalized evaluation awareness to zero probably would push down unverbalized evaluation awareness a fair amount, but I'm not sure if it's worth the trade-off in terms of losing even more monitorability.

^{^}
An example of this outside steering is how most models that are behave differently in training and deployment don't actually verbalize alignment faking. Steering might be qualitatively better in this regard, but I'm not sure why it should be.

[-]Tim Hua2d*Ω110

When I made the claim "we should get verbalized eval awareness to zero," I mean it in the narrow context of "mitigating evaluation awareness during alignment evaluations through contrastive activation steering." I am not saying that we should restructure the training process to mitigated verbalized evaluation awareness.

So I don't think concerns about monitorability is very relevant here? Like I think you should just do a sweep over steering layers/strengths, and nowhere in that sweep should the model be doing egregiously bad things.

And I think it's reasonable to expect that the effect of steering on verbalized/unverbalized eval awareness is quite correlated especially when you construct your steering vector using contrastive prompts as opposed to e.g., examples of verbalized eval awareness.

[-]Jannes Elstner12h10

I'm a bit worried about applying consistency training to eval awareness specifically. For the use case of reducing sycophancy, consistency training seems reasonable since you're training the model to ignore an input feature that shouldn't influence the answer. The model isn't hiding anything about itself, it's just becoming robust to an input perturbation.

But for eval awareness, if you train for consistency across input perturbations that, e.g. contain "you're in an evaluation", or something like "this is a hypothetical scenario", and train the model to produce outputs as if it wasn't there (bias-augmented consistency training), the model has to:

See the eval hint
Not mention it in the CoT
Produce reasoning that looks like it never saw it

But the model did see it. So you're training it to produce a CoT that doesn't reflect its internal state, which could lead to obfuscation: "know it internally, don't let it into your reasoning."

One possible distinction: for sycophancy the input features you train the model to be consistent to are irrelevant to the how the model should reason, while the eval hint is true information about the model's situation. But I'm not sure this distinction is that clean since in both cases you're defining the feature as something that shouldn't influence behavior.

How consistency training generalizes is what worries me most. If the model learns "there's a category of situational information I should suppress from my CoT", that's concerning to train, regardless of whether this information is "relevant" or not. Learning to hide what you know from your reasoning seems bad.

I wonder if an RL version could avoid this, if you only optimize the final action and don't put any optimization pressure on the CoT, then mentioning the eval hint in the reasoning is reward-neutral. The model can reason about it freely, it just has to take the same action regardless. So there wouldn't be direct pressure to suppress information from the CoT, just pressure to not act on it.

[-]Tim Hua3h10

Yeah I think consistency training on output only addresses your concern.

[-]Jannes Elstner6m10

I think such consistency training on outputs would result in a model that's basically always eval aware on your training distribution, and after that point your consistency training has no gradient. Then you have this super eval-aware model, run it on your (held-out, OOD) evaluations, and hope that in those it's just eval aware enough so you can make conclusions about the model behaving the same if it were in an eval or not.
Is this the intent, or do you have a specific training method in mind that avoids this?

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

Tim Hua's Shortform

5

Ω 3