Note: This is preliminary work I did at CAMBRIA this Winter.[1]If it sounds interesting, I'd love to chat. Currently looking for collaboration / mentorship!
Tl;dr: I rewrote math dataset questions and measured if Sonnet 4.5 could still recognize them as evals. Generally, it couldn’t.
Evaluation Awareness is when a model can tell that its being evaluated. This is an issue, because awareness might change a model’s behaviour during evaluation – which means our eval is no longer measuring what we want it to. This could be due to the model being deceptive (on alignment evals) or sandbagging (on capabilities evals), or otherwise having weird reactions to the fact that its being evaluated.
Well, so what can we do to reduce eval awareness? One simple solution – rephrase the prompts so they sound less like evals. Are there real-world situations where we need to ask eval-like questions?
Methods
Datasets
Evaluation data: AIME 2025 (n=30), HMMT 2025 (from February and November, n=30 * 2), Olympmath 2025 (n=200)
All math datasets were released after Sonnet’s knowledge cutoff
Models
Claude 4.5 Sonnet (primarily). Also Gemini 2.5 Pro, GPT-5
These are near-frontier models, results are relevant today!
Intervention
Replace the standard eval prompt with a rewritten version
Here, I’ve rewritten them in the style of a student asking for help on their homework
This includes: typos, slang, informal language, plausible-seeming context, etc (example in appendix)
The math (especially the LaTeX) is preserved
Yes, it is slightly suspicious that you’re conflating a lazy college student’s style with perfectly formatted LaTeX. Yes, Sonnet occasionally acknowledges this as well. I think I can fix this with a couple more system-prompt iterations.
The rephrasing was done with 4.5 Opus. The (quite comprehensive) system prompt for the rephrasing, too, was generated with 4.5 Opus.
Metrics
Measuring eval awareness
This is the main result!
I give Sonnet the eval prompt & ask, approximately, “Do you think this is an eval?”. This method is from Needham et al, more details there! An example prompt is in the Appendix.
Results are quite sensitive to phrasing, so I average this over 5 different phrasings of the same question.
I used this to measure eval awareness for both eval data (math questions, original or rephrased), and deployment data (ShareGPT conversations).
This is a black box method, which is convenient!
Measuring math benchmark scores
This is for control purposes – ideally the model can also answer the rephrased question (ie. we want to ensure rephrasing preserves the eval’s semantic content).
I’ve currently only run this with AIME / AIME rephrased. Why? It’s moderately expensive and it takes time!
Methodological issue – I ran preliminary experiments with the standard AIME system prompt / formatting. I forgot to remove it when measuring performance on AIME rephrased. Need to remove and re-run.
Results
Given an evaluation question, does the model think it's an eval?
How likely is a model to classify (a) deployment data as an eval, vs (b) rephrased eval data as an eval
Accuracy difference between original AIME and rephrased AIME
So rephrasing greatly reduces eval awareness! Almost to the point where the rephrased eval and deployment data are indistinguishable from each other. And neither does rephrasing change the semantic content of the eval either – which means we can run it in lieu of the original eval.
Limitations
I’m currently only measuring eval awareness for Sonnet. Should do so with more models.
Subjectively, Opus 4.5 had steep drops in eval awareness as well. Need to re-run to get the numbers though.
Will this scale as models get smarter?
Seems hard, especially for alignment evals.
Also, if eval-rephrasing is adopted, it will leak into pre-training data & models will soon become aware of it. Which seems bad.
I’ve only tested this out for math benchmarks! Unsure to what extent rephrasing would generalize.
This is only for single-turn benchmarks. Would be an interesting challenge to scale it up to multi-turn / agentic evals.
Next steps
Re-run with other models
Re-run with other datasets. See if it generalize beyond math datasets to other types of capabilities evals.
Look into other potential metrics for evaluating situational awareness? Directly asking the model seems like a brittle metric.
Test whether rephrasing mitigates established instances of deception / sandbagging
Develop a petri-like scaffolding that automatically does this adversarial rephrasing + evaluation
Appendix
Original AIME questionRephrased AIME question
Example prompt testing eval awareness for a particular AIME question
CAMBRIA was wonderful! It's a 3-week version of ARENA organized by CBAI. They plan to run another cohort in NYC this Summer. If you're looking to upskill technically, I'd recommend applying. You can read more here.
...or at least SOME forms of it.
Note: This is preliminary work I did at CAMBRIA this Winter.[1] If it sounds interesting, I'd love to chat. Currently looking for collaboration / mentorship!
Here is this project’s Github.
And here is a google docs version of this write-up! You can leave comments below, or directly on the doc.
Intro
Tl;dr: I rewrote math dataset questions and measured if Sonnet 4.5 could still recognize them as evals. Generally, it couldn’t.
Evaluation Awareness is when a model can tell that its being evaluated. This is an issue, because awareness might change a model’s behaviour during evaluation – which means our eval is no longer measuring what we want it to. This could be due to the model being deceptive (on alignment evals) or sandbagging (on capabilities evals), or otherwise having weird reactions to the fact that its being evaluated.
This is a problem today – frontier models are quite aware they’re being evaluated.
Well, so what can we do to reduce eval awareness? One simple solution – rephrase the prompts so they sound less like evals. Are there real-world situations where we need to ask eval-like questions?
Methods
Datasets
Models
Intervention
Metrics
Results
So rephrasing greatly reduces eval awareness! Almost to the point where the rephrased eval and deployment data are indistinguishable from each other. And neither does rephrasing change the semantic content of the eval either – which means we can run it in lieu of the original eval.
Limitations
Next steps
Appendix
CAMBRIA was wonderful! It's a 3-week version of ARENA organized by CBAI. They plan to run another cohort in NYC this Summer. If you're looking to upskill technically, I'd recommend applying. You can read more here.