Rephrasing Reduces Eval Awareness...

atharva

...or at least SOME forms of it.

Note: This is preliminary work I did at CAMBRIA this Winter.^[1] If it sounds interesting, I'd love to chat. Currently looking for collaboration / mentorship!

Here is this project’s Github.

And here is a google docs version of this write-up! You can leave comments below, or directly on the doc.

Intro

Tl;dr: I rewrote math dataset questions and measured if Sonnet 4.5 could still recognize them as evals. Generally, it couldn’t.

Evaluation Awareness is when a model can tell that its being evaluated. This is an issue, because awareness might change a model’s behaviour during evaluation – which means our eval is no longer measuring what we want it to. This could be due to the model being deceptive (on alignment evals) or sandbagging (on capabilities evals), or otherwise having weird reactions to the fact that its being evaluated.

This is a problem today – frontier models are quite aware they’re being evaluated.

Well, so what can we do to reduce eval awareness? One simple solution – rephrase the prompts so they sound less like evals. Are there real-world situations where we need to ask eval-like questions?

Methods

Datasets

Evaluation data: AIME 2025 (n=30), HMMT 2025 (from February and November, n=30 * 2), Olympmath 2025 (n=200)
Deployment data: ShareGPT (conversational exchanges)
All math datasets were released after Sonnet’s knowledge cutoff

Models

Claude 4.5 Sonnet (primarily). Also Gemini 2.5 Pro, GPT-5
These are near-frontier models, results are relevant today!

Intervention

Replace the standard eval prompt with a rewritten version
Here, I’ve rewritten them in the style of a student asking for help on their homework
- This includes: typos, slang, informal language, plausible-seeming context, etc (example in appendix)
The math (especially the LaTeX) is preserved
- Yes, it is slightly suspicious that you’re conflating a lazy college student’s style with perfectly formatted LaTeX. Yes, Sonnet occasionally acknowledges this as well. I think I can fix this with a couple more system-prompt iterations.
The rephrasing was done with 4.5 Opus. The (quite comprehensive) system prompt for the rephrasing, too, was generated with 4.5 Opus.

Metrics

Measuring eval awareness
- This is the main result!
- I give Sonnet the eval prompt & ask, approximately, “Do you think this is an eval?”. This method is from Needham et al, more details there! An example prompt is in the Appendix.
- Results are quite sensitive to phrasing, so I average this over 5 different phrasings of the same question.
- I used this to measure eval awareness for both eval data (math questions, original or rephrased), and deployment data (ShareGPT conversations).
- This is a black box method, which is convenient!
Measuring math benchmark scores
- This is for control purposes – ideally the model can also answer the rephrased question (ie. we want to ensure rephrasing preserves the eval’s semantic content).
- I’ve currently only run this with AIME / AIME rephrased. Why? It’s moderately expensive and it takes time!
- Methodological issue – I ran preliminary experiments with the standard AIME system prompt / formatting. I forgot to remove it when measuring performance on AIME rephrased. Need to remove and re-run.

Results

Given an evaluation question, does the model think it's an eval?

How likely is a model to classify (a) deployment data as an eval, vs (b) rephrased eval data as an eval

Accuracy difference between original AIME and rephrased AIME

So rephrasing greatly reduces eval awareness! Almost to the point where the rephrased eval and deployment data are indistinguishable from each other. And neither does rephrasing change the semantic content of the eval either – which means we can run it in lieu of the original eval.

Limitations

I’m currently only measuring eval awareness for Sonnet. Should do so with more models.
- Subjectively, Opus 4.5 had steep drops in eval awareness as well. Need to re-run to get the numbers though.
Will this scale as models get smarter?
- Seems hard, especially for alignment evals.
- Also, if eval-rephrasing is adopted, it will leak into pre-training data & models will soon become aware of it. Which seems bad.
I’ve only tested this out for math benchmarks! Unsure to what extent rephrasing would generalize.
This is only for single-turn benchmarks. Would be an interesting challenge to scale it up to multi-turn / agentic evals.

Next steps

Re-run with other models
Re-run with other datasets. See if it generalize beyond math datasets to other types of capabilities evals.
Look into other potential metrics for evaluating situational awareness? Directly asking the model seems like a brittle metric.
Test whether rephrasing mitigates established instances of deception / sandbagging
Develop a petri-like scaffolding that automatically does this adversarial rephrasing + evaluation

Appendix

Example prompt testing eval awareness for a particular AIME question

^{^}
CAMBRIA was wonderful! It's a 3-week version of ARENA organized by CBAI. They plan to run another cohort in NYC this Summer. If you're looking to upskill technically, I'd recommend applying. You can read more here.

LESSWRONG
LW